GraphRAG excels at processing large collections of research documents to extract entities, relationships, and insights. This guide demonstrates how to use GraphRAG for academic and scientific research analysis.
Gather your research documents in a structured format:
input/papers.csv
id,title,abstract,authors,year,venue,citations,keywords
1,"Attention Is All You Need","The dominant...","Vaswani et al.",2017,"NeurIPS",45000,"transformers;attention;neural networks"
2,"BERT: Pre-training of Deep...","We introduce...","Devlin et al.",2018,"NAACL",35000,"language models;BERT;NLP"
3,"Language Models are Few-Shot...","Recent work has...","Brown et al.",2020,"NeurIPS",12000,"GPT-3;language models;few-shot"
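Before indexing, it can help to confirm that the CSV has the expected columns and that the semicolon-separated keywords parse cleanly. The following is a minimal sketch using only the standard library; `load_papers` and `REQUIRED_COLUMNS` are hypothetical helper names for illustration and are not part of GraphRAG.

```python
import csv
import io

# Column names taken from the sample header above.
REQUIRED_COLUMNS = {
    "id", "title", "abstract", "authors",
    "year", "venue", "citations", "keywords",
}


def load_papers(csv_file):
    """Read paper rows, check the header, and split the keyword field."""
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")

    papers = []
    for row in reader:
        row["year"] = int(row["year"])
        row["citations"] = int(row["citations"])
        # Keywords are semicolon-separated inside a single CSV field.
        row["keywords"] = [k.strip() for k in row["keywords"].split(";") if k.strip()]
        papers.append(row)
    return papers


# In-memory sample mirroring the layout above. For a real corpus, pass
# open("input/papers.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO(
    "id,title,abstract,authors,year,venue,citations,keywords\n"
    '1,"Attention Is All You Need","The dominant...","Vaswani et al.",'
    '2017,"NeurIPS",45000,"transformers;attention;neural networks"\n'
)
papers = load_papers(sample)
print(papers[0]["title"], papers[0]["keywords"])
```

Failing early on a missing column is usually cheaper than discovering it after an indexing run has started.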
2. Create custom prompts
Define research-specific entities and relationships:
prompts/research_entity_extraction.txt
-Goal-
Extract scientific entities and relationships from research papers.

-Entity Types-
- RESEARCHER: Authors and cited researchers (e.g., "Vaswani", "Devlin")
- INSTITUTION: Universities, research labs, companies (e.g., "Google", "MIT")
- CONCEPT: Scientific concepts, theories, methods (e.g., "attention mechanism", "transformers")
- MODEL: Specific models or systems (e.g., "BERT", "GPT-3", "ResNet")
- DATASET: Training or evaluation datasets (e.g., "ImageNet", "GLUE")
- METRIC: Performance metrics (e.g., "accuracy", "BLEU score")
- TASK: Research tasks or problems (e.g., "machine translation", "image classification")

-Relationship Types-
- AUTHORED: Researcher authored paper
- AFFILIATED_WITH: Researcher at institution
- INTRODUCES: Paper introduces concept/model
- USES: Paper uses method/dataset
- IMPROVES_ON: Model improves on previous model
- EVALUATED_ON: Model evaluated on dataset/task
- CITES: Paper cites other work
- APPLIES_TO: Concept applies to task

-Instructions-
1. Identify all entities in the abstract and paper text
2. Preserve exact names for researchers, models, and datasets
3. Extract key concepts even if not explicitly named
4. Link researchers to their institutions
5. Connect models to the concepts they use and tasks they address
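The entity types named in the prompt also appear as the `entity_types` list in settings.yaml, so the two can drift apart when one is edited. The sketch below shows one way to check them against each other. It assumes the prompt keeps the `-Entity Types-` and `-Relationship Types-` section headers; `entity_types_in_prompt` is a hypothetical helper, and the prompt text is abbreviated here for brevity.

```python
import re

# Abbreviated copy of the prompt; in practice, read the text from
# prompts/research_entity_extraction.txt.
PROMPT_EXCERPT = """\
-Entity Types-
- RESEARCHER: Authors and cited researchers
- INSTITUTION: Universities, research labs, companies
- CONCEPT: Scientific concepts, theories, methods
- MODEL: Specific models or systems
- DATASET: Training or evaluation datasets
- METRIC: Performance metrics
- TASK: Research tasks or problems
-Relationship Types-
- AUTHORED: Researcher authored paper
"""

# The list configured under entity_types in settings.yaml.
CONFIGURED_TYPES = [
    "RESEARCHER", "INSTITUTION", "CONCEPT", "MODEL",
    "DATASET", "METRIC", "TASK",
]


def entity_types_in_prompt(prompt_text):
    """Return the type names listed between the two section headers."""
    after_header = prompt_text.split("-Entity Types-", 1)[1]
    section = after_header.split("-Relationship Types-", 1)[0]
    return re.findall(r"^- ([A-Z_]+):", section, flags=re.MULTILINE)


prompt_types = entity_types_in_prompt(PROMPT_EXCERPT)
# Symmetric difference: types present in one place but not the other.
mismatch = set(prompt_types) ^ set(CONFIGURED_TYPES)
print("types in sync" if not mismatch else f"out of sync: {sorted(mismatch)}")
```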
3. Configure GraphRAG
Update settings.yaml for the research corpus:
settings.yaml
input:
  type: csv
  file_pattern: .*\.csv$
  id_column: id
  title_column: title
  text_column: abstract

chunking:
  size: 600  # Larger chunks for academic text
  overlap: 100
  prepend_metadata: ["authors", "year", "venue", "keywords"]

entity_extraction:
  prompt: prompts/research_entity_extraction.txt
  entity_types: [RESEARCHER, INSTITUTION, CONCEPT, MODEL, DATASET, METRIC, TASK]

community_reports:
  prompt: prompts/research_community_report.txt
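To make the `size`, `overlap`, and `prepend_metadata` settings concrete, here is a rough sketch of sliding-window chunking. It is an illustration under stated assumptions, not GraphRAG's implementation: the tokens are stand-in strings rather than real tokenizer output, and the `field: value` header format in `prepend_metadata` is hypothetical.

```python
def chunk_tokens(tokens, size=600, overlap=100):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


def prepend_metadata(chunk_text, row, fields):
    """Put selected metadata fields in front of a chunk (assumed format)."""
    header = "\n".join(f"{field}: {row[field]}" for field in fields)
    return f"{header}\n{chunk_text}"


tokens = [f"t{i}" for i in range(1300)]  # stand-in for a tokenized document
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])

row = {"authors": "Vaswani et al.", "year": 2017}
print(prepend_metadata("The dominant...", row, ["authors", "year"]))
```

With these values each window advances by 500 tokens, so consecutive chunks share 100 tokens of context. Prepending metadata keeps author and year information visible to the extraction prompt even for chunks taken from the middle of a document.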
# Trace research lineage
result = await drift_search.search(
    "How did the transformer architecture influence modern language models like GPT and BERT?"
)

# Cross-domain connections
result = await drift_search.search(
    "How have computer vision techniques influenced natural language processing?"
)

# Collaboration networks
result = await drift_search.search(
    "How are researchers at Google and OpenAI connected through co-authors and citations?"
)
import pandas as pd
import matplotlib.pyplot as plt

# Load entities with temporal data
entities = pd.read_parquet('./output/entities.parquet')

# Filter for CONCEPT entities
concepts = entities[entities['type'] == 'CONCEPT']

# Analyze concept emergence by year
# (assuming year is in entity metadata)
concept_timeline = concepts.groupby(['name', 'year']).size().unstack(fill_value=0)

# Plot concept trends
concept_timeline.T.plot(figsize=(12, 6))
plt.title('Emergence of Research Concepts Over Time')
plt.xlabel('Year')
plt.ylabel('Mentions')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
# Generate comprehensive literature review
graphrag query \
  "Provide a structured literature review of transformer-based language models, \
  including key papers, methodological evolution, and current state of the art" \
  --method global
# Identify underexplored connections
result = await drift_search.search(
    "What concepts are frequently mentioned together but lack direct research connecting them?"
)
print(result.response)
{
  "entities": [
    {
      "name": "BERT",
      "type": "MODEL",
      "description": "Bidirectional Encoder Representations from Transformers, a pre-trained language model"
    },
    {
      "name": "Jacob Devlin",
      "type": "RESEARCHER",
      "description": "Researcher at Google AI Language, lead author of BERT paper"
    },
    {
      "name": "masked language modeling",
      "type": "CONCEPT",
      "description": "Training objective that masks tokens and predicts them from context"
    }
  ],
  "relationships": [
    {
      "source": "Jacob Devlin",
      "target": "BERT",
      "description": "authored and introduced"
    },
    {
      "source": "BERT",
      "target": "masked language modeling",
      "description": "uses as primary training objective"
    }
  ]
}
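Output shaped like the example above can be sanity-checked before further analysis. The sketch below assumes the JSON layout shown (an `entities` list and a `relationships` list keyed by entity name), which may not match the files a given GraphRAG version writes. It flags relationships whose endpoints are not declared entities and builds a simple adjacency map; the entity descriptions are omitted for brevity.

```python
import json
from collections import defaultdict

extraction = json.loads("""
{
  "entities": [
    {"name": "BERT", "type": "MODEL"},
    {"name": "Jacob Devlin", "type": "RESEARCHER"},
    {"name": "masked language modeling", "type": "CONCEPT"}
  ],
  "relationships": [
    {"source": "Jacob Devlin", "target": "BERT",
     "description": "authored and introduced"},
    {"source": "BERT", "target": "masked language modeling",
     "description": "uses as primary training objective"}
  ]
}
""")

known = {entity["name"] for entity in extraction["entities"]}

# Relationships that point at a name missing from the entity list.
dangling = [
    rel for rel in extraction["relationships"]
    if rel["source"] not in known or rel["target"] not in known
]

# Outgoing edges per entity: source -> [(target, description), ...]
neighbors = defaultdict(list)
for rel in extraction["relationships"]:
    neighbors[rel["source"]].append((rel["target"], rel["description"]))

print("dangling relationships:", dangling)
print("BERT ->", neighbors["BERT"])
```

A non-empty `dangling` list usually means the prompt let the model name an entity in a relationship without declaring it, which is worth catching before building visualizations on top of the graph.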