GraphRAG supports several input formats to simplify ingesting your data. This page discusses the mechanics and features available for input files and text chunking.
Input loading and schema
All input formats are loaded within GraphRAG and passed to the indexing pipeline as a documents DataFrame. This DataFrame has a row for each document using a shared column schema:
| Column | Type | Description |
|---|---|---|
| id | str | ID of the document. Generated using a hash of the text content to ensure stability across runs. |
| text | str | The full text of the document. |
| title | str | Name of the document. Some formats allow this to be configured. |
| creation_date | str | The creation date of the document, represented as an ISO 8601 string. Harvested from the source file system. |
| metadata | dict | Optional additional document metadata. |
See the outputs documentation for the final documents table schema saved to Parquet after pipeline completion.
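A minimal sketch of what one row in this schema could look like, using only the Python standard library. The helper name `make_document`, the choice of SHA-256, and the use of the current time for `creation_date` are assumptions made for illustration; the source only states that the ID is a hash of the text content and that the date is harvested from the file system.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def make_document(text: str, title: str, metadata: Optional[dict] = None) -> dict:
    """Build one row with the five columns listed in the schema table."""
    # Assumption: SHA-256 of the text. Any stable hash of the content gives
    # the same property: identical text yields an identical ID across runs.
    doc_id = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {
        "id": doc_id,
        "text": text,
        "title": title,
        # Placeholder: the current UTC time as an ISO 8601 string.
        "creation_date": datetime.now(timezone.utc).isoformat(),
        "metadata": metadata or {},
    }


first = make_document("GraphRAG builds a knowledge graph.", "article.txt")
second = make_document("GraphRAG builds a knowledge graph.", "article.txt")
print(first["id"] == second["id"])  # same text, same ID
```

Because the ID depends only on the text, re-running the sketch on unchanged content produces the same identifier, which is the stability property the table describes.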
Bring your own DataFrame
GraphRAG’s indexing API allows you to pass in your own pandas DataFrame and bypass all input loading/parsing.
Custom file handling
GraphRAG uses an injectable InputReader provider class. You can implement any input file handling you want in a class that extends InputReader and register it with the InputReaderFactory.
See the architecture page for more info on the standard provider pattern.
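The general shape of a provider-and-factory arrangement can be sketched in plain Python. This is not GraphRAG's API: `SketchReader`, `PipeDelimitedReader`, and `SketchReaderFactory` are hypothetical stand-ins, and the real InputReader and InputReaderFactory signatures should be taken from the GraphRAG source.

```python
from typing import Callable, Dict, List


class SketchReader:
    """Hypothetical base class standing in for an input reader."""

    def read(self, raw: str) -> List[dict]:
        raise NotImplementedError


class PipeDelimitedReader(SketchReader):
    """Hypothetical custom reader: one document per line, as 'title|text'."""

    def read(self, raw: str) -> List[dict]:
        docs = []
        for line in raw.splitlines():
            if not line.strip():
                continue  # skip blank lines
            title, _, text = line.partition("|")
            docs.append({"title": title.strip(), "text": text.strip()})
        return docs


class SketchReaderFactory:
    """Hypothetical registry mapping a format name to a reader constructor."""

    _registry: Dict[str, Callable[[], SketchReader]] = {}

    @classmethod
    def register(cls, name: str, creator: Callable[[], SketchReader]) -> None:
        cls._registry[name] = creator

    @classmethod
    def create(cls, name: str) -> SketchReader:
        if name not in cls._registry:
            raise ValueError(f"No reader registered for {name!r}")
        return cls._registry[name]()


# Register the custom reader under a name, then look it up by that name.
SketchReaderFactory.register("pipe", PipeDelimitedReader)
reader = SketchReaderFactory.create("pipe")
docs = reader.read("Intro|GraphRAG overview\nSetup|Install and configure")
```

The point of the pattern is that the pipeline only asks the factory for a reader by name, so new formats can be added without changing the code that consumes the documents.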
Supported formats
GraphRAG supports three file formats out of the box, covering the overwhelming majority of use cases.
- Plain text
- CSV
- JSON
Plain text files (typically ending in a .txt file extension).
- The entire file contents become the text field
- The title is always the filename
- Simplest format for getting started
Configuration:
article.txt
settings.yaml
Metadata
With structured file formats (CSV and JSON), you can configure any number of columns to be added to a persisted metadata field in the DataFrame.
Configuration
settings.yaml
The metadata column will contain a dict with a key for each configured column, holding that column's value for the document.
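As a rough illustration of how configured columns could end up in the metadata dict, the sketch below reads a small CSV with the standard library. The CSV content, the column names, and the helper `rows_to_documents` are hypothetical and do not reflect GraphRAG's internal loader.

```python
import csv
import io

# Hypothetical CSV content and column choices, for illustration only.
CSV_TEXT = """title,author,text
Release notes,Ada,Version 2 adds streaming.
Roadmap,Grace,Next quarter focuses on speed.
"""
METADATA_COLUMNS = ["title", "author"]


def rows_to_documents(csv_text, text_column, metadata_columns):
    """Turn each CSV row into a document with a text field and a metadata dict."""
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        documents.append({
            "text": row[text_column],
            # One key per configured column, with that row's value.
            "metadata": {column: row[column] for column in metadata_columns},
        })
    return documents


documents = rows_to_documents(CSV_TEXT, "text", METADATA_COLUMNS)
for document in documents:
    print(document["metadata"])
```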
Example
- Input
- Output
software.csv:
settings.yaml:
Chunking and metadata
As described on the dataflow page, documents are chunked into smaller “text units” for processing because document content size often exceeds the available context window for language models.
Chunking configuration
settings.yaml
Metadata prepending
Imagine indexing a collection of news articles where each article starts with a headline and author. When documents are chunked, they are split evenly according to your configured chunk size. When you later retrieve those chunks for summarization, they may be missing shared information about the source document.
Solution: prepend metadata
You can configure the chunker to copy metadata into each text chunk:
Configure metadata columns
Specify which columns to include as metadata during document import.
settings.yaml
The metadata is then added as key: value pairs on new lines at the beginning of each chunk.
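The prepending behavior can be sketched as a small function that builds a header of key: value lines and places it in front of every chunk. The function name, the sample chunks, and the sample metadata are hypothetical; GraphRAG's own chunker may format or order the header differently.

```python
def prepend_metadata_to_chunks(chunks, metadata):
    """Return new chunks, each starting with one 'key: value' line per metadata entry."""
    if not metadata:
        return list(chunks)  # nothing to prepend
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return [f"{header}\n{chunk}" for chunk in chunks]


# Hypothetical chunks and metadata for a news article.
chunks = ["The policy changes next month.", "Officials expect a gradual rollout."]
metadata = {"title": "Policy update", "author": "J. Doe"}

for chunk in prepend_metadata_to_chunks(chunks, metadata):
    print(chunk)
    print("---")
```

Every chunk now carries the headline and author, so a chunk retrieved on its own still identifies the document it came from.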
Chunking examples
- Text files with metadata
- JSON with overlap
Input files:
US to lift most federal COVID-19 vaccine mandates.txt
Configuration:
settings.yaml
Result chunks:
The title (filename) is prepended to each chunk but not included in the computed chunk size.
Best practices
Choosing chunk size
- 1200 tokens (default): Good balance for most use cases
- 300-600 tokens: Better for precise entity extraction
- 50-100 tokens: Recommended for FastGraphRAG
- Consider your model’s context window
Using metadata
- Include metadata that provides context across all chunks
- Use prepend_metadata when chunks need document-level context
- Common metadata: title, author, date, category, source
Choosing overlap
- 0 tokens: Fastest processing, no redundancy
- 50-100 tokens: Better context preservation
- 10-20%: Good rule of thumb (e.g., 100 tokens for 1000 token chunks)
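To see how chunk size and overlap interact, here is a minimal sliding-window sketch. It operates on a pre-tokenized list and is an illustration only; `chunk_tokens` is a hypothetical helper, and GraphRAG's chunker counts tokens with a model tokenizer rather than the integer stand-ins used here.

```python
def chunk_tokens(tokens, size, overlap):
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks


# Stand-in for a 2800-token document: 1000-token chunks with 100 tokens of overlap.
document_tokens = list(range(2800))
windows = chunk_tokens(document_tokens, size=1000, overlap=100)
print(len(windows))                     # 3
print([w[0] for w in windows])          # [0, 900, 1800]
```

With a 10% overlap the window advances 900 tokens at a time, so the last 100 tokens of each chunk reappear at the start of the next one.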
File format selection
- Text: Simplest, best for unstructured content
- CSV: Best for structured data with metadata
- JSON: Best for complex nested metadata
- Use custom DataFrame for unsupported formats
Next steps
- Outputs: Learn about the Parquet output formats
- Data flow: See how inputs flow through the pipeline
- Configuration: Configure all indexing parameters