Output formats - GraphRAG

The default pipeline produces a series of output tables that align with the GraphRAG knowledge model. By default, these tables are written as Parquet files to disk.

Embeddings generated by the pipeline are written directly to your configured vector store for efficient downstream retrieval.

Shared fields

All tables have two identifier fields for global uniqueness and human readability:

Field	Type	Description
`id`	str	Generated UUID, ensuring global uniqueness across all records
`human_readable_id`	int	Incremented short ID created per-run. Used in generated summaries with citations for easy visual cross-reference
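
As a rough illustration of how the two identifier fields relate, the sketch below attaches both to a list of records. It is not GraphRAG's own code: the helper name `assign_ids`, the use of `uuid.uuid4()`, and the counter starting at 0 are all assumptions made for the example.

```python
import uuid


def assign_ids(records):
    """Attach a UUID `id` and an incrementing `human_readable_id` to each record.

    Hypothetical helper: the UUID version and the zero-based counter are assumptions.
    """
    return [
        {**record, "id": str(uuid.uuid4()), "human_readable_id": index}
        for index, record in enumerate(records)
    ]


rows = assign_ids([{"title": "Microsoft"}, {"title": "Seattle"}])
for row in rows:
    # The short ID is easy to cite in text; the UUID stays unique across runs.
    print(row["human_readable_id"], row["id"], row["title"])
```

Because the short ID restarts on each run, only `id` is safe to use as a join key across outputs from different runs.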

Communities

This table contains the final communities generated by the Leiden algorithm. Communities are strictly hierarchical, subdividing into children as cluster affinity is narrowed.

Field	Type	Description
`community`	int	Leiden-generated cluster ID for the community. These increment with depth and are unique through all levels of the hierarchy. For this table, `human_readable_id` is a copy of the community ID
`parent`	int	Parent community ID
`children`	int[]	List of child community IDs
`level`	int	Depth of the community in the hierarchy
`title`	str	Friendly name of the community
`entity_ids`	str[]	List of entities that are members of the community
`relationship_ids`	str[]	List of relationships wholly within the community (source and target both in community)
`text_unit_ids`	str[]	List of text units represented within the community
`period`	str	Date of ingest in ISO8601 format, used for incremental update merges
`size`	int	Size of the community (entity count), used for incremental update merges

Example communities.parquet

import pandas as pd

communities = pd.read_parquet("output/communities.parquet")
print(communities.head())

# Sample output:
#   id                    community  parent  children      level  title               entity_ids          relationship_ids
#   abc123-def456-...     0          -1      [1, 2, 3]     0      Community 0         [ent1, ent2, ...]  [rel1, rel2, ...]
#   def456-ghi789-...     1          0       []            1      Community 1         [ent3, ent4, ...]  [rel3, rel4, ...]
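
The `parent` and `children` fields are enough to walk the hierarchy in either direction. The sketch below uses plain dicts as stand-ins for table rows (for example, what `DataFrame.to_dict("records")` would give you). The sample rows and the helper names `ancestors` and `descendants` are made up for illustration, and it assumes root communities carry a `parent` value that is not itself a community ID, as in the `-1` shown in the sample output above.

```python
# Hypothetical rows standing in for communities.parquet
communities = [
    {"community": 0, "parent": -1, "children": [1, 2], "level": 0},
    {"community": 1, "parent": 0, "children": [3], "level": 1},
    {"community": 2, "parent": 0, "children": [], "level": 1},
    {"community": 3, "parent": 1, "children": [], "level": 2},
]

by_id = {row["community"]: row for row in communities}


def ancestors(community_id):
    """Return parent IDs from the given community up to the root."""
    chain = []
    parent = by_id[community_id]["parent"]
    while parent in by_id:  # stops once the parent is not a known community
        chain.append(parent)
        parent = by_id[parent]["parent"]
    return chain


def descendants(community_id):
    """Return all community IDs below the given community, sorted."""
    found = []
    stack = list(by_id[community_id]["children"])
    while stack:
        child = stack.pop()
        found.append(child)
        stack.extend(by_id[child]["children"])
    return sorted(found)


print(ancestors(3))    # [1, 0]
print(descendants(0))  # [1, 2, 3]
```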

Community reports

This table contains the summarized reports for each community, generated by the LLM.

Field	Type	Description
`community`	int	Short ID of the community this report applies to
`parent`	int	Parent community ID
`children`	int[]	List of child community IDs
`level`	int	Level of the community this report applies to
`title`	str	LLM-generated title for the report
`summary`	str	LLM-generated summary of the report
`full_content`	str	LLM-generated full report
`rank`	float	LLM-derived relevance ranking based on member entity salience
`rating_explanation`	str	LLM-derived explanation of the rank
`findings`	dict	LLM-derived list of the top 5-10 insights from the community. Contains `summary` and `explanation` values
`full_content_json`	json	Full JSON output as returned by the LLM. Most fields are extracted into columns, but this JSON is sent for query summarization to allow prompt tuning to add fields/content
`period`	str	Date of ingest in ISO8601 format, used for incremental update merges
`size`	int	Size of the community (entity count), used for incremental update merges

Example community_reports.parquet

import pandas as pd

reports = pd.read_parquet("output/community_reports.parquet")
print(reports[['community', 'title', 'summary']].head())

# Sample output:
#   community  title                           summary
#   0          Global Technology Ecosystem     This community represents major technology companies...
#   1          Social Media Platforms          A focused group of social networking services...
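
A common use of this table is to pick the highest-ranked reports at one level of the hierarchy and read their findings. The sketch below does that over plain dicts standing in for table rows. The sample values, the rank scale, and the helper name `top_reports` are invented for the example, and it assumes `findings` loads as a list of dicts with `summary` and `explanation` keys.

```python
# Hypothetical rows standing in for community_reports.parquet
reports = [
    {"community": 0, "level": 0, "title": "Report A", "rank": 8.5,
     "findings": [{"summary": "Finding one", "explanation": "Why it matters."}]},
    {"community": 1, "level": 1, "title": "Report B", "rank": 6.0,
     "findings": [{"summary": "Finding two", "explanation": "More detail."}]},
    {"community": 2, "level": 1, "title": "Report C", "rank": 9.0,
     "findings": [{"summary": "Finding three", "explanation": "Extra context."}]},
]


def top_reports(rows, level, limit=5):
    """Return up to `limit` reports at the given level, highest rank first."""
    at_level = [row for row in rows if row["level"] == level]
    return sorted(at_level, key=lambda row: row["rank"], reverse=True)[:limit]


best = top_reports(reports, level=1)
for report in best:
    print(report["rank"], report["title"])
    for finding in report["findings"]:
        print("  -", finding["summary"])
```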

Covariates

This optional table is generated when claim extraction is enabled. Claims typically identify malicious behavior such as fraud, so they are not useful for all datasets.

Claim extraction is off by default and requires configuration to enable.

Field	Type	Description
`covariate_type`	str	Always “claim” with default covariates
`type`	str	Nature of the claim type
`description`	str	LLM-generated description of the behavior
`subject_id`	str	Name of the source entity (performing the claimed behavior)
`object_id`	str	Name of the target entity (behavior is performed on)
`status`	str	LLM-derived assessment of correctness. One of: `TRUE`, `FALSE`, `SUSPECTED`
`start_date`	str	LLM-derived start of the claimed activity (ISO8601)
`end_date`	str	LLM-derived end of the claimed activity (ISO8601)
`source_text`	str	Short string of text containing the claimed behavior
`text_unit_id`	str	ID of the text unit the claim was extracted from

Example covariates.parquet

import pandas as pd

covariates = pd.read_parquet("output/covariates.parquet")
print(covariates[['subject_id', 'type', 'status', 'description']].head())

# Sample output:
#   subject_id    type           status      description
#   Company A     ACQUISITION    TRUE        Company A acquired Company B for $10B
#   Person X      FRAUD          SUSPECTED   Person X allegedly misused funds
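
Because `status` records the LLM's assessment of each claim, it is a natural filter before any further analysis. The sketch below groups claim types by subject while dropping claims assessed as `FALSE`. The rows are plain dicts with made-up values, and `claims_by_subject` is a hypothetical helper, not part of GraphRAG.

```python
from collections import defaultdict

# Hypothetical rows standing in for covariates.parquet
covariates = [
    {"subject_id": "Company A", "object_id": "Company B",
     "type": "ACQUISITION", "status": "TRUE"},
    {"subject_id": "Person X", "object_id": "Company A",
     "type": "FRAUD", "status": "SUSPECTED"},
    {"subject_id": "Person X", "object_id": "Company B",
     "type": "FRAUD", "status": "FALSE"},
]


def claims_by_subject(rows, statuses=("TRUE", "SUSPECTED")):
    """Group claim types by subject, keeping only the given statuses."""
    grouped = defaultdict(list)
    for row in rows:
        if row["status"] in statuses:
            grouped[row["subject_id"]].append(row["type"])
    return dict(grouped)


kept = claims_by_subject(covariates)
print(kept)
```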

Documents

This table contains the list of document content after import.

Field	Type	Description
`title`	str	Filename, unless otherwise configured during CSV/JSON import
`text`	str	Full text of the document
`text_unit_ids`	str[]	List of text units (chunks) that were parsed from the document
`metadata`	dict	If specified during CSV/JSON import, this is a dict of metadata for the document

Example documents.parquet

import pandas as pd

documents = pd.read_parquet("output/documents.parquet")
print(documents[['title', 'text_unit_ids']].head())

# Sample output:
#   title               text_unit_ids
#   article1.txt        [unit1, unit2, unit3]
#   article2.txt        [unit4, unit5]
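
The `text_unit_ids` list links each document to its chunks in the text units table. As a minimal sketch of following that link, the example below sums chunk token counts per document. The rows are plain dicts with invented IDs and counts, and `tokens_per_document` is a hypothetical helper.

```python
# Hypothetical rows standing in for documents.parquet and text_units.parquet
documents = [
    {"id": "doc-1", "title": "article1.txt", "text_unit_ids": ["unit1", "unit2"]},
    {"id": "doc-2", "title": "article2.txt", "text_unit_ids": ["unit3"]},
]
text_units = [
    {"id": "unit1", "document_id": "doc-1", "n_tokens": 1200},
    {"id": "unit2", "document_id": "doc-1", "n_tokens": 850},
    {"id": "unit3", "document_id": "doc-2", "n_tokens": 400},
]

units_by_id = {unit["id"]: unit for unit in text_units}


def tokens_per_document(docs):
    """Sum the token counts of every chunk listed in each document's text_unit_ids."""
    return {
        doc["title"]: sum(units_by_id[uid]["n_tokens"] for uid in doc["text_unit_ids"])
        for doc in docs
    }


totals = tokens_per_document(documents)
print(totals)
```

If chunks overlap in your configuration, these totals count the overlapping tokens more than once.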

Entities

This table contains all entities found in the data by the LLM.

Field	Type	Description
`title`	str	Name of the entity
`type`	str	Type of the entity. By default: “organization”, “person”, “geo”, or “event” (unless configured differently or auto-tuning is used)
`description`	str	Textual description of the entity. Since entities may be found in many text units, this is an LLM-derived summary of all descriptions
`text_unit_ids`	str[]	List of the text units containing the entity
`frequency`	int	Count of text units the entity was found within
`degree`	int	Node degree (connectedness) in the graph

Example entities.parquet

import pandas as pd

entities = pd.read_parquet("output/entities.parquet")
print(entities[['title', 'type', 'description', 'degree']].head())

# Sample output:
#   title              type          description                                      degree
#   Microsoft          organization  A multinational technology corporation...        42
#   Satya Nadella      person        CEO of Microsoft Corporation...                  18
#   Seattle            geo           City in Washington state, headquarters...        15
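
The `type` and `degree` fields combine well for a quick overview of which entity dominates each category. The sketch below works on plain dicts standing in for table rows. The sample values and the helper name `most_connected_by_type` are made up for illustration.

```python
# Hypothetical rows standing in for entities.parquet
entities = [
    {"title": "Microsoft", "type": "organization", "degree": 42},
    {"title": "OpenAI", "type": "organization", "degree": 20},
    {"title": "Satya Nadella", "type": "person", "degree": 18},
    {"title": "Seattle", "type": "geo", "degree": 15},
]


def most_connected_by_type(rows):
    """Return the title of the highest-degree entity for each entity type."""
    best = {}
    for row in rows:
        current = best.get(row["type"])
        if current is None or row["degree"] > current["degree"]:
            best[row["type"]] = row
    return {entity_type: row["title"] for entity_type, row in best.items()}


leaders = most_connected_by_type(entities)
print(leaders)
```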

Relationships

This table contains all entity-to-entity relationships found in the data by the LLM. This is also the edge list for the graph.

Field	Type	Description
`source`	str	Name of the source entity
`target`	str	Name of the target entity
`description`	str	LLM-derived description of the relationship. Like entity descriptions, this is summarized from multiple instances
`weight`	float	Weight of the edge in the graph. Summed from an LLM-derived “strength” measure for each relationship instance
`combined_degree`	int	Sum of source and target node degrees
`text_unit_ids`	str[]	List of text units the relationship was found within

Example relationships.parquet

import pandas as pd

relationships = pd.read_parquet("output/relationships.parquet")
print(relationships[['source', 'target', 'description', 'weight']].head())

# Sample output:
#   source          target           description                          weight
#   Microsoft       Azure            Microsoft develops and operates...   0.95
#   Satya Nadella   Microsoft        Satya Nadella serves as CEO of...   0.98
#   Microsoft       OpenAI           Microsoft has invested in and...    0.87
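
The `weight` and `combined_degree` definitions can be worked through on a toy edge list. The sketch below is not the pipeline's implementation: the per-instance tuples and strength values are invented, edges are keyed by the ordered `(source, target)` pair, and degree is counted as the number of distinct edges touching a node.

```python
from collections import Counter, defaultdict

# Hypothetical per-instance extractions: (source, target, strength)
instances = [
    ("Microsoft", "Azure", 9.0),
    ("Microsoft", "Azure", 8.0),
    ("Satya Nadella", "Microsoft", 10.0),
    ("Microsoft", "OpenAI", 7.0),
]

# weight: sum the strength of every instance of the same edge
weights = defaultdict(float)
for source, target, strength in instances:
    weights[(source, target)] += strength

# degree: count the distinct edges touching each node
degree = Counter()
for source, target in weights:
    degree[source] += 1
    degree[target] += 1

# combined_degree: source degree plus target degree
relationships = [
    {"source": s, "target": t, "weight": w, "combined_degree": degree[s] + degree[t]}
    for (s, t), w in weights.items()
]

for rel in relationships:
    print(rel)
```

Here "Microsoft" touches three edges and every other node touches one, so each edge ends up with a combined degree of 4.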

Text units

This table contains all text chunks parsed from the input documents.

Field	Type	Description
`text`	str	Raw full text of the chunk
`n_tokens`	int	Number of tokens in the chunk. Should normally match the `chunk_size` config parameter, except for the last chunk which is often shorter
`document_id`	str	ID of the document the chunk came from
`entity_ids`	str[]	List of entities found in the text unit
`relationship_ids`	str[]	List of relationships found in the text unit
`covariate_ids`	str[]	Optional list of covariates found in the text unit

Example text_units.parquet

import pandas as pd

text_units = pd.read_parquet("output/text_units.parquet")
print(text_units[['text', 'n_tokens', 'entity_ids']].head())

# Sample output:
#   text                                          n_tokens  entity_ids
#   Microsoft Corporation is a technology...      1200      [Microsoft, Bill Gates, ...]
#   The company was founded in 1975...            1200      [Microsoft, Paul Allen, ...]
#   Azure is Microsoft's cloud computing...       850       [Azure, Microsoft, ...]
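
The relationship between `chunk_size` and `n_tokens` can be seen in a simplified chunker. The sketch below splits on whitespace and uses no overlap, both of which are assumptions made to keep the example short; the actual pipeline uses a model tokenizer and its own chunking settings. The helper name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most `chunk_size` tokens."""
    return [
        tokens[start:start + chunk_size]
        for start in range(0, len(tokens), chunk_size)
    ]


# Whitespace splitting stands in for a real tokenizer here.
tokens = "Microsoft Corporation is a technology company founded in 1975".split()

text_units = [
    {"text": " ".join(chunk), "n_tokens": len(chunk)}
    for chunk in chunk_tokens(tokens, chunk_size=4)
]

for unit in text_units:
    print(unit["n_tokens"], unit["text"])
```

Every chunk except the last is exactly `chunk_size` tokens long, which is why a shorter final `n_tokens` value is expected rather than a sign of a problem.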

Working with Parquet files

Load tables with pandas, then filter them:

import pandas as pd

# Read individual tables
entities = pd.read_parquet("output/entities.parquet")
relationships = pd.read_parquet("output/relationships.parquet")

# Filter and analyze
high_degree_entities = entities[entities['degree'] > 10]
print(f"Found {len(high_degree_entities)} highly connected entities")

Build a NetworkX graph from the entities and relationships tables:

import pandas as pd
import networkx as nx

# Load entities and relationships
entities = pd.read_parquet("output/entities.parquet")
relationships = pd.read_parquet("output/relationships.parquet")

# Create graph
G = nx.Graph()

# Add nodes
for _, entity in entities.iterrows():
    G.add_node(
        entity['title'],
        type=entity['type'],
        description=entity['description']
    )

# Add edges
for _, rel in relationships.iterrows():
    G.add_edge(
        rel['source'],
        rel['target'],
        weight=rel['weight'],
        description=rel['description']
    )

print(f"Graph has {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")

Query the Parquet files directly with SQL using DuckDB:

import duckdb

# Query Parquet files directly with SQL
conn = duckdb.connect()

# Find all technology companies
query = """
    SELECT title, description, degree
    FROM 'output/entities.parquet'
    WHERE type = 'organization'
    AND description LIKE '%technology%'
    ORDER BY degree DESC
    LIMIT 10
"""

results = conn.execute(query).fetchdf()
print(results)
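
Tables can also be joined on their ID lists. As a minimal sketch, the example below resolves each community's `entity_ids` to entity titles. It uses plain dicts in place of loaded rows, the IDs are invented, and it assumes `entity_ids` holds values from the entities table's `id` column. The helper name `member_titles` is hypothetical.

```python
# Hypothetical rows standing in for entities.parquet and communities.parquet
entities = [
    {"id": "e-1", "title": "Microsoft"},
    {"id": "e-2", "title": "Azure"},
    {"id": "e-3", "title": "Seattle"},
]
communities = [
    {"community": 0, "title": "Community 0", "entity_ids": ["e-1", "e-2", "e-3"]},
    {"community": 1, "title": "Community 1", "entity_ids": ["e-1", "e-2"]},
]

title_by_id = {entity["id"]: entity["title"] for entity in entities}


def member_titles(community):
    """Resolve a community's entity IDs to titles, skipping unknown IDs."""
    return [
        title_by_id[entity_id]
        for entity_id in community["entity_ids"]
        if entity_id in title_by_id
    ]


members = {c["community"]: member_titles(c) for c in communities}
print(members)
```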

Storage locations

By default, Parquet files are written to the output directory specified in your configuration:

settings.yaml

storage:
  type: file
  base_dir: "output"

Local filesystem

storage:
  type: file
  base_dir: "output"

Files are written to:

output/entities.parquet
output/relationships.parquet
output/communities.parquet
etc.

Azure Blob Storage

storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: "graphrag-output"

Files are written to the configured container.

Custom storage

Implement your own storage provider using the factory pattern:

from graphrag.storage.factory import StorageFactory

# MyS3Storage is a placeholder for your own storage class (not defined here)
StorageFactory.register("s3", MyS3Storage)
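
The factory pattern mentioned above can be illustrated independently of GraphRAG. The sketch below is a toy registry, not the library's `StorageFactory`: the class names `StorageRegistry` and `InMemoryStorage`, the `create` method, and the `get`/`set` interface are all assumptions made for the example. Check the GraphRAG source for the interface a real provider must implement.

```python
class StorageRegistry:
    """Toy stand-in for a storage factory; not GraphRAG's implementation."""

    _creators = {}

    @classmethod
    def register(cls, storage_type, creator):
        """Map a storage type name to a callable that builds the provider."""
        cls._creators[storage_type] = creator

    @classmethod
    def create(cls, storage_type, **kwargs):
        """Build a provider for the given type, or fail if it was never registered."""
        if storage_type not in cls._creators:
            raise ValueError(f"Unknown storage type: {storage_type}")
        return cls._creators[storage_type](**kwargs)


class InMemoryStorage:
    """Hypothetical provider that keeps files in a dict instead of on disk."""

    def __init__(self, base_dir="output"):
        self.base_dir = base_dir
        self._files = {}

    def set(self, key, value):
        self._files[f"{self.base_dir}/{key}"] = value

    def get(self, key):
        return self._files.get(f"{self.base_dir}/{key}")


StorageRegistry.register("memory", InMemoryStorage)
storage = StorageRegistry.create("memory", base_dir="output")
storage.set("entities.parquet", b"placeholder bytes")
print(storage.get("entities.parquet"))
```

Registering by name keeps the configuration file decoupled from the implementing class: the `type` value selects a creator at runtime.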

Next steps

Custom graphs

Learn how to bring your own existing graph data

Querying

Use the output tables for GraphRAG queries

Configuration

Configure storage providers and output settings
