GraphRAG supports processing and querying documents in multiple languages by leveraging multilingual language models. This guide shows you how to configure GraphRAG for non-English content and mixed-language datasets.

Language model selection

The key to multilingual support is choosing models that support your target languages:

GPT-4 and GPT-4 Turbo

Excellent support for 50+ languages including European, Asian, and Middle Eastern languages

GPT-3.5 Turbo

Good support for major languages, more cost-effective for large-scale processing

Multilingual embeddings

text-embedding-3-small and text-embedding-3-large support 100+ languages

Azure OpenAI

Same language support with additional compliance and regional deployment options

Basic configuration

No special configuration is needed for most languages. The default settings work well:
settings.yaml
completion_models:
  default_completion_model:
    model_provider: openai
    model: gpt-4  # Supports multiple languages out of the box
    api_key: ${GRAPHRAG_API_KEY}

embedding_models:
  default_embedding_model:
    model_provider: openai
    model: text-embedding-3-small  # Multilingual embeddings
    api_key: ${GRAPHRAG_API_KEY}

Language-specific prompt tuning

For better results, customize prompts in your target language:
prompts/entity_extraction_es.txt
-Objetivo-
Dado un documento de texto, identifique todas las entidades y sus relaciones.

-Pasos-
1. Identifique todas las entidades en el texto
2. Para cada entidad, extraiga los atributos relevantes
3. Identifique las relaciones entre entidades
4. Formatee la salida según lo especificado

-Tipos de Entidad-
- PERSONA: Individuos humanos
- ORGANIZACIÓN: Empresas, instituciones
- UBICACIÓN: Lugares físicos o virtuales
- EVENTO: Sucesos significativos

Processing mixed-language documents

When working with documents in multiple languages:
1. Organize by language

Optionally separate documents by language for better tracking:
input/
├── en/
│   ├── document1.txt
│   └── document2.txt
├── es/
│   ├── documento1.txt
│   └── documento2.txt
└── fr/
    ├── document1.txt
    └── document2.txt
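
If documents are laid out in per-language folders like this, the folder name can double as a language tag. The following is a minimal sketch using only the Python standard library; `collect_documents` and the record field names are hypothetical helpers for illustration, not part of GraphRAG.

```python
from pathlib import Path


def collect_documents(root):
    """Return one record per .txt file found under root/<language>/."""
    records = []
    for path in sorted(Path(root).glob("*/*.txt")):
        records.append({
            "language": path.parent.name,  # folder name, e.g. "es"
            "source_file": path.name,
            "text": path.read_text(encoding="utf-8"),
        })
    return records
```

Calling `collect_documents("input")` on the tree above would yield records tagged `en`, `es`, and `fr`, which could then be written to a CSV with a language column.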
2. Use language-agnostic prompts

For truly mixed content, use English prompts with instructions to handle multiple languages:
-Goal-
Given a text document that may contain content in multiple languages,
identify all entities and relationships. Preserve entity names in their
original language and provide descriptions in the source language.

-Instructions-
- Maintain entity names in original language
- Generate descriptions in the same language as the source text
- Identify language of each extracted entity
3. Configure language metadata

Track language information in your chunking configuration:
chunking:
  prepend_metadata: ["language", "source_file"]

Querying in multiple languages

GraphRAG supports queries in different languages:
graphrag query "What are the main themes in this dataset?"
The LLM will attempt to respond in the same language as your query. For mixed-language datasets, you can specify the desired response language in your query.
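
One way to specify the response language is to append an explicit instruction to the query text before passing it to the CLI. The sketch below is an illustration only: both helper functions are hypothetical, and the command layout simply mirrors the `graphrag query "..." --method ...` form used elsewhere in this guide.

```python
def with_response_language(query, language):
    """Append an explicit response-language instruction to a query."""
    return f"{query.rstrip()} Respond in {language}."


def build_query_command(query, method="global"):
    """Build the argument list for the graphrag CLI without running it."""
    return ["graphrag", "query", query, "--method", method]
```

The resulting list can be passed to `subprocess.run` if you want to invoke the CLI from a script.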

Best practices for specific languages

Chinese

  • Use larger chunk sizes due to character density
  • Consider using gpt-4 for better understanding of classical vs. modern Chinese
  • Test entity extraction with both simplified and traditional characters
chunking:
  size: 800  # Increase from default 400
  overlap: 200  # Increase overlap proportionally

Japanese

  • Account for mixed scripts (Hiragana, Katakana, Kanji)
  • Use character-based rather than token-based chunking
  • Test entity extraction with company names (often use Kanji)
chunking:
  type: sentence  # Better for Japanese text
  size: 600
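
To make the idea of character-based chunking for Japanese text concrete, here is a minimal standalone sketch that splits text into fixed-size windows of characters with overlap. It is not GraphRAG's chunker, and the function name and defaults are assumptions chosen for illustration. Python strings are indexed by Unicode code point, so each Kana or Kanji character counts as one unit.

```python
def chunk_by_characters(text, size=600, overlap=100):
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

For example, `chunk_by_characters("abcdefghij", size=4, overlap=2)` yields four windows, each sharing two characters with its neighbor.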

Arabic, Hebrew, and other right-to-left languages

  • Ensure text encoding is UTF-8
  • Be aware that some entity names may be transliterated
  • Test with mixed RTL/LTR content (common in technical documents)
input:
  encoding: utf-8
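
To find test cases with mixed RTL/LTR content, you can classify characters by their Unicode bidirectional class. The sketch below uses only the standard library's `unicodedata` module; the helper names are hypothetical, and it counts only strong directional letters (classes `L`, `R`, and `AL`), ignoring digits, spaces, and punctuation.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left: Hebrew-type and Arabic-type letters
LTR_CLASSES = {"L"}        # strong left-to-right letters


def direction_profile(text):
    """Count strong RTL and strong LTR characters in text."""
    counts = {"rtl": 0, "ltr": 0}
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            counts["rtl"] += 1
        elif bidi in LTR_CLASSES:
            counts["ltr"] += 1
    return counts


def is_mixed_direction(text):
    """Return True when text contains both RTL and LTR letters."""
    counts = direction_profile(text)
    return counts["rtl"] > 0 and counts["ltr"] > 0
```

Documents flagged by `is_mixed_direction` are good candidates for spot-checking entity extraction output.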

German

  • Long compound words may need special handling
  • Consider noun capitalization in entity extraction
  • Adjust chunk sizes for longer words
chunking:
  size: 500  # Slightly larger for compound words

Example: Multilingual research corpus

Here’s a complete example for processing academic papers in multiple languages:
1. Prepare data

Organize papers with language metadata:
input/papers.csv
id,title,text,language,author,year
1,"Machine Learning Basics","Full text...",en,"Smith",2023
2,"Apprentissage Automatique","Texte complet...",fr,"Dubois",2023
3,"機械学習の基礎","全文...",ja,"田中",2023
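
Before indexing, it can help to confirm that the CSV has the expected columns and to see how many papers exist per language. This is a minimal sketch with the standard `csv` module; `summarize_languages` and `REQUIRED_COLUMNS` are hypothetical names, and the required column set is an assumption based on the sample file above.

```python
import csv
from collections import Counter

# Assumption: the columns the rest of this example relies on
REQUIRED_COLUMNS = {"id", "title", "text", "language"}


def summarize_languages(csv_file):
    """Validate the header of an open CSV file and count rows per language."""
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return Counter(row["language"] for row in reader)
```

Open the file with `open("input/papers.csv", encoding="utf-8", newline="")` so that non-ASCII titles and author names are read correctly.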
2. Configure for CSV input

settings.yaml
input:
  type: csv
  file_pattern: .*\.csv$
  id_column: id
  title_column: title
  text_column: text

chunking:
  prepend_metadata: ["language", "author", "year"]
  size: 600
  overlap: 100
3. Create multilingual prompts

prompts/entity_extraction_multilingual.txt
-Goal-
Extract academic entities from research papers in any language.
Preserve technical terms and names in their original language.

-Entity Types-
- RESEARCHER: Authors and cited researchers
- CONCEPT: Scientific concepts and methods
- INSTITUTION: Universities and research organizations
- PUBLICATION: Papers, journals, conferences

-Instructions-
- Keep researcher names in original form
- Preserve technical terminology in source language
- Translate descriptions to match source document language
4. Query across languages

# Query in English
graphrag query "What machine learning concepts are discussed?" --method global

# Query in French
graphrag query "Quels chercheurs sont mentionnés?" --method local

# Query in Japanese  
graphrag query "どの研究機関が関与していますか?" --method local

Language detection and routing

For advanced use cases, implement language detection:
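import pandas as pd
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

# langdetect is non-deterministic by default; seed it for reproducible results
DetectorFactory.seed = 0

def detect_language(text):
    """Return a language code, or 'unknown' when detection fails."""
    try:
        return detect(text)
    except (LangDetectException, TypeError):  # empty, symbol-only, or missing text
        return 'unknown'

# Detect language in documents
df = pd.read_csv('input/documents.csv')
df['detected_language'] = df['text'].apply(detect_language)

# Route to language-specific prompts
language_prompts = {
    'en': 'prompts/entity_extraction_en.txt',
    'es': 'prompts/entity_extraction_es.txt',
    'fr': 'prompts/entity_extraction_fr.txt',
    'ja': 'prompts/entity_extraction_ja.txt',
}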
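
The mapping above still needs a lookup step. One possible approach, sketched below, normalizes the detected code and falls back to a default prompt when no language-specific prompt exists. The `prompt_for` helper and the choice of English as the fallback are assumptions for illustration. Normalization matters because langdetect can return region-tagged codes such as `zh-cn` and `zh-tw`.

```python
LANGUAGE_PROMPTS = {
    "en": "prompts/entity_extraction_en.txt",
    "es": "prompts/entity_extraction_es.txt",
    "fr": "prompts/entity_extraction_fr.txt",
    "ja": "prompts/entity_extraction_ja.txt",
}

# Assumption: use the English prompt when no language-specific prompt exists
DEFAULT_PROMPT = LANGUAGE_PROMPTS["en"]


def prompt_for(language_code):
    """Return the prompt file for a detected language code."""
    base = language_code.lower().split("-")[0]  # "zh-cn" -> "zh"
    return LANGUAGE_PROMPTS.get(base, DEFAULT_PROMPT)
```

With a DataFrame like the one above, `df['detected_language'].map(prompt_for)` would attach a prompt file to each document.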

Troubleshooting

Poor entity extraction in non-English text

Solutions:
  • Use language-specific prompts
  • Run auto prompt tuning with documents in target language
  • Increase chunk size for languages with longer words/characters
  • Verify model supports your target language well

Responses in the wrong language

Solutions:
  • Explicitly specify response language in your query
  • Use language-specific system prompts
  • Separate documents by language during indexing

Character encoding errors

Solutions:
  • Ensure all files are UTF-8 encoded
  • Verify .env file doesn’t have encoding issues
  • Check that storage systems support Unicode
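
To check the first point, you can scan the input folder for files that fail to decode as UTF-8. This is a minimal standard-library sketch; the function name and the default glob pattern are assumptions, so adjust the pattern to your file types.

```python
from pathlib import Path


def find_non_utf8_files(root, pattern="**/*.txt"):
    """Return paths under root whose bytes are not valid UTF-8."""
    bad = []
    for path in sorted(Path(root).glob(pattern)):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad
```

Any file it reports should be re-saved as UTF-8 before indexing.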

Supported languages

OpenAI models (GPT-4, GPT-3.5-turbo, embeddings) have strong support for:
  • English, Spanish, French, German, Italian
  • Portuguese, Dutch, Polish, Russian
  • Swedish, Norwegian, Danish, Finnish
  • Greek, Turkish, Czech, Romanian

Next steps

Custom prompts

Create language-specific prompts

Document Q&A

Build multilingual Q&A systems

Azure deployment

Deploy with regional Azure endpoints

Prompt tuning

Auto-tune prompts for your language