Multi-lingual support

GraphRAG supports processing and querying documents in multiple languages by leveraging multilingual language models. This guide shows you how to configure GraphRAG for non-English content and mixed-language datasets.

Language model selection

The key to multi-lingual support is choosing models that support your target languages:

GPT-4 and GPT-4 Turbo

Excellent support for 50+ languages including European, Asian, and Middle Eastern languages

GPT-3.5 Turbo

Good support for major languages, more cost-effective for large-scale processing

Multilingual embeddings

text-embedding-3-small and text-embedding-3-large support 100+ languages

Azure OpenAI

Same language support with additional compliance and regional deployment options

Basic configuration

No special configuration is needed for most languages. The default settings work well:

settings.yaml

completion_models:
  default_completion_model:
    model_provider: openai
    model: gpt-4  # Supports multiple languages out of the box
    api_key: ${GRAPHRAG_API_KEY}

embedding_models:
  default_embedding_model:
    model_provider: openai
    model: text-embedding-3-small  # Multilingual embeddings
    api_key: ${GRAPHRAG_API_KEY}

Language-specific prompt tuning

For better results, customize prompts in your target language. The examples below cover Spanish, French, Japanese, and Chinese:

prompts/entity_extraction_es.txt

-Objetivo-
Dado un documento de texto, identifique todas las entidades y sus relaciones.

-Pasos-
1. Identifique todas las entidades en el texto
2. Para cada entidad, extraiga los atributos relevantes
3. Identifique las relaciones entre entidades
4. Formatee la salida según lo especificado

-Tipos de Entidad-
- PERSONA: Individuos humanos
- ORGANIZACIÓN: Empresas, instituciones
- UBICACIÓN: Lugares físicos o virtuales
- EVENTO: Sucesos significativos

prompts/entity_extraction_fr.txt

-Objectif-
Étant donné un document texte, identifiez toutes les entités et leurs relations.

-Étapes-
1. Identifiez toutes les entités dans le texte
2. Pour chaque entité, extrayez les attributs pertinents
3. Identifiez les relations entre les entités
4. Formatez la sortie comme spécifié

-Types d'Entité-
- PERSONNE: Individus humains
- ORGANISATION: Entreprises, institutions
- LIEU: Endroits physiques ou virtuels
- ÉVÉNEMENT: Événements importants

prompts/entity_extraction_ja.txt

-目標-
テキスト文書が与えられた場合、すべてのエンティティとその関係を識別します。

-手順-
1. テキスト内のすべてのエンティティを識別する
2. 各エンティティについて、関連する属性を抽出する
3. エンティティ間の関係を識別する
4. 指定された形式で出力する

-エンティティタイプ-
- 人物: 人間個人
- 組織: 企業、機関
- 場所: 物理的または仮想的な場所
- イベント: 重要な出来事

prompts/entity_extraction_zh.txt

-目标-
给定一个文本文档，识别所有实体及其关系。

-步骤-
1. 识别文本中的所有实体
2. 对于每个实体，提取相关属性
3. 识别实体之间的关系
4. 按指定格式输出

-实体类型-
- 人物: 人类个体
- 组织: 公司、机构
- 地点: 物理或虚拟位置
- 事件: 重要事件

Processing mixed-language documents

When working with documents in multiple languages:

Organize by language

Optionally separate documents by language for better tracking:

input/
├── en/
│   ├── document1.txt
│   └── document2.txt
├── es/
│   ├── documento1.txt
│   └── documento2.txt
└── fr/
    ├── document1.txt
    └── document2.txt
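
As a minimal sketch, a layout like the one above could be inspected before indexing by grouping files under their language folder. The helper name `group_by_language` is hypothetical, and it assumes each first-level folder name is a language code; none of this is part of GraphRAG itself.

```python
from collections import defaultdict
from pathlib import Path


def group_by_language(input_dir: str) -> dict[str, list[str]]:
    """Map each first-level folder name (assumed to be a language code)
    to the sorted names of the .txt files it contains."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path in sorted(Path(input_dir).glob("*/*.txt")):
        groups[path.parent.name].append(path.name)
    return dict(groups)
```

Calling `group_by_language("input")` on the tree above would give one entry each for `en`, `es`, and `fr`, which makes it easy to spot an empty or misnamed language folder.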

Use language-agnostic prompts

For truly mixed content, use English prompts with instructions to handle multiple languages:

-Goal-
Given a text document that may contain content in multiple languages,
identify all entities and relationships. Preserve entity names in their
original language and provide descriptions in the source language.

-Instructions-
- Maintain entity names in original language
- Generate descriptions in the same language as the source text
- Identify language of each extracted entity

Configure language metadata

Track language information in your chunking configuration:

chunking:
  prepend_metadata: ["language", "source_file"]

Querying in multiple languages

GraphRAG supports queries in different languages:

graphrag query "What are the main themes in this dataset?"

graphrag query "¿Cuáles son los temas principales en este conjunto de datos?"

graphrag query "Quels sont les thèmes principaux de cet ensemble de données?"

graphrag query "このデータセットの主なテーマは何ですか？"

graphrag query "这个数据集的主要主题是什么？"

The LLM will attempt to respond in the same language as your query. For mixed-language datasets, you can specify the desired response language in your query.
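
One way to specify the response language is to append an explicit instruction to the query text before passing it to `graphrag query`. The sketch below is illustrative only: the helper name `with_response_language` and the exact instruction wording are assumptions, and the model may not always follow the instruction.

```python
def with_response_language(query: str, language: str) -> str:
    """Append an explicit response-language instruction to a query."""
    return f"{query.strip()} Respond in {language}."


question = with_response_language(
    "Quels sont les thèmes principaux de cet ensemble de données?", "English"
)
print(question)
```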

Best practices for specific languages

Chinese (Simplified/Traditional)

Use larger chunk sizes due to character density
Consider using gpt-4 for better understanding of classical vs. modern Chinese
Test entity extraction with both simplified and traditional characters

chunking:
  size: 800  # Increase from default 400
  overlap: 200  # Increase overlap proportionally

Japanese

Account for mixed scripts (Hiragana, Katakana, Kanji)
Prefer sentence-based over token-based chunking
Test entity extraction with company names (which often use Kanji)

chunking:
  type: sentence  # Better for Japanese text
  size: 600
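
To illustrate why sentence boundaries matter for Japanese, the sketch below splits text after the sentence-ending marks 。, ！ and ？, then packs whole sentences into chunks. It is not GraphRAG's chunker: the function names are hypothetical, and the size limit here counts characters as a simplification.

```python
import re

# Zero-width split point right after Japanese sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[。！？])")


def split_sentences(text: str) -> list[str]:
    """Split after 。, ！ or ？ and drop empty pieces."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def chunk_sentences(sentences: list[str], max_chars: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.
    A single sentence longer than max_chars becomes its own chunk."""
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Because chunks never end mid-sentence, entity mentions that span Kanji and Katakana within one sentence stay together.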

Arabic/Hebrew (RTL languages)

Ensure text encoding is UTF-8
Be aware that some entity names may be transliterated
Test with mixed RTL/LTR content (common in technical documents)

input:
  encoding: utf-8
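
When testing mixed RTL/LTR content, it can help to find which documents actually mix directions. The following sketch uses the Unicode bidirectional class of each character from the standard library; the function names are hypothetical and the check is independent of GraphRAG.

```python
import unicodedata

# Bidirectional classes for strongly right-to-left characters:
# "R" covers Hebrew letters, "AL" covers Arabic letters.
RTL_CLASSES = {"R", "AL"}


def direction_profile(text: str) -> dict[str, int]:
    """Count strongly right-to-left and strongly left-to-right characters."""
    counts = {"rtl": 0, "ltr": 0}
    for char in text:
        bidi = unicodedata.bidirectional(char)
        if bidi in RTL_CLASSES:
            counts["rtl"] += 1
        elif bidi == "L":
            counts["ltr"] += 1
    return counts


def is_mixed_direction(text: str) -> bool:
    """True when the text contains both RTL and LTR letters."""
    counts = direction_profile(text)
    return counts["rtl"] > 0 and counts["ltr"] > 0
```

Digits, spaces, and punctuation have neutral or weak classes and are not counted, so a purely Arabic sentence containing numbers is not flagged as mixed.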

German

Long compound words may need special handling
Consider noun capitalization in entity extraction
Adjust chunk sizes for longer words

chunking:
  size: 500  # Slightly larger for compound words

Example: Multi-lingual research corpus

Here’s a complete example for processing academic papers in multiple languages:

Prepare data

Organize papers with language metadata:

input/papers.csv

id,title,text,language,author,year
1,"Machine Learning Basics","Full text...",en,"Smith",2023
2,"Apprentissage Automatique","Texte complet...",fr,"Dubois",2023
3,"機械学習の基礎","全文...",ja,"田中",2023

Configure for CSV input

settings.yaml

input:
  type: csv
  file_pattern: .*\.csv$
  id_column: id
  title_column: title
  text_column: text

chunking:
  prepend_metadata: ["language", "author", "year"]
  size: 600
  overlap: 100
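
Before indexing, it may be worth confirming that the CSV header contains every column the configuration refers to. The sketch below is a standalone check with a hypothetical helper name, `missing_columns`; it reads only the header row and tolerates a UTF-8 byte order mark.

```python
import csv


def missing_columns(csv_path: str, required: list[str]) -> list[str]:
    """Return the required column names that are absent from the CSV header."""
    # utf-8-sig also reads plain UTF-8 and strips a leading BOM if present.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in required if name not in header]
```

For the example above, `missing_columns("input/papers.csv", ["id", "title", "text", "language", "author", "year"])` should return an empty list.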

Create multilingual prompts

prompts/entity_extraction_multilingual.txt

-Goal-
Extract academic entities from research papers in any language.
Preserve technical terms and names in their original language.

-Entity Types-
- RESEARCHER: Authors and cited researchers
- CONCEPT: Scientific concepts and methods
- INSTITUTION: Universities and research organizations
- PUBLICATION: Papers, journals, conferences

-Instructions-
- Keep researcher names in original form
- Preserve technical terminology in source language
- Write descriptions in the language of the source document

Query across languages

# Query in English
graphrag query "What machine learning concepts are discussed?" --method global

# Query in French
graphrag query "Quels chercheurs sont mentionnés?" --method local

# Query in Japanese
graphrag query "どの研究機関が関与していますか？" --method local
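
The same queries could also be issued from a script. The sketch below mirrors the CLI calls above and passes the question as a separate argument, so non-ASCII text needs no shell quoting. The helper names are hypothetical, and `run_query` assumes `graphrag` is on the PATH and is run from the project directory.

```python
import subprocess


def build_query_command(question: str, method: str) -> list[str]:
    """Build the argument list for `graphrag query`, mirroring the CLI calls above."""
    return ["graphrag", "query", question, "--method", method]


def run_query(question: str, method: str) -> str:
    """Run the query and return its standard output decoded as UTF-8."""
    result = subprocess.run(
        build_query_command(question, method),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```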

Language detection and routing

For advanced use cases, implement language detection:

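import pandas as pd
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

# langdetect is non-deterministic by default; fix the seed for repeatable results
DetectorFactory.seed = 0

def detect_language(text):
    """Return a language code such as 'en', or 'unknown' if detection fails."""
    try:
        return detect(str(text))
    except LangDetectException:
        return 'unknown'

# Detect language in documents
df = pd.read_csv('input/documents.csv')
df['detected_language'] = df['text'].apply(detect_language)

# Route to language-specific prompts
language_prompts = {
    'en': 'prompts/entity_extraction_en.txt',
    'es': 'prompts/entity_extraction_es.txt',
    'fr': 'prompts/entity_extraction_fr.txt',
    'ja': 'prompts/entity_extraction_ja.txt',
}
# Languages without a dedicated prompt fall back to the English one
df['prompt_file'] = df['detected_language'].map(language_prompts).fillna(language_prompts['en'])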

Troubleshooting

Poor entity extraction in target language

Solutions:

Use language-specific prompts
Run auto prompt tuning with documents in target language
Increase chunk size for languages with long words or dense scripts
Verify model supports your target language well

Mixed language responses

Solutions:

Explicitly specify response language in your query
Use language-specific system prompts
Separate documents by language during indexing

Encoding issues

Solutions:

Ensure all files are UTF-8 encoded
Verify .env file doesn’t have encoding issues
Check that storage systems support Unicode
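
A quick way to find files that are not valid UTF-8 is to try decoding their raw bytes. This is a minimal standalone sketch; the helper name `find_non_utf8` is hypothetical.

```python
from pathlib import Path


def find_non_utf8(paths: list[str]) -> list[str]:
    """Return the paths whose bytes cannot be decoded as UTF-8."""
    bad = []
    for path in paths:
        try:
            Path(path).read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad
```

Files reported here can be re-saved as UTF-8 from their original encoding before indexing.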

Supported languages

OpenAI models (GPT-4, GPT-3.5-turbo, embeddings) have strong support for:

European: English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Swedish, Norwegian, Danish, Finnish, Greek, Turkish, Czech, Romanian

Asian, Middle Eastern, and other languages are also supported.

Next steps

Custom prompts

Create language-specific prompts

Document Q&A

Build multilingual Q&A systems

Azure deployment

Deploy with regional Azure endpoints

Prompt tuning

Auto-tune prompts for your language