Skip to content

G-Reasoner Reproduction

Goal

Reproduce the paper results for G-Reasoner using the maintained shell scripts under scripts/g-reasoner/. The workflow consists of three stages: text-embedding-based graph indexing, supervised fine-tuning, and QA evaluation.

Prerequisites

  • Environment setup from Install
  • OpenAI API key set in the environment (OPENAI_API_KEY) — used only in Stage 3 QA inference
  • Datasets placed under data/ (see Data Format)
  • A text embedding model available locally or via API (set as TEXT_EMBEDDING_MODEL)
  • Sufficient GPU memory: Stage 2 scripts default to 4 GPUs

Dataset Download

Download the testing split and full training data from OneDrive and place them under the data/ directory:

Text Only
data/
├── 2wikimultihopqa_test/
│   ├── processed/stage1/    # Pre-built graph index (provided)
│   └── raw/
├── hotpotqa_test_v2/
│   ├── processed/stage1/    # Pre-built graph index (provided)
│   └── raw/
├── hotpotqa_train_example/
│   ├── processed/stage1/
│   └── raw/
└── musique_test/
    ├── processed/stage1/    # Pre-built graph index (provided)
    └── raw/

G-Reasoner uses hotpotqa_test_v2 (not hotpotqa_test) as the validation split. The processed/stage1/ directories for the test sets are pre-built and provided in the download.

Evaluation on GraphRAG Benchmarks

Evaluate on two GraphRAG benchmarks. Start with data, build the KG index, run retrieval + QA, then score with the official scripts.

Install and set up the environment following the top-level README.md before running the commands here.

Benchmarks at a glance

Two GraphRAG benchmarks uses similar name, we will use the following names to avoid confusion.

  • graphrag_bench_cs → GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation (aka G-Bench CS). Repo
  • graphrag_benchmark_medical / graphrag_benchmark_novel → When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation (aka G-Bench Medical / Novel). Repo

1) Prepare data

Download our preprocessed data from here and place it under data/.

Text Only
data/
├── graphrag_bench_cs/
│   └── raw/
│       ├── documents.json
│       └── test.json
├── graphrag_benchmark_medical/
│   └── raw/
│       ├── documents.json
│       └── test.json
└── graphrag_benchmark_novel/
    └── raw/
        ├── documents.json
        └── test.json

2) Build the KG index

You can run indexing yourself or use our prebuilt KG indices. Download our prebuilt KG indices from here.

To build KG indices locally, create nodes.csv, edges.csv, relations.csv, and processed test.json for each dataset running the following script.

Bash
bash scripts/graphrag_benchmark_evaluation/scripts/stage1_data_index.sh

Expected layout after indexing:

Text Only
data/<dataset>/processed/stage1/
├── edges.csv
├── nodes.csv
├── relations.csv
└── test.json

3) Generate retrieval results

The QA scripts need retrieval results per dataset with top documents and entities for each question.

You can either run retrieval yourself or use our precomputed results. Download our precomputed retrieval results from here.

To generate retrieval results locally using GFM-RAG running the following script:

Bash
bash scripts/graphrag_benchmark_evaluation/scripts/stage2_retrieval.sh

4) Run QA

Use the provided scripts to load the retrieval outputs, build prompts, and call the LLM. Yon can check the prompts in config/qa_prompt/.

GraphRAG-Bench (CS)

Bash
bash scripts/graphrag_benchmark_evaluation/scripts/stage3_qa_inference_graphrag_bench.sh

Outputs: one JSON per task type (FB, MC, MS, OE, TF) named for the official evaluator.

GraphRAG-Benchmark (Novel / Medical)

Bash
bash scripts/graphrag_benchmark_evaluation/scripts/stage3_qa_inference_graphrag_benchmark.sh

Outputs: one prediction.jsonl.

5) Evaluate with the official scripts

We use the official evaluation scripts from corresponding repos with minimal modifications to fit our data and outputs.

GraphRAG-Bench (CS)

  1. Clone the repo and download their data.
  2. Copy the five JSON outputs from step 4 into GraphRAG-Bench/Datasets/output/g-reasoner/.
  3. Run the evaluator inside that repo, e.g. python evaluator.py.

GraphRAG-Benchmark (Novel / Medical)

  1. Clone the repo and download their data.
  2. Copy prediction.jsonl from step 4 into GraphRAG-Benchmark/Datasets/output/g-reasoner/<domain>/prediction.jsonl.
  3. Run the evaluation entry point, e.g. bash run_retreival_evaluation.sh and bash run_gen_evaluation.sh for retrieval and generation evaluation respectively.

Scripts Overview

Stage Script Purpose
1 scripts/g-reasoner/stage1_data_index.sh Build text-embedding-based graph index
2a scripts/g-reasoner/stage2_finetune.sh Supervised fine-tuning
2b scripts/g-reasoner/stage2_evaluate.sh Retrieval evaluation with a pre-trained checkpoint
3a scripts/g-reasoner/stage3_qa_inference.sh Batch QA from saved retrieval results
3b scripts/g-reasoner/stage3_qa_ircot_inference.sh Multi-step IRCoT QA reasoning

Stage 1: Build Graph Index

This stage uses a text embedding model (instead of LLM-based NER/OpenIE) to extract entities and construct a graph, then builds the SFT training pairs with the hipporag2_sft_constructor.

Output: data/<data_name>/processed/stage1/ containing nodes.csv, relations.csv, edges.csv, train.json, test.json.

Bash
bash scripts/g-reasoner/stage1_data_index.sh

What the script does

Index test datasets (no SFT constructor needed):

Bash
DATA_ROOT="data"
DATA_NAME_LIST="hotpotqa_test_v2 musique_test 2wikimultihopqa_test"
for DATA_NAME in ${DATA_NAME_LIST}; do
    python -m gfmrag.workflow.index_dataset \
        --config-path config/gfm_reasoner \
        dataset.root=${DATA_ROOT} \
        text_emb_model=${TEXT_EMBEDDING_MODEL} \
        dataset.data_name=${DATA_NAME}
done

Index training datasets (20 shards × 3 datasets, with SFT filtering):

Bash
DATA_ROOT="data"
DATA_NAME_LIST="hotpotqa_train musique_train 2wikimultihopqa_train"
START_N=0
END_N=19
for i in $(seq ${START_N} ${END_N}); do
    for DATA_NAME in ${DATA_NAME_LIST}; do
        python -m gfmrag.workflow.index_dataset \
            --config-path config/gfm_reasoner \
            dataset.root=${DATA_ROOT} \
            text_emb_model=${TEXT_EMBEDDING_MODEL} \
            sft_constructor.enable_filtering=${ENABLE_FILTERING} \
            dataset.data_name=${DATA_NAME}${i}
    done
done

Default Indexing Components

The gfm_reasoner config (gfmrag/workflow/config/gfm_reasoner/index_dataset.yaml) uses:

  • NER: LLM-based (llm_ner_model)
  • Entity Linking: ColBERT (colbert_el_model)
  • OpenIE: LLM-based (llm_openie_model)
  • Graph Constructor: KG constructor (kg_constructor)
  • Text Embedding: qwen3_8b (override with text_emb_model=<name>)
  • SFT Constructor: hipporag2_sft_constructor with optional enable_filtering

The pre-built test set indexes are included in the download — you can skip Stage 1 for test sets and go directly to Stage 2 evaluation.


Stage 2a: Supervised Fine-tuning

Fine-tune the G-Reasoner model on the QA datasets. The model uses a 6-layer QueryNBFNet with 1024-dimensional embeddings by default.

Bash
bash scripts/g-reasoner/stage2_finetune.sh

What the script does

Bash
N_GPU=4
N_EPOCH=10
BATCH_SIZE=4
N_DIM=1024
N_LAYERS="[${N_DIM},${N_DIM},${N_DIM},${N_DIM},${N_DIM},${N_DIM}]"
DATA_ROOT="data"

# Builds comma-separated list: musique_train0,...,2wikimultihopqa_train19
HYDRA_FULL_ERROR=1 torchrun --nproc_per_node=${N_GPU} -m gfmrag.workflow.sft_training \
    --config-path config/gfm_reasoner \
    model.entity_model.input_dim=${N_DIM} \
    model.entity_model.hidden_dims=${N_LAYERS} \
    datasets.cfgs.root=${DATA_ROOT} \
    datasets.train_names=[${TRAIN_DATA_NAME_LIST}] \
    datasets.valid_names=[hotpotqa_test_v2,musique_test,2wikimultihopqa_test] \
    trainer.args.num_epoch=${N_EPOCH} \
    trainer.args.train_batch_size=${BATCH_SIZE} \
    +trainer.training_mode=${TRAIN_MODE} \
    trainer.args.split_graph_training=${SPLIT_GRAPH_TRAINING} \
    trainer.args.split_graph_inference=${SPLIT_GRAPH_INFERENCE} \
    trainer.args.split_graph_partition=${SPLIT_GRAPH_METHOD}

Key training arguments:

Argument Default Description
N_GPU 4 Number of GPUs
N_EPOCH 10 Training epochs
BATCH_SIZE 4 Batch size per GPU
N_DIM 1024 Entity embedding dimension
TRAIN_MODE Training mode (set via env var)
SPLIT_GRAPH_TRAINING false Split large graphs during training
SPLIT_GRAPH_INFERENCE false Split large graphs during inference
SPLIT_GRAPH_METHOD metis Partition algorithm (metis or contiguous)

Output: Checkpoint and per-dataset retrieval predictions under outputs/qa_finetune/<date>/<time>/.


Stage 2b: Retrieval Evaluation

Evaluate the pre-trained rmanluo/G-reasoner-34M checkpoint (or a locally fine-tuned model) without re-training.

Bash
bash scripts/g-reasoner/stage2_evaluate.sh

What the script does

Bash
N_GPU=2
DATA_ROOT="data"
CHECKPOINT="rmanluo/G-reasoner-34M"

HYDRA_FULL_ERROR=1 torchrun --nproc_per_node=${N_GPU} -m gfmrag.workflow.sft_training \
    --config-path config/gfm_reasoner \
    --config-name sft_training \
    load_model_from_pretrained=${CHECKPOINT} \
    +datasets.cfgs.skip_empty_target=true \
    datasets.cfgs.root=${DATA_ROOT} \
    datasets.train_names=[] \
    datasets.valid_names=[hotpotqa_test_v2,musique_test,2wikimultihopqa_test] \
    +trainer.args.eval_batch_size=1 \
    trainer.metrics=[hits@2,hits@5,recall@2,recall@5,mrr] \
    trainer.args.do_train=false \
    trainer.args.do_eval=true \
    trainer.args.do_predict=true \
    trainer.args.split_graph_inference=false \
    trainer.args.split_graph_partition=metis

Output: predictions_<data_name>.json files under outputs/qa_finetune/<date>/<time>/.


Stage 3a: Single-Step QA Reasoning

Reads the retrieval predictions from Stage 2b and generates answers with an LLM in one shot.

Bash
bash scripts/g-reasoner/stage3_qa_inference.sh

What the script does

The script defaults to 2wikimultihopqa; change DATA_NAME to run other datasets:

Bash
DATA_ROOT="data"
DATA_NAME="2wikimultihopqa"   # hotpotqa | musique | 2wikimultihopqa
LLM="gpt-4o-mini"
DOC_TOP_K=5
N_THREAD=10
RETRIEVED_RESULT_PATH="outputs/qa_finetune/latest/predictions_${DATA_NAME}_test.json"
NODE_PATH="${DATA_ROOT}/${DATA_NAME}_test/processed/stage1/nodes.csv"

python -m gfmrag.workflow.qa \
    --config-path config/gfm_reasoner \
    qa_prompt=${DATA_NAME} \
    qa_evaluator=${DATA_NAME} \
    llm.model_name_or_path=${LLM} \
    test.n_threads=${N_THREAD} \
    test.top_k=${DOC_TOP_K} \
    test.retrieved_result_path=${RETRIEVED_RESULT_PATH} \
    test.target_types=[document] \
    test.node_path=${NODE_PATH}

Per-dataset Commands

HotpotQA

Bash
python -m gfmrag.workflow.qa \
    --config-path config/gfm_reasoner \
    qa_prompt=hotpotqa qa_evaluator=hotpotqa \
    test.retrieved_result_path=outputs/qa_finetune/latest/predictions_hotpotqa_test_v2.json \
    test.node_path=data/hotpotqa_test_v2/processed/stage1/nodes.csv

MuSiQue

Bash
python -m gfmrag.workflow.qa \
    --config-path config/gfm_reasoner \
    qa_prompt=musique qa_evaluator=musique \
    test.retrieved_result_path=outputs/qa_finetune/latest/predictions_musique_test.json \
    test.node_path=data/musique_test/processed/stage1/nodes.csv

2WikiMultihopQA

Bash
python -m gfmrag.workflow.qa \
    --config-path config/gfm_reasoner \
    qa_prompt=2wikimultihopqa qa_evaluator=2wikimultihopqa \
    test.retrieved_result_path=outputs/qa_finetune/latest/predictions_2wikimultihopqa_test.json \
    test.node_path=data/2wikimultihopqa_test/processed/stage1/nodes.csv

Output: outputs/qa_inference/<date>/<time>/prediction.jsonl


Stage 3b: Multi-Step IRCoT QA Reasoning

Runs iterative retrieval and reasoning (IRCoT) using the G-Reasoner retriever and an LLM agent. Unlike Stage 3a, this stage performs retrieval online and does not require pre-computed retrieval results.

Bash
bash scripts/g-reasoner/stage3_qa_ircot_inference.sh

What the script does

The script defaults to 2wikimultihopqa and 5 test samples for a quick sanity check:

Bash
DATA_ROOT="data"
DATA_NAME="2wikimultihopqa"   # hotpotqa | musique | 2wikimultihopqa
LLM="gpt-4o-mini"
MAX_STEPS=3
MAX_SAMPLE=5
MODEL_PATH="save_models/G-reasoner-34M"   # or rmanluo/G-reasoner-34M

HYDRA_FULL_ERROR=1 python -m gfmrag.workflow.qa_ircot_inference \
    --config-path config/gfm_reasoner \
    --config-name stage3_qa_ircot_inference \
    dataset.root=${DATA_ROOT} \
    llm.model_name_or_path=${LLM} \
    qa_prompt=${DATA_NAME} \
    qa_evaluator=${DATA_NAME} \
    agent_prompt=${DATA_NAME}_ircot \
    test.max_steps=${MAX_STEPS} \
    test.max_test_samples=${MAX_SAMPLE} \
    dataset.data_name=${DATA_NAME}_test \
    graph_retriever.model_path=${MODEL_PATH}

Set test.max_test_samples=-1 to run on the full test set.

Per-dataset Commands

HotpotQA (2 reasoning steps)

Bash
python -m gfmrag.workflow.qa_ircot_inference \
    --config-path config/gfm_reasoner \
    --config-name stage3_qa_ircot_inference \
    qa_prompt=hotpotqa qa_evaluator=hotpotqa \
    agent_prompt=hotpotqa_ircot \
    dataset.data_name=hotpotqa_test_v2 \
    graph_retriever.model_path=rmanluo/G-reasoner-34M \
    test.max_steps=2 test.max_test_samples=-1

MuSiQue (4 reasoning steps)

Bash
python -m gfmrag.workflow.qa_ircot_inference \
    --config-path config/gfm_reasoner \
    --config-name stage3_qa_ircot_inference \
    qa_prompt=musique qa_evaluator=musique \
    agent_prompt=musique_ircot \
    dataset.data_name=musique_test \
    graph_retriever.model_path=rmanluo/G-reasoner-34M \
    test.max_steps=4 test.max_test_samples=-1

2WikiMultihopQA (3 reasoning steps)

Bash
python -m gfmrag.workflow.qa_ircot_inference \
    --config-path config/gfm_reasoner \
    --config-name stage3_qa_ircot_inference \
    qa_prompt=2wikimultihopqa qa_evaluator=2wikimultihopqa \
    agent_prompt=2wikimultihopqa_ircot \
    dataset.data_name=2wikimultihopqa_test \
    graph_retriever.model_path=rmanluo/G-reasoner-34M \
    test.max_steps=3 test.max_test_samples=-1

Output: outputs/qa_agent_inference/<data_name>/<date>/<time>/prediction.jsonl


Expected Outputs Summary

Stage Output Location Contents
Stage 1 data/<data_name>/processed/stage1/ nodes.csv, relations.csv, edges.csv, train.json, test.json
Stage 2a (train) outputs/qa_finetune/<date>/<time>/ Model checkpoints, training logs
Stage 2b (eval) outputs/qa_finetune/<date>/<time>/ predictions_<data_name>.json per dataset
Stage 3a outputs/qa_inference/<date>/<time>/ prediction.jsonl with answers and scores
Stage 3b outputs/qa_agent_inference/<data_name>/<date>/<time>/ prediction.jsonl with answers and scores

Notes

  • Config path: All G-Reasoner scripts pass --config-path config/gfm_reasoner to select the G-Reasoner config family instead of the default GFM-RAG config. This is a relative path resolved from the working directory.
  • Pre-built indexes: The test set processed/stage1/ directories are included in the provided download. You do not need to run Stage 1 for test sets.
  • Released checkpoint: rmanluo/G-reasoner-34M can be loaded directly for Stage 2b retrieval evaluation and Stage 3b IRCoT inference, skipping Stage 2a training.
  • hotpotqa_test_v2: G-Reasoner validates on hotpotqa_test_v2 (not hotpotqa_test). Ensure you download and index this split.
  • TEXT_EMBEDDING_MODEL: Stage 1 requires a text embedding model. Set this env variable to a config name under gfmrag/workflow/config/text_emb_model/ (default: qwen3_8b). Unlike GFM-RAG, this replaces LLM NER/OpenIE, so no OpenAI API key is needed in Stage 1.
  • TRAIN_MODE: Required for Stage 2a fine-tuning. Check the gfm_reasoner config or paper appendix for valid values.
  • Graph splitting: For very large graphs, enable split_graph_training=true and/or split_graph_inference=true. Use split_graph_partition=metis (requires METIS library) or contiguous.
  • Stage 2b produces Stage 3a inputs: The QA script requires predictions_<data_name>.json from Stage 2b and the matching nodes.csv. The path outputs/qa_finetune/latest/ is a symlink to the most recent run.
  • HYDRA_FULL_ERROR=1: Set this variable to get full Python tracebacks instead of truncated Hydra error messages.
  • LLM API costs: Stage 3 calls the OpenAI API (GPT-4o-mini by default). Set llm.model_name_or_path to change the model.
  • For general usage outside these reproduction scripts, see the Workflow pages.