G-Reasoner Reproduction¶

Goal¶

Reproduce the paper results for G-Reasoner using the maintained shell scripts under scripts/g-reasoner/. The workflow consists of three stages: text-embedding-based graph indexing, supervised fine-tuning, and QA evaluation.

Prerequisites¶

Environment setup from Install
OpenAI API key set in the environment (OPENAI_API_KEY) — used only in Stage 3 QA inference
Datasets placed under data/ (see Data Format)
A text embedding model available locally or via API (set as TEXT_EMBEDDING_MODEL)
Sufficient GPU memory: the Stage 2a script defaults to 4 GPUs and the Stage 2b script to 2
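
Before launching the longer stages, it can help to confirm the basic prerequisites in one place. The following is a minimal sketch; `preflight_check` is a hypothetical helper, not part of the maintained scripts, and it only covers the API key and the dataset directory listed above.

```shell
# Hypothetical pre-flight helper (not shipped with G-Reasoner).
# Prints one line per missing prerequisite and returns non-zero if any is missing.
preflight_check() {
    local data_root="${1:-data}"
    local missing=0

    # The OpenAI key is only consumed by Stage 3 QA inference.
    if [ -z "${OPENAI_API_KEY:-}" ]; then
        echo "missing: OPENAI_API_KEY (needed for Stage 3 QA inference)"
        missing=1
    fi
    # All stages read datasets from this directory.
    if [ ! -d "${data_root}" ]; then
        echo "missing: dataset directory '${data_root}'"
        missing=1
    fi
    return "${missing}"
}
```

A possible use is `preflight_check data && bash scripts/g-reasoner/stage2_evaluate.sh`, so the run stops early when something is absent.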

Dataset Download¶

Download the testing split and full training data from OneDrive and place them under the data/ directory:

Text Only

data/
├── 2wikimultihopqa_test/
│   ├── processed/stage1/    # Pre-built graph index (provided)
│   └── raw/
├── hotpotqa_test_v2/
│   ├── processed/stage1/    # Pre-built graph index (provided)
│   └── raw/
├── hotpotqa_train_example/
│   ├── processed/stage1/
│   └── raw/
└── musique_test/
    ├── processed/stage1/    # Pre-built graph index (provided)
    └── raw/

G-Reasoner uses hotpotqa_test_v2 (not hotpotqa_test) as the validation split. The processed/stage1/ directories for the test sets are pre-built and provided in the download.
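
To confirm that the download was unpacked as shown, a short check over the three test sets can be sketched as follows. `has_stage1_index` is a hypothetical helper, and the exact set of files it looks for (the three CSV files plus `test.json`) is an assumption based on the Stage 1 outputs named in this guide.

```shell
# Hypothetical helper: succeed only if a dataset's Stage 1 index looks complete.
# Arguments: <data root> <dataset name>
has_stage1_index() {
    local dir="$1/$2/processed/stage1"
    local f
    # Assumed file list; adjust if your download differs.
    for f in nodes.csv relations.csv edges.csv test.json; do
        [ -f "${dir}/${f}" ] || return 1
    done
    return 0
}

for name in hotpotqa_test_v2 musique_test 2wikimultihopqa_test; do
    if has_stage1_index data "${name}"; then
        echo "${name}: index found"
    else
        echo "${name}: index missing, run Stage 1 for this dataset"
    fi
done
```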

Scripts Overview¶

| Stage | Script | Purpose |
|---|---|---|
| 1 | `scripts/g-reasoner/stage1_data_index.sh` | Build text-embedding-based graph index |
| 2a | `scripts/g-reasoner/stage2_finetune.sh` | Supervised fine-tuning |
| 2b | `scripts/g-reasoner/stage2_evaluate.sh` | Retrieval evaluation with a pre-trained checkpoint |
| 3a | `scripts/g-reasoner/stage3_qa_inference.sh` | Batch QA from saved retrieval results |
| 3b | `scripts/g-reasoner/stage3_qa_ircot_inference.sh` | Multi-step IRCoT QA reasoning |

Stage 1: Build Graph Index¶

This stage uses a text embedding model (instead of LLM-based NER/OpenIE) to extract entities and construct a graph, then builds the SFT training pairs with the hipporag2_sft_constructor.

Output: data/<data_name>/processed/stage1/ containing nodes.csv, relations.csv, edges.csv, train.json, test.json.

Bash

bash scripts/g-reasoner/stage1_data_index.sh

What the script does¶

Index test datasets (no SFT constructor needed):

Bash

DATA_ROOT="data"
DATA_NAME_LIST="hotpotqa_test_v2 musique_test 2wikimultihopqa_test"
for DATA_NAME in ${DATA_NAME_LIST}; do
    python -m gfmrag.workflow.index_dataset \
        --config-path config/gfm_reasoner \
        dataset.root=${DATA_ROOT} \
        text_emb_model=${TEXT_EMBEDDING_MODEL} \
        dataset.data_name=${DATA_NAME}
done

Index training datasets (20 shards × 3 datasets, with SFT filtering):

Bash

DATA_ROOT="data"
DATA_NAME_LIST="hotpotqa_train musique_train 2wikimultihopqa_train"
START_N=0
END_N=19
for i in $(seq ${START_N} ${END_N}); do
    for DATA_NAME in ${DATA_NAME_LIST}; do
        python -m gfmrag.workflow.index_dataset \
            --config-path config/gfm_reasoner \
            dataset.root=${DATA_ROOT} \
            text_emb_model=${TEXT_EMBEDDING_MODEL} \
            sft_constructor.enable_filtering=${ENABLE_FILTERING} \
            dataset.data_name=${DATA_NAME}${i}
    done
done

Default Indexing Components¶

The gfm_reasoner config (gfmrag/workflow/config/gfm_reasoner/index_dataset.yaml) uses:

NER: LLM-based (llm_ner_model)
Entity Linking: ColBERT (colbert_el_model)
OpenIE: LLM-based (llm_openie_model)
Graph Constructor: KG constructor (kg_constructor)
Text Embedding: qwen3_8b (override with text_emb_model=<name>)
SFT Constructor: hipporag2_sft_constructor with optional enable_filtering

The pre-built test set indexes are included in the download — you can skip Stage 1 for test sets and go directly to Stage 2 evaluation.
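
The indexing loops above read `TEXT_EMBEDDING_MODEL` and `ENABLE_FILTERING` from the environment. If you copy those loops into your own shell instead of running the maintained script, one way to avoid passing empty overrides is to give both variables explicit fallbacks first. In this sketch, `qwen3_8b` is the documented default embedding config, while `false` for filtering is an assumption, since no default is stated here.

```shell
# Fall back to the documented default embedding config when the variable is unset or empty.
TEXT_EMBEDDING_MODEL="${TEXT_EMBEDDING_MODEL:-qwen3_8b}"
# ASSUMPTION: no default for filtering is documented; "false" is only a placeholder choice.
ENABLE_FILTERING="${ENABLE_FILTERING:-false}"
export TEXT_EMBEDDING_MODEL ENABLE_FILTERING

# Show the overrides that the indexing commands would receive.
echo "text_emb_model=${TEXT_EMBEDDING_MODEL}"
echo "sft_constructor.enable_filtering=${ENABLE_FILTERING}"
```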

Stage 2a: Supervised Fine-tuning¶

Fine-tune the G-Reasoner model on the QA datasets. The model uses a 6-layer QueryNBFNet with 1024-dimensional embeddings by default.

Bash

bash scripts/g-reasoner/stage2_finetune.sh

What the script does¶

Bash

N_GPU=4
N_EPOCH=10
BATCH_SIZE=4
N_DIM=1024
N_LAYERS="[${N_DIM},${N_DIM},${N_DIM},${N_DIM},${N_DIM},${N_DIM}]"
DATA_ROOT="data"

# TRAIN_DATA_NAME_LIST is a comma-separated list the script builds before this call:
# musique_train0,...,2wikimultihopqa_train19
HYDRA_FULL_ERROR=1 torchrun --nproc_per_node=${N_GPU} -m gfmrag.workflow.sft_training \
    --config-path config/gfm_reasoner \
    model.entity_model.input_dim=${N_DIM} \
    model.entity_model.hidden_dims=${N_LAYERS} \
    datasets.cfgs.root=${DATA_ROOT} \
    datasets.train_names=[${TRAIN_DATA_NAME_LIST}] \
    datasets.valid_names=[hotpotqa_test_v2,musique_test,2wikimultihopqa_test] \
    trainer.args.num_epoch=${N_EPOCH} \
    trainer.args.train_batch_size=${BATCH_SIZE} \
    +trainer.training_mode=${TRAIN_MODE} \
    trainer.args.split_graph_training=${SPLIT_GRAPH_TRAINING} \
    trainer.args.split_graph_inference=${SPLIT_GRAPH_INFERENCE} \
    trainer.args.split_graph_partition=${SPLIT_GRAPH_METHOD}

Key training arguments:

| Argument | Default | Description |
|---|---|---|
| `N_GPU` | 4 | Number of GPUs |
| `N_EPOCH` | 10 | Training epochs |
| `BATCH_SIZE` | 4 | Batch size per GPU |
| `N_DIM` | 1024 | Entity embedding dimension |
| `TRAIN_MODE` | (none) | Training mode (set via env var) |
| `SPLIT_GRAPH_TRAINING` | false | Split large graphs during training |
| `SPLIT_GRAPH_INFERENCE` | false | Split large graphs during inference |
| `SPLIT_GRAPH_METHOD` | `metis` | Partition algorithm (`metis` or `contiguous`) |
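
The excerpt above passes `TRAIN_DATA_NAME_LIST` to `datasets.train_names` without showing how the list is assembled. A minimal sketch of one way to build it follows. `build_train_list` is a hypothetical helper, and the ordering of names is an assumption chosen only so that the list starts with `musique_train0` and ends with `2wikimultihopqa_train19`; the maintained script may order the shards differently.

```shell
# Sketch: build the comma-separated shard list consumed by datasets.train_names.
# Arguments: <first shard index> <last shard index>
build_train_list() {
    local start="$1" end="$2" list="" name i
    # ASSUMPTION: dataset order; only the first and last names are implied by the script comment.
    for name in musique_train hotpotqa_train 2wikimultihopqa_train; do
        i="${start}"
        while [ "${i}" -le "${end}" ]; do
            # Append with a comma separator, except before the first element.
            list="${list:+${list},}${name}${i}"
            i=$((i + 1))
        done
    done
    echo "${list}"
}

TRAIN_DATA_NAME_LIST="$(build_train_list 0 19)"
# Count the entries: 20 shards for each of 3 datasets gives 60 names.
echo "${TRAIN_DATA_NAME_LIST}" | tr ',' '\n' | wc -l
```

For a quick smoke test on fewer shards, `build_train_list 0 0` yields one shard per dataset.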

Output: Checkpoint and per-dataset retrieval predictions under outputs/qa_finetune/<date>/<time>/.

Stage 2b: Retrieval Evaluation¶

Evaluate the pre-trained rmanluo/G-reasoner-34M checkpoint (or a locally fine-tuned model) without re-training.

Bash

bash scripts/g-reasoner/stage2_evaluate.sh

What the script does¶

Bash

N_GPU=2
DATA_ROOT="data"
CHECKPOINT="rmanluo/G-reasoner-34M"

HYDRA_FULL_ERROR=1 torchrun --nproc_per_node=${N_GPU} -m gfmrag.workflow.sft_training \
    --config-path config/gfm_reasoner \
    --config-name sft_training \
    load_model_from_pretrained=${CHECKPOINT} \
    +datasets.cfgs.skip_empty_target=true \
    datasets.cfgs.root=${DATA_ROOT} \
    datasets.train_names=[] \
    datasets.valid_names=[hotpotqa_test_v2,musique_test,2wikimultihopqa_test] \
    +trainer.args.eval_batch_size=1 \
    trainer.metrics=[hits@2,hits@5,recall@2,recall@5,mrr] \
    trainer.args.do_train=false \
    trainer.args.do_eval=true \
    trainer.args.do_predict=true \
    trainer.args.split_graph_inference=false \
    trainer.args.split_graph_partition=metis

Output: predictions_<data_name>.json files under outputs/qa_finetune/<date>/<time>/.
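
Because each run writes to a fresh `<date>/<time>` directory, it is useful to be able to locate the newest one. The sketch below uses a hypothetical helper, `latest_run_dir`, and assumes that run directories start with a digit and sort chronologically by name (for example `2025-01-31/12-00-00`).

```shell
# Hypothetical helper: print the newest <date>/<time> run directory under an output root.
# Relies on glob results being sorted, so the last match is the most recent run.
latest_run_dir() {
    local root="$1" d latest=""
    # Restrict to names starting with a digit so entries such as "latest" are skipped.
    for d in "${root}"/[0-9]*/[0-9]*/; do
        [ -d "${d}" ] || continue
        latest="${d%/}"
    done
    [ -n "${latest}" ] && echo "${latest}"
}
```

For example, `ls "$(latest_run_dir outputs/qa_finetune)"/predictions_*.json` would list the prediction files of the most recent run.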

Stage 3a: Single-Step QA Reasoning¶

Reads the retrieval predictions from Stage 2b and generates answers with an LLM in a single reasoning step.

Bash

bash scripts/g-reasoner/stage3_qa_inference.sh

What the script does¶

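The script defaults to 2wikimultihopqa; change DATA_NAME to run other datasets. The path templates append _test to DATA_NAME, so for HotpotQA the result and node paths must be adjusted to hotpotqa_test_v2 (see Per-dataset Commands):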

Bash

DATA_ROOT="data"
DATA_NAME="2wikimultihopqa"   # hotpotqa | musique | 2wikimultihopqa
LLM="gpt-4o-mini"
DOC_TOP_K=5
N_THREAD=10
RETRIEVED_RESULT_PATH="outputs/qa_finetune/latest/predictions_${DATA_NAME}_test.json"
NODE_PATH="${DATA_ROOT}/${DATA_NAME}_test/processed/stage1/nodes.csv"

python -m gfmrag.workflow.qa \
    --config-path config/gfm_reasoner \
    qa_prompt=${DATA_NAME} \
    qa_evaluator=${DATA_NAME} \
    llm.model_name_or_path=${LLM} \
    test.n_threads=${N_THREAD} \
    test.top_k=${DOC_TOP_K} \
    test.retrieved_result_path=${RETRIEVED_RESULT_PATH} \
    test.target_types=[document] \
    test.node_path=${NODE_PATH}
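
One way to derive both paths from a short dataset name, including the `hotpotqa_test_v2` special case, is a small lookup. The helper name `test_split_name` is hypothetical and not part of the maintained script.

```shell
# Hypothetical helper: map a short dataset name to its test split directory.
test_split_name() {
    case "$1" in
        hotpotqa) echo "hotpotqa_test_v2" ;;
        musique|2wikimultihopqa) echo "${1}_test" ;;
        *) echo "unknown dataset: $1" >&2; return 1 ;;
    esac
}

DATA_ROOT="data"
DATA_NAME="hotpotqa"
SPLIT="$(test_split_name "${DATA_NAME}")"
# Both paths now follow the split name rather than "${DATA_NAME}_test".
RETRIEVED_RESULT_PATH="outputs/qa_finetune/latest/predictions_${SPLIT}.json"
NODE_PATH="${DATA_ROOT}/${SPLIT}/processed/stage1/nodes.csv"
echo "${RETRIEVED_RESULT_PATH}"
echo "${NODE_PATH}"
```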

Per-dataset Commands¶

HotpotQA

Bash

python -m gfmrag.workflow.qa \
    --config-path config/gfm_reasoner \
    qa_prompt=hotpotqa qa_evaluator=hotpotqa \
    test.retrieved_result_path=outputs/qa_finetune/latest/predictions_hotpotqa_test_v2.json \
    test.node_path=data/hotpotqa_test_v2/processed/stage1/nodes.csv

MuSiQue

Bash

python -m gfmrag.workflow.qa \
    --config-path config/gfm_reasoner \
    qa_prompt=musique qa_evaluator=musique \
    test.retrieved_result_path=outputs/qa_finetune/latest/predictions_musique_test.json \
    test.node_path=data/musique_test/processed/stage1/nodes.csv

2WikiMultihopQA

Bash

python -m gfmrag.workflow.qa \
    --config-path config/gfm_reasoner \
    qa_prompt=2wikimultihopqa qa_evaluator=2wikimultihopqa \
    test.retrieved_result_path=outputs/qa_finetune/latest/predictions_2wikimultihopqa_test.json \
    test.node_path=data/2wikimultihopqa_test/processed/stage1/nodes.csv

Output: outputs/qa_inference/<date>/<time>/prediction.jsonl

Stage 3b: Multi-Step IRCoT QA Reasoning¶

Runs iterative retrieval and reasoning (IRCoT) using the G-Reasoner retriever and an LLM agent. Unlike Stage 3a, this stage performs retrieval online and does not require pre-computed retrieval results.

Bash

bash scripts/g-reasoner/stage3_qa_ircot_inference.sh

What the script does¶

The script defaults to 2wikimultihopqa and 5 test samples for a quick sanity check:

Bash

DATA_ROOT="data"
DATA_NAME="2wikimultihopqa"   # hotpotqa | musique | 2wikimultihopqa
LLM="gpt-4o-mini"
MAX_STEPS=3
MAX_SAMPLE=5
MODEL_PATH="save_models/G-reasoner-34M"   # or rmanluo/G-reasoner-34M

HYDRA_FULL_ERROR=1 python -m gfmrag.workflow.qa_ircot_inference \
    --config-path config/gfm_reasoner \
    --config-name stage3_qa_ircot_inference \
    dataset.root=${DATA_ROOT} \
    llm.model_name_or_path=${LLM} \
    qa_prompt=${DATA_NAME} \
    qa_evaluator=${DATA_NAME} \
    agent_prompt=${DATA_NAME}_ircot \
    test.max_steps=${MAX_STEPS} \
    test.max_test_samples=${MAX_SAMPLE} \
    dataset.data_name=${DATA_NAME}_test \
    graph_retriever.model_path=${MODEL_PATH}

Set test.max_test_samples=-1 to run on the full test set.

Per-dataset Commands¶

HotpotQA (2 reasoning steps)

Bash

python -m gfmrag.workflow.qa_ircot_inference \
    --config-path config/gfm_reasoner \
    --config-name stage3_qa_ircot_inference \
    qa_prompt=hotpotqa qa_evaluator=hotpotqa \
    agent_prompt=hotpotqa_ircot \
    dataset.data_name=hotpotqa_test_v2 \
    graph_retriever.model_path=rmanluo/G-reasoner-34M \
    test.max_steps=2 test.max_test_samples=-1

MuSiQue (4 reasoning steps)

Bash

python -m gfmrag.workflow.qa_ircot_inference \
    --config-path config/gfm_reasoner \
    --config-name stage3_qa_ircot_inference \
    qa_prompt=musique qa_evaluator=musique \
    agent_prompt=musique_ircot \
    dataset.data_name=musique_test \
    graph_retriever.model_path=rmanluo/G-reasoner-34M \
    test.max_steps=4 test.max_test_samples=-1

2WikiMultihopQA (3 reasoning steps)

Bash

python -m gfmrag.workflow.qa_ircot_inference \
    --config-path config/gfm_reasoner \
    --config-name stage3_qa_ircot_inference \
    qa_prompt=2wikimultihopqa qa_evaluator=2wikimultihopqa \
    agent_prompt=2wikimultihopqa_ircot \
    dataset.data_name=2wikimultihopqa_test \
    graph_retriever.model_path=rmanluo/G-reasoner-34M \
    test.max_steps=3 test.max_test_samples=-1

Output: outputs/qa_agent_inference/<data_name>/<date>/<time>/prediction.jsonl
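
The three per-dataset commands above differ only in the dataset name, the split name, and the step count, so they can be driven from one loop. The sketch below is a convenience wrapper under stated assumptions: `ircot_max_steps` and the `DRY_RUN` switch are hypothetical, and with the default `DRY_RUN=1` the commands are only printed, not executed.

```shell
# Step counts are the ones used in the per-dataset commands in this guide.
ircot_max_steps() {
    case "$1" in
        hotpotqa) echo 2 ;;
        musique) echo 4 ;;
        2wikimultihopqa) echo 3 ;;
        *) return 1 ;;
    esac
}

# Hypothetical switch: 1 prints each command, any other value runs it.
DRY_RUN="${DRY_RUN:-1}"
for name in hotpotqa musique 2wikimultihopqa; do
    split="${name}_test"
    if [ "${name}" = "hotpotqa" ]; then
        split="hotpotqa_test_v2"
    fi
    # Store the command in the positional parameters so it can be printed or run as-is.
    set -- python -m gfmrag.workflow.qa_ircot_inference \
        --config-path config/gfm_reasoner \
        --config-name stage3_qa_ircot_inference \
        qa_prompt="${name}" qa_evaluator="${name}" \
        agent_prompt="${name}_ircot" \
        dataset.data_name="${split}" \
        graph_retriever.model_path=rmanluo/G-reasoner-34M \
        test.max_steps="$(ircot_max_steps "${name}")" test.max_test_samples=-1
    if [ "${DRY_RUN}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
done
```

Running it with `DRY_RUN=0` executes the three evaluations in sequence on the full test sets.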

Expected Outputs Summary¶

| Stage | Output Location | Contents |
|---|---|---|
| Stage 1 | `data/<data_name>/processed/stage1/` | `nodes.csv`, `relations.csv`, `edges.csv`, `train.json`, `test.json` |
| Stage 2a (train) | `outputs/qa_finetune/<date>/<time>/` | Model checkpoints, training logs |
| Stage 2b (eval) | `outputs/qa_finetune/<date>/<time>/` | `predictions_<data_name>.json` per dataset |
| Stage 3a | `outputs/qa_inference/<date>/<time>/` | `prediction.jsonl` with answers and scores |
| Stage 3b | `outputs/qa_agent_inference/<data_name>/<date>/<time>/` | `prediction.jsonl` with answers and scores |

Evaluation on GraphRAG Benchmarks¶

This section evaluates G-Reasoner on two GraphRAG benchmarks: prepare the data, build the KG index, run retrieval and QA, then score the outputs with the official scripts.

Install and set up the environment following the top-level README.md before running the commands here.

Benchmarks at a glance¶

The two GraphRAG benchmarks have similar names, so we use the following identifiers to avoid confusion.

graphrag_bench_cs → GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation (aka G-Bench CS). Repo
graphrag_benchmark_medical / graphrag_benchmark_novel → When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation (aka G-Bench Medical / Novel). Repo

1) Prepare data¶

Download our preprocessed data from here and place it under data/.

Text Only

data/
├── graphrag_bench_cs/
│   └── raw/
│       ├── documents.json
│       └── test.json
├── graphrag_benchmark_medical/
│   └── raw/
│       ├── documents.json
│       └── test.json
└── graphrag_benchmark_novel/
    └── raw/
        ├── documents.json
        └── test.json

2) Build the KG index¶

You can run indexing yourself or use our prebuilt KG indices. Download our prebuilt KG indices from here.

To build KG indices locally, create nodes.csv, edges.csv, relations.csv, and the processed test.json for each dataset by running the following script.

Bash

bash scripts/graphrag_benchmark_evaluation/scripts/stage1_data_index.sh

Expected layout after indexing:

Text Only

data/<dataset>/processed/stage1/
├── edges.csv
├── nodes.csv
├── relations.csv
└── test.json

3) Generate retrieval results¶

The QA scripts need retrieval results per dataset with top documents and entities for each question.

You can either run retrieval yourself or use our precomputed results. Download our precomputed retrieval results from here.

To generate retrieval results locally using GFM-RAG, run the following script:

Bash

bash scripts/graphrag_benchmark_evaluation/scripts/stage2_retrieval.sh

4) Run QA¶

Use the provided scripts to load the retrieval outputs, build prompts, and call the LLM. You can check the prompts in config/qa_prompt/.

GraphRAG-Bench (CS)¶

Bash

bash scripts/graphrag_benchmark_evaluation/scripts/stage3_qa_inference_graphrag_bench.sh

Outputs: one JSON file per task type (FB, MC, MS, OE, TF), named as the official evaluator expects.

GraphRAG-Benchmark (Novel / Medical)¶

Bash

bash scripts/graphrag_benchmark_evaluation/scripts/stage3_qa_inference_graphrag_benchmark.sh

Outputs: one prediction.jsonl.

5) Evaluate with the official scripts¶

We use the official evaluation scripts from the corresponding repos, with minimal modifications to fit our data and outputs.

GraphRAG-Bench (CS)¶

Clone the repo and download their data.
Copy the five JSON outputs from step 4 into GraphRAG-Bench/Datasets/output/g-reasoner/.
Run the evaluator inside that repo, e.g. python evaluator.py.

GraphRAG-Benchmark (Novel / Medical)¶

Clone the repo and download their data.
Copy prediction.jsonl from step 4 into GraphRAG-Benchmark/Datasets/output/g-reasoner/<domain>/prediction.jsonl.
Run the evaluation entry points, e.g. bash run_retreival_evaluation.sh for retrieval evaluation and bash run_gen_evaluation.sh for generation evaluation.
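
The copy step for GraphRAG-Benchmark can be sketched as a small function. `stage_prediction` is a hypothetical helper, and the domain values (for example `medical` or `novel`) are assumptions based on the dataset names used in this guide; check the benchmark repo for the directory names it actually expects.

```shell
# Hypothetical helper: place a prediction file where the evaluator is expected to read it.
# Arguments: <prediction.jsonl> <path to GraphRAG-Benchmark clone> <domain>
stage_prediction() {
    local src="$1" repo="$2" domain="$3"
    local dest="${repo}/Datasets/output/g-reasoner/${domain}"
    # Refuse to continue if the source file is absent.
    [ -f "${src}" ] || { echo "no such file: ${src}" >&2; return 1; }
    # Create the target directory if needed, then copy under the fixed file name.
    mkdir -p "${dest}" && cp "${src}" "${dest}/prediction.jsonl"
}
```

A possible call is `stage_prediction path/to/prediction.jsonl GraphRAG-Benchmark medical`, where the first argument is the QA output from step 4.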

Notes¶

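Config path: All G-Reasoner scripts pass --config-path config/gfm_reasoner to select the G-Reasoner config family instead of the default GFM-RAG config. Hydra resolves this relative path against the directory of the workflow module that declares the entry point (gfmrag/workflow/), which is where config/gfm_reasoner/ lives.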
Pre-built indexes: The test set processed/stage1/ directories are included in the provided download. You do not need to run Stage 1 for test sets.
Released checkpoint: rmanluo/G-reasoner-34M can be loaded directly for Stage 2b retrieval evaluation and Stage 3b IRCoT inference, skipping Stage 2a training.
hotpotqa_test_v2: G-Reasoner validates on hotpotqa_test_v2 (not hotpotqa_test). Ensure you download and index this split.
TEXT_EMBEDDING_MODEL: Stage 1 requires a text embedding model. Set this env variable to a config name under gfmrag/workflow/config/text_emb_model/ (default: qwen3_8b). Unlike GFM-RAG, this replaces LLM NER/OpenIE, so no OpenAI API key is needed in Stage 1.
TRAIN_MODE: Required for Stage 2a fine-tuning. Check the gfm_reasoner config or paper appendix for valid values.
Graph splitting: For very large graphs, enable split_graph_training=true and/or split_graph_inference=true. Use split_graph_partition=metis (requires METIS library) or contiguous.
Stage 2b produces Stage 3a inputs: The QA script requires predictions_<data_name>.json from Stage 2b and the matching nodes.csv. The path outputs/qa_finetune/latest/ is a symlink to the most recent run.
HYDRA_FULL_ERROR=1: Set this variable to get full Python tracebacks instead of truncated Hydra error messages.
LLM API costs: Stage 3 calls the OpenAI API (GPT-4o-mini by default). Set llm.model_name_or_path to change the model.
For general usage outside these reproduction scripts, see the Workflow pages.