GFM-RAG Reproduction

Compatibility Notice

The original GFM-RAG paper results were produced with the v1.0.0 codebase. While we have made every effort to maintain backward compatibility, if you need exact reproduction, please use the gfm-rag branch.

Goal

Reproduce the paper results for GFM-RAG using the maintained shell scripts under scripts/gfm-rag/. The workflow consists of three stages: graph indexing, GFM training (pre-training + fine-tuning), and QA evaluation.

Prerequisites

  • Environment setup from Install
  • OpenAI API key set in the environment (OPENAI_API_KEY) — used by the NER and OpenIE models during indexing
  • Datasets placed under data/ (see Data Format)
  • Sufficient GPU resources: the Stage 2 scripts default to 8 GPUs
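
A quick pre-flight check can catch a missing key or data directory before a long indexing run. The sketch below uses a hypothetical helper, check_prereqs, which is not part of the repository. It only tests that OPENAI_API_KEY is non-empty and that the data root exists; it does not validate the key against the API or inspect the datasets.

```shell
# Hypothetical pre-flight helper; not part of the GFM-RAG scripts.
# Usage: check_prereqs <data_root>
# Prints one line per problem and returns non-zero if any was found.
check_prereqs() {
    data_root="$1"
    status=0
    # Stage 1 indexing reads the key from the environment.
    if [ -z "${OPENAI_API_KEY:-}" ]; then
        echo "OPENAI_API_KEY is not set"
        status=1
    fi
    # Datasets are expected under the data root (data/ in this guide).
    if [ ! -d "$data_root" ]; then
        echo "data directory not found: $data_root"
        status=1
    fi
    return "$status"
}
```

One way to use it is `check_prereqs data || exit 1` at the top of a wrapper script.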

Dataset Download

Download the testing split and full training data from OneDrive and place them under the data/ directory:

Dataset Format

If you are on the v2.0.0 branch, you need to download the GFM-RAG_training_new_graph_format.zip file.

Text Only
data/
├── 2wikimultihopqa_test/
│   ├── processed/stage1/    # Pre-built graph index (provided)
│   └── raw/
├── hotpotqa_test/
│   ├── processed/stage1/    # Pre-built graph index (provided)
│   └── raw/
├── hotpotqa_train_example/
│   ├── processed/stage1/
│   └── raw/
└── musique_test/
    ├── processed/stage1/    # Pre-built graph index (provided)
    └── raw/

The processed/stage1/ directories for the test sets are pre-built and provided in the download. You only need to run Stage 1 if you are indexing training data or a custom dataset.

Scripts To Run

| Stage | Script | Purpose |
|-------|--------|---------|
| 1 | scripts/gfm-rag/stage1_data_index.sh | Build graph index from raw documents |
| 2a | scripts/gfm-rag/stage2_pretrain.sh | Unsupervised KG pre-training (optional) |
| 2b | scripts/gfm-rag/stage2_finetune.sh | Supervised fine-tuning + retrieval evaluation |
| 3a | scripts/gfm-rag/stage3_qa_inference.sh | Batch QA from saved retrieval results |
| 3b | scripts/gfm-rag/stage3_qa_ircot_inference.sh | Multi-step IRCoT QA reasoning |

Stage 1: Build Graph Index

This stage extracts entities and relations from raw documents using LLM-based NER and OpenIE, then constructs a knowledge graph.

Output: data/<data_name>/processed/stage1/ containing nodes.csv, relations.csv, edges.csv, train.json, test.json.
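
To confirm that an index is complete, it may help to check for the files listed above. The following sketch assumes exactly that file list. The helper name missing_stage1_files is hypothetical, and a dataset that legitimately lacks one split (for example a test-only set) will be reported as missing that file.

```shell
# Hypothetical helper; not part of the GFM-RAG scripts.
# Usage: missing_stage1_files <data_root> <data_name>
# Prints the name of each expected Stage 1 file that is absent.
missing_stage1_files() {
    stage1_dir="$1/$2/processed/stage1"
    for f in nodes.csv relations.csv edges.csv train.json test.json; do
        if [ ! -f "$stage1_dir/$f" ]; then
            echo "$f"
        fi
    done
}
```

For example, `missing_stage1_files data hotpotqa_test` prints nothing when all five files are present.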

Index Test Datasets

Bash
bash scripts/gfm-rag/stage1_data_index.sh

The script runs indexing for all three test sets and all training data shards (20 shards × 3 datasets):

Bash
# Test datasets (single GPU, no training data needed)
N_GPU=1
DATA_ROOT="data"
DATA_NAME_LIST="hotpotqa_test 2wikimultihopqa_test musique_test"
for DATA_NAME in ${DATA_NAME_LIST}; do
   HYDRA_FULL_ERROR=1 python -m gfmrag.workflow.index_dataset \
   dataset.root=${DATA_ROOT} \
   dataset.data_name=${DATA_NAME}
done

# Training data shards (hotpotqa_train0 … hotpotqa_train19, etc.)
DATA_NAME_LIST="hotpotqa_train musique_train 2wikimultihopqa_train"
START_N=0
END_N=19
for i in $(seq ${START_N} ${END_N}); do
   for DATA_NAME in ${DATA_NAME_LIST}; do
      HYDRA_FULL_ERROR=1 python -m gfmrag.workflow.index_dataset \
      dataset.root=${DATA_ROOT} \
      dataset.data_name=${DATA_NAME}${i}
   done
done

The pre-built test set indexes are included in the download — you can skip indexing test sets and go directly to Stage 2 fine-tuning or Stage 3 evaluation.

Default Indexing Components

The default config uses:

  • NER: LLM-based (llm_ner_model, GPT-4o-mini via OpenAI API)
  • Entity Linking: ColBERT with Qdrant vector DB backend (colbert_el_model)
  • OpenIE: LLM-based (llm_openie_model, GPT-4o-mini)
  • Graph Constructor: KG constructor (kg_constructor)

To override a component:

Bash
python -m gfmrag.workflow.index_dataset \
    dataset.root=data \
    dataset.data_name=hotpotqa_test \
    ner_model=llm_ner_model \
    el_model=colbert_el_model

Stage 2a: KG Pre-training (Optional)

Unsupervised pre-training on the constructed knowledge graphs. Skip this step if you are loading the released checkpoint (rmanluo/GFM-RAG-8M).

Bash
bash scripts/gfm-rag/stage2_pretrain.sh

Equivalent command (8 GPUs, 1 epoch, 30,000 steps per epoch):

Bash
N_GPU=8
DATA_ROOT="data"
# Builds a comma-separated list: hotpotqa_train0,...,2wikimultihopqa_train19
HYDRA_FULL_ERROR=1 torchrun --nproc_per_node=${N_GPU} -m gfmrag.workflow.kgc_training \
    datasets.train_names=[${TRAIN_DATA_NAME_LIST}] \
    datasets.cfgs.root=${DATA_ROOT} \
    trainer.fast_test=5000 \
    trainer.args.num_epoch=1 \
    trainer.args.max_steps_per_epoch=30000 \
    trainer.args.train_batch_size=4
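
The command above uses TRAIN_DATA_NAME_LIST without showing how it is built. A minimal sketch follows, assuming the same dataset names and shard range (0 to 19) as the Stage 1 loop. The ordering in the maintained script may differ, so treat this as an illustration rather than a copy of that script.

```shell
# Sketch: build the comma-separated shard list
# hotpotqa_train0,...,2wikimultihopqa_train19.
# Names and range are assumptions taken from the Stage 1 loop.
DATA_NAME_LIST="hotpotqa_train musique_train 2wikimultihopqa_train"
START_N=0
END_N=19
TRAIN_DATA_NAME_LIST=""
i=${START_N}
while [ "$i" -le "${END_N}" ]; do
    for DATA_NAME in ${DATA_NAME_LIST}; do
        # Put a comma before every entry except the first.
        TRAIN_DATA_NAME_LIST="${TRAIN_DATA_NAME_LIST:+${TRAIN_DATA_NAME_LIST},}${DATA_NAME}${i}"
    done
    i=$((i + 1))
done
```

The resulting variable holds 60 names and can be passed as datasets.train_names=[${TRAIN_DATA_NAME_LIST}].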

Output: Model checkpoint under outputs/kg_pretrain/<date>/<time>/.


Stage 2b: Supervised Fine-tuning + Retrieval Evaluation

Fine-tuning

Fine-tune on the QA datasets. This is required to reproduce the reported retrieval numbers.

Bash
bash scripts/gfm-rag/stage2_finetune.sh

Equivalent command (8 GPUs, 15 epochs):

Bash
N_GPU=8
DATA_ROOT="data"
HYDRA_FULL_ERROR=1 torchrun --nproc_per_node=${N_GPU} -m gfmrag.workflow.sft_training \
    datasets.train_names=[${TRAIN_DATA_NAME_LIST}] \
    datasets.cfgs.root=${DATA_ROOT} \
    trainer.args.num_epoch=15 \
    trainer.args.train_batch_size=4

Output: Checkpoint and per-dataset retrieval predictions under outputs/qa_finetune/<date>/<time>/.

Retrieval Evaluation Only

To evaluate a pre-trained checkpoint (e.g., rmanluo/GFM-RAG-8M) without training:

Bash
N_GPU=4
DATA_ROOT="data"
CHECKPOINT="rmanluo/GFM-RAG-8M"  # or path to local checkpoint
HYDRA_FULL_ERROR=1 torchrun --nproc_per_node=${N_GPU} -m gfmrag.workflow.sft_training \
    load_model_from_pretrained=${CHECKPOINT} \
    datasets.cfgs.root=${DATA_ROOT} \
    datasets.train_names=[] \
    +trainer.args.eval_batch_size=1 \
    trainer.args.do_train=false \
    trainer.args.do_eval=true \
    trainer.args.do_predict=true

Output: predictions_<data_name>.json files under outputs/qa_finetune/<date>/<time>/.
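
Because each run writes to a new <date>/<time> directory, it can be handy to locate the most recent one when the latest symlink mentioned in the Notes is not available. The sketch below uses a hypothetical helper, latest_run_dir. It assumes the date and time directory names sort chronologically (for example zero-padded YYYY-MM-DD and HH-MM-SS); the actual naming is not stated here.

```shell
# Hypothetical helper; not part of the GFM-RAG scripts.
# Usage: latest_run_dir <base_dir>
# Prints the <base_dir>/<date>/<time> directory that sorts last.
# Returns non-zero when no run directory is found.
latest_run_dir() {
    latest=""
    for d in "$1"/*/*/; do
        # Skip the unexpanded pattern when nothing matches.
        [ -d "$d" ] || continue
        # Ignore anything reached through a "latest" symlink.
        case "$d" in "$1"/latest/*) continue ;; esac
        # Glob results are sorted, so the last match wins.
        latest="${d%/}"
    done
    [ -n "$latest" ] && echo "$latest"
}
```

For example, `RUN_DIR=$(latest_run_dir outputs/qa_finetune)` followed by `ls "$RUN_DIR"` lists the predictions files of the newest run.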


Stage 3a: Single-Step QA Reasoning

Reads the retrieval predictions from Stage 2 and generates answers with an LLM in a single step.

Bash
bash scripts/gfm-rag/stage3_qa_inference.sh

The script expects two input paths: the retrieval predictions produced by Stage 2 and the nodes.csv file from the Stage 1 index:

Bash
DATA_NAME="hotpotqa"   # hotpotqa | musique | 2wikimultihopqa
LLM="gpt-4o-mini"
RETRIEVED_RESULT_PATH="outputs/qa_finetune/latest/predictions_${DATA_NAME}_test.json"
NODE_PATH="data/${DATA_NAME}_test/processed/stage1/nodes.csv"

HYDRA_FULL_ERROR=1 python -m gfmrag.workflow.qa \
    qa_prompt=${DATA_NAME} \
    qa_evaluator=${DATA_NAME} \
    llm.model_name_or_path=${LLM} \
    test.n_threads=10 \
    test.top_k=5 \
    test.target_types=[document] \
    test.retrieved_result_path=${RETRIEVED_RESULT_PATH} \
    test.node_path=${NODE_PATH}

Per-dataset Commands

HotpotQA

Bash
python -m gfmrag.workflow.qa \
    qa_prompt=hotpotqa qa_evaluator=hotpotqa \
    test.retrieved_result_path=outputs/qa_finetune/latest/predictions_hotpotqa_test.json \
    test.node_path=data/hotpotqa_test/processed/stage1/nodes.csv

MuSiQue

Bash
python -m gfmrag.workflow.qa \
    qa_prompt=musique qa_evaluator=musique \
    test.retrieved_result_path=outputs/qa_finetune/latest/predictions_musique_test.json \
    test.node_path=data/musique_test/processed/stage1/nodes.csv

2WikiMultihopQA

Bash
python -m gfmrag.workflow.qa \
    qa_prompt=2wikimultihopqa qa_evaluator=2wikimultihopqa \
    test.retrieved_result_path=outputs/qa_finetune/latest/predictions_2wikimultihopqa_test.json \
    test.node_path=data/2wikimultihopqa_test/processed/stage1/nodes.csv

Output: outputs/qa_inference/<date>/<time>/prediction.jsonl
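
The three per-dataset commands differ only in the dataset name, so they can be generated in a loop. The sketch below prints the commands instead of running them. The helper name print_stage3a_commands is hypothetical, and the predictions_<name>_test.json naming is taken from the commands above.

```shell
# Sketch: print (not run) the Stage 3a command for each test set.
# print_stage3a_commands is a hypothetical helper; its argument is the
# Stage 2 run directory that holds the predictions_*.json files.
print_stage3a_commands() {
    run_dir="$1"
    for name in hotpotqa musique 2wikimultihopqa; do
        echo python -m gfmrag.workflow.qa \
            "qa_prompt=${name}" "qa_evaluator=${name}" \
            "test.retrieved_result_path=${run_dir}/predictions_${name}_test.json" \
            "test.node_path=data/${name}_test/processed/stage1/nodes.csv"
    done
}

print_stage3a_commands outputs/qa_finetune/latest
```

Review the printed lines first; piping the output to sh would then run the three evaluations in sequence.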


Stage 3b: Multi-Step IRCoT QA Reasoning

Runs iterative retrieval and reasoning (IRCoT) using the GFM-RAG retriever and an LLM agent. Unlike Stage 3a, this stage performs retrieval online and does not require pre-computed retrieval results.

Bash
bash scripts/gfm-rag/stage3_qa_ircot_inference.sh

Equivalent command:

Bash
DATA_NAME="hotpotqa"   # hotpotqa | musique | 2wikimultihopqa
LLM="gpt-4o-mini"
MODEL_PATH="rmanluo/GFM-RAG-8M"
MAX_STEPS=3

python -m gfmrag.workflow.qa_ircot_inference \
    dataset.root=data \
    dataset.data_name=${DATA_NAME}_test \
    graph_retriever.model_path=${MODEL_PATH} \
    llm.model_name_or_path=${LLM} \
    qa_prompt=${DATA_NAME} \
    qa_evaluator=${DATA_NAME} \
    agent_prompt=${DATA_NAME}_ircot \
    test.max_steps=${MAX_STEPS} \
    test.max_test_samples=-1

Per-dataset Commands

HotpotQA (2 reasoning steps)

Bash
python -m gfmrag.workflow.qa_ircot_inference \
    qa_prompt=hotpotqa qa_evaluator=hotpotqa \
    agent_prompt=hotpotqa_ircot \
    dataset.data_name=hotpotqa_test \
    test.max_steps=2

MuSiQue (4 reasoning steps)

Bash
python -m gfmrag.workflow.qa_ircot_inference \
    qa_prompt=musique qa_evaluator=musique \
    agent_prompt=musique_ircot \
    dataset.data_name=musique_test \
    test.max_steps=4

2WikiMultihopQA (2 reasoning steps)

Bash
python -m gfmrag.workflow.qa_ircot_inference \
    qa_prompt=2wikimultihopqa qa_evaluator=2wikimultihopqa \
    agent_prompt=2wikimultihopqa_ircot \
    dataset.data_name=2wikimultihopqa_test \
    test.max_steps=2

Output: outputs/qa_agent_inference/<data_name>/<date>/<time>/prediction.jsonl
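
The per-dataset step counts above (2 for HotpotQA and 2WikiMultihopQA, 4 for MuSiQue) can be kept in one place with a small lookup. This is a sketch; max_steps_for is a hypothetical helper, and the loop only prints the commands as a dry run.

```shell
# Hypothetical helper mirroring the per-dataset step counts listed above.
max_steps_for() {
    case "$1" in
        hotpotqa|2wikimultihopqa) echo 2 ;;
        musique) echo 4 ;;
        *) echo "unknown dataset: $1" >&2; return 1 ;;
    esac
}

for DATA_NAME in hotpotqa musique 2wikimultihopqa; do
    MAX_STEPS=$(max_steps_for "${DATA_NAME}")
    # Dry run: remove the leading echo to execute the command.
    echo python -m gfmrag.workflow.qa_ircot_inference \
        "qa_prompt=${DATA_NAME}" "qa_evaluator=${DATA_NAME}" \
        "agent_prompt=${DATA_NAME}_ircot" \
        "dataset.data_name=${DATA_NAME}_test" \
        "test.max_steps=${MAX_STEPS}"
done
```

An unknown dataset name makes the helper return non-zero, so a typo is caught before any LLM call is made.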


Path Interpretation / Visualization

Visualize the reasoning paths found by the GFM retriever over the knowledge graph:

Bash
python -m gfmrag.workflow.experiments.visualize_path \
    dataset.data_name=hotpotqa_test

Expected Outputs Summary

| Stage | Output Location | Contents |
|-------|-----------------|----------|
| Stage 1 | data/<data_name>/processed/stage1/ | nodes.csv, relations.csv, edges.csv, train.json, test.json |
| Stage 2 (train) | outputs/qa_finetune/<date>/<time>/ | Model checkpoints, training logs |
| Stage 2 (eval) | outputs/qa_finetune/<date>/<time>/ | predictions_<data_name>.json per dataset |
| Stage 3a | outputs/qa_inference/<date>/<time>/ | prediction.jsonl with answers and scores |
| Stage 3b | outputs/qa_agent_inference/<data_name>/<date>/<time>/ | prediction.jsonl with answers and scores |

Notes

  • Pre-built indexes: The test set processed/stage1/ directories are included in the provided download. You do not need to run Stage 1 for test sets.
  • Released checkpoint: rmanluo/GFM-RAG-8M can be loaded directly for retrieval evaluation and IRCoT inference, skipping Stage 2 training.
  • Stage 2 produces Stage 3a inputs: The QA script requires predictions_<data_name>.json from Stage 2 and a matching nodes.csv from processed/stage1/. The path outputs/qa_finetune/latest/ is a symlink to the most recent run.
  • HYDRA_FULL_ERROR=1: Set this variable to get full Python tracebacks instead of truncated Hydra error messages, which is useful for debugging.
  • Multi-GPU training: Stage 2 scripts use torchrun. Adjust --nproc_per_node to match your GPU count. Stage 3 scripts use a single process.
  • LLM API costs: Stage 1 and Stage 3 call the OpenAI API (GPT-4o-mini by default). In the Stage 3 commands, set llm.model_name_or_path to change the model.
  • For general usage outside these reproduction scripts, see the Workflow pages.
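
For the multi-GPU note above, the process count can be derived from the machine instead of being hard-coded. The sketch below uses a hypothetical helper, detect_n_gpu, which counts the devices listed by `nvidia-smi -L`. Falling back to 1 when no GPU is reported is an assumption made for illustration and may not suit every setup.

```shell
# Hypothetical helper; not part of the GFM-RAG scripts.
# Prints the number of GPUs reported by `nvidia-smi -L`,
# or 1 when nvidia-smi is unavailable or reports none.
detect_n_gpu() {
    n=0
    if command -v nvidia-smi >/dev/null 2>&1; then
        # `nvidia-smi -L` prints one "GPU <index>: ..." line per device.
        n=$(nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true)
    fi
    if [ "${n:-0}" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}
```

A Stage 2 command could then start with `N_GPU=$(detect_n_gpu)` and pass `--nproc_per_node=${N_GPU}` to torchrun.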