Index

This page covers the current indexing entrypoint.

What This Step Does

python -m gfmrag.workflow.index_dataset builds the graph files and processed QA data needed by retrieval, QA, and training.

Run indexing when:

If you already have a complete processed/stage1/, you can skip this step.
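The skip decision above can be sketched as a small check. This is a minimal sketch, not part of gfmrag: the helper names `stage1_dir` and `looks_indexed` are hypothetical, and "directory exists and is non-empty" is only a stand-in for "complete", since this guide does not list the exact files a complete stage1 must contain.

```python
from pathlib import Path


def stage1_dir(root: str, data_name: str) -> Path:
    """Return <root>/<data_name>/processed/stage1, the layout described in this guide."""
    return Path(root) / data_name / "processed" / "stage1"


def looks_indexed(root: str, data_name: str) -> bool:
    """Return True if the stage1 directory exists and has at least one entry.

    Assumption: a non-empty directory approximates a complete stage1.
    This does not verify that every expected output file is present.
    """
    directory = stage1_dir(root, data_name)
    return directory.is_dir() and any(directory.iterdir())
```

For example, `looks_indexed("./data", "toy_raw")` returning False suggests indexing still has to run for that dataset.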

Indexing takes the following inputs:

dataset root and dataset name
the current indexing config family under gfmrag/workflow/config/gfm_rag/index_dataset.yaml
graph constructor and SFT constructor component configs

Indexing writes into data/<data_name>/processed/stage1/.

Typical outputs include:

Hydra also writes run metadata under outputs/kg_construction/<date>/<time>/.
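To inspect the metadata from the most recent run, one option is to pick the newest directory under outputs/kg_construction/. The sketch below assumes the <date> and <time> directory names sort chronologically when compared as text, as Hydra's default zero-padded timestamps do. `latest_run_dir` is a hypothetical helper, not a gfmrag function.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(outputs_root: str = "outputs/kg_construction") -> Optional[Path]:
    """Return the newest <date>/<time> run directory, or None if there is none.

    Assumption: directory names sort chronologically as plain text.
    """
    root = Path(outputs_root)
    if not root.is_dir():
        return None
    # Collect every <date>/<time> directory, then sort by path components.
    runs = sorted(
        time_dir
        for date_dir in root.iterdir() if date_dir.is_dir()
        for time_dir in date_dir.iterdir() if time_dir.is_dir()
    )
    return runs[-1] if runs else None
```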

```bash
python -m gfmrag.workflow.index_dataset \
  dataset.root=./data \
  dataset.data_name=toy_raw
```
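When indexing several datasets from a script, the same command can be assembled programmatically. The following is a minimal sketch: `build_index_command` and `run_indexing` are hypothetical helper names, and only the module path and the two overrides shown above are taken from this guide.

```python
import subprocess
import sys


def build_index_command(root: str, data_name: str) -> list[str]:
    """Assemble the indexing command as an argument list (no shell quoting needed)."""
    return [
        sys.executable,  # the current interpreter, in place of a bare "python"
        "-m",
        "gfmrag.workflow.index_dataset",
        f"dataset.root={root}",
        f"dataset.data_name={data_name}",
    ]


def run_indexing(root: str, data_name: str) -> None:
    """Run indexing; raises subprocess.CalledProcessError on a non-zero exit code."""
    subprocess.run(build_index_command(root, data_name), check=True)
```

Calling `run_indexing("./data", "toy_raw")` would then be equivalent to the command above, provided gfmrag is installed in the active environment.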

The default indexing entrypoint is configured under gfmrag/workflow/config/gfm_rag/index_dataset.yaml.

Related component config groups live under:

For parameter descriptions, use the Config overview and the workflow-specific graph-index pages instead of copying the full YAML into this guide.

Re-index when:

If raw/documents.json is missing, automatic stage1 construction cannot run.
The temporary constructor directories depend on the resolved config, so changing component configs will create new fingerprints.
Training and QA assume the stage1 files and processed samples were built from the same dataset version.
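To illustrate why a changed component config produces a new fingerprint, here is a generic sketch of config fingerprinting. It is not gfmrag's actual scheme: the function name, the choice of SHA-256, and the example config keys are all assumptions made for illustration.

```python
import hashlib
import json


def config_fingerprint(resolved_config: dict) -> str:
    """Hash a resolved config into a stable hex digest.

    Keys are sorted before hashing, so key order does not matter,
    but any changed value yields a different digest.
    """
    payload = json.dumps(resolved_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical component configs, for illustration only.
base = {"graph_constructor": {"threshold": 0.8}, "sft_constructor": {"top_k": 5}}
changed = {"graph_constructor": {"threshold": 0.9}, "sft_constructor": {"top_k": 5}}
print(config_fingerprint(base) == config_fingerprint(changed))
```

Under a scheme like this, editing a single value is enough for the constructor to treat the run as new and write to a different temporary directory.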