Index¶
This page covers the current indexing entrypoint.
What This Step Does¶
python -m gfmrag.workflow.index_dataset builds the graph files and processed QA data needed by retrieval, QA, and training.
When You Need It¶
Run indexing when:
- you only have
raw/files - you changed the graph-construction setup
- you want to regenerate stage1 files with a different constructor or model stack
If you already have a complete processed/stage1/, you can skip this step.
Inputs¶
- dataset root and dataset name
- the current indexing config family under
gfmrag/workflow/config/gfm_rag/index_dataset.yaml - graph constructor and SFT constructor component configs
Outputs¶
Indexing writes into data/<data_name>/processed/stage1/.
Typical outputs include:
nodes.csvrelations.csvedges.csv- processed
train.json - processed
test.json
Hydra also writes run metadata under outputs/kg_construction/<date>/<time>/.
Minimal Example¶
Current Config Family¶
The default indexing entrypoint is configured under:
Related component config groups live under:
gfmrag/workflow/config/ner_model/gfmrag/workflow/config/openie_model/gfmrag/workflow/config/el_model/gfmrag/workflow/config/graph_constructor/gfmrag/workflow/config/sft_constructor/
For parameter descriptions, use the Config overview and the workflow-specific graph-index pages instead of copying the full YAML into this guide.
When To Re-index¶
Re-index when:
- raw documents changed
- the graph constructor changed
- the entity-linking or OpenIE setup changed
- you want to rebuild processed QA files from updated raw train/test files
Common Pitfalls¶
- If
raw/documents.jsonis missing, automatic stage1 construction cannot run. - The temporary constructor directories depend on the resolved config, so changing component configs will create new fingerprints.
- Training and QA assume the stage1 files and processed samples match the same dataset version.