GFM-RAG Graph Index Configuration

This page documents the graph-index preset used by the GFM-RAG workflow family.

index_dataset.yaml

This preset is used by python -m gfmrag.workflow.index_dataset.

gfmrag/workflow/config/gfm_rag/index_dataset.yaml
hydra:
  run:
    dir: outputs/kg_construction/${now:%Y-%m-%d}/${now:%H-%M-%S} # Output directory
  searchpath:
    - pkg://gfmrag.workflow.config

defaults:
  - _self_
  - ner_model: llm_ner_model # The NER model to use
  - openie_model: llm_openie_model # The OpenIE model to use
  - el_model: colbert_el_model # The EL model to use
  - graph_constructor: kg_constructor
  - sft_constructor: gfm_rag_sft_constructor

dataset:
  root: ./data # data root directory
  data_name: hotpotqa # data name
  force: False # Whether to force recompute the dataset

Top-level Fields

| Parameter | Options | Note |
| --- | --- | --- |
| hydra.run.dir | outputs/kg_construction/${now:%Y-%m-%d}/${now:%H-%M-%S} | Directory used by Hydra for runtime logs and outputs. |
| hydra.searchpath | pkg://gfmrag.workflow.config | Adds the packaged workflow config directory to Hydra's search path. |
| defaults | List of config groups | Selects the shared component presets used by indexing. |
| dataset | Mapping | Controls the dataset root, dataset name, and whether to force recomputation. |
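
The hydra.run.dir value uses Hydra's now resolver, which formats the current time with a strftime pattern. The sketch below only emulates that substitution with the standard library to show what the resolved directory looks like; it is not Hydra's implementation, and resolve_run_dir is a hypothetical helper name.

```python
import re
from datetime import datetime

# Pattern copied from hydra.run.dir in the preset.
RUN_DIR_PATTERN = "outputs/kg_construction/${now:%Y-%m-%d}/${now:%H-%M-%S}"


def resolve_run_dir(pattern: str, now: datetime) -> str:
    """Replace each ${now:<strftime format>} placeholder with the formatted time.

    Hypothetical helper that mimics the effect of Hydra's `now` resolver.
    """
    return re.sub(
        r"\$\{now:([^}]+)\}",
        lambda match: now.strftime(match.group(1)),
        pattern,
    )


# A fixed timestamp keeps the example reproducible.
example = resolve_run_dir(RUN_DIR_PATTERN, datetime(2025, 1, 31, 9, 5, 7))
print(example)  # outputs/kg_construction/2025-01-31/09-05-07
```

Because the directory name includes the time down to the second, each run of the workflow normally writes its logs to a fresh directory.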

defaults Fields

| Parameter | Options | Note |
| --- | --- | --- |
| _self_ | Current file | Loads the local values in this preset. |
| ner_model | llm_ner_model by default | Named entity recognition preset used when building processed QA data. |
| openie_model | llm_openie_model by default | OpenIE preset used by the graph constructor. |
| el_model | colbert_el_model by default | Entity-linking preset used during graph and supervision construction. |
| graph_constructor | kg_constructor by default | Graph construction preset that builds the stage1 graph files. |
| sft_constructor | gfm_rag_sft_constructor by default | SFT constructor preset that prepares processed supervision files. |
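
Since the preset is a Hydra config, config groups and dataset keys can usually be changed from the command line with key=value overrides. The following is a minimal sketch that assembles such a command; build_index_command is a hypothetical helper, and the values shown are simply the defaults from the preset.

```python
import shlex
import sys


def build_index_command(group_choices: dict, dataset_overrides: dict) -> list:
    """Assemble an argv list for the indexing workflow with Hydra overrides.

    Hypothetical helper: group_choices maps a config group from `defaults`
    to a preset name, dataset_overrides maps keys under `dataset` to values.
    """
    cmd = [sys.executable, "-m", "gfmrag.workflow.index_dataset"]
    # Config group selection uses the form group=choice.
    cmd += [f"{group}={choice}" for group, choice in group_choices.items()]
    # Nested values use dotted keys, e.g. dataset.data_name=hotpotqa.
    cmd += [f"dataset.{key}={value}" for key, value in dataset_overrides.items()]
    return cmd


command = build_index_command(
    {"el_model": "colbert_el_model"},
    {"data_name": "hotpotqa", "force": True},
)
# Print the command for inspection; run it with subprocess.run(command) if desired.
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems if a dataset root contains spaces.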

dataset Fields

| Parameter | Options | Note |
| --- | --- | --- |
| root | Any valid data root | Root directory that contains the dataset folder. |
| data_name | Any dataset name | Dataset name under root. |
| force | True, False | Whether to rebuild the processed outputs even if cached files already exist. |
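
The three dataset fields interact as follows: root and data_name together locate the dataset folder, and force decides whether existing outputs are reused. The sketch below illustrates that logic under stated assumptions: dataset_dir and needs_rebuild are hypothetical helpers, and treating any non-empty output directory as "cached" is a simplification, since the actual files GFM-RAG checks are not listed on this page.

```python
import tempfile
from pathlib import Path


def dataset_dir(root: str, data_name: str) -> Path:
    """Return the dataset folder, i.e. <root>/<data_name>."""
    return Path(root) / data_name


def needs_rebuild(output_dir: Path, force: bool) -> bool:
    """Decide whether outputs should be recomputed.

    Assumption for illustration: a non-empty output directory counts as cached.
    """
    if force:
        return True  # force=True always recomputes
    return not (output_dir.is_dir() and any(output_dir.iterdir()))


# Usage with a temporary directory standing in for the data root.
with tempfile.TemporaryDirectory() as tmp_root:
    target = dataset_dir(tmp_root, "hotpotqa")
    first_run = needs_rebuild(target, force=False)   # nothing cached yet
    target.mkdir()
    (target / "placeholder.txt").write_text("cached output")
    second_run = needs_rebuild(target, force=False)  # cache is reused
    forced_run = needs_rebuild(target, force=True)   # cache is ignored

print(first_run, second_run, forced_run)
```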