KG-index Construction

This guide explains how to create a knowledge graph index and process QA data that can be used by GFM-RAG for training and testing.

Data Preparation¶

Please follow the instructions in the Data Preparation to prepare your dataset in the following structure:

Text Only

data_name/
├── raw/
│   ├── dataset_corpus.json
│   ├── train.json # (optional)
│   └── test.json # (optional)
└── processed/ # Output directory

Config¶

You need to create a KG-index configuration file.

gfmrag/workflow/config/stage1_index_dataset.yaml

hydra:
  run:
    dir: outputs/kg_construction/${now:%Y-%m-%d}/${now:%H-%M-%S} # Output directory

defaults:
  - _self_
  - ner_model: llm_ner_model # The NER model to use
  - openie_model: llm_openie_model # The OpenIE model to use
  - el_model: colbert_el_model # The EL model to use

dataset:
  root: ./data # data root directory
  data_name: hotpotqa # data name

kg_constructor:
  _target_: gfmrag.kg_construction.KGConstructor # The KGConstructor class
  open_ie_model: ${openie_model}
  ner_model: ${ner_model}
  el_model: ${el_model}
  root: tmp/kg_construction # Temporary directory for storing intermediate files during KG construction
  num_processes: 10 # Number of processes to use
  cosine_sim_edges: True # Whether to conduct entities resolution using cosine similarity
  threshold: 0.8 # Threshold for cosine similarity
  max_sim_neighbors: 100 # Maximum number of similar neighbors to add
  add_title: True # Whether to add the title to the content of the document during OpenIE
  force: False # Whether to force recompute the KG

qa_constructor:
  _target_: gfmrag.kg_construction.QAConstructor # The QAConstructor class
  root: tmp/qa_construction # Temporary directory for storing intermediate files during QA construction
  ner_model: ${ner_model}
  el_model: ${el_model}
  num_processes: 10 # Number of processes to use
  force: False # Whether to force recompute the QA data

Details of the configuration parameters are explained in the KG-index Configuration page.

Index dataset¶

This method performs two main tasks:

Creates and saves knowledge graph related files (kg.txt and document2entities.json) from the dataset_corpus.json file
Identify the query entities and supporting entities in training and testing data if available in the raw data directory

Files created:

kg.txt: Contains knowledge graph triples
document2entities.json: Maps documents to their entities
train.json: Processed training data (if raw exists)
test.json: Processed test data (if raw exists)

Directory structure:

Text Only

root/
└── data_name/
    ├── raw/
    │   ├── dataset_corpus.json
    │   ├── train.json (optional)
    │   └── test.json (optional)
    └── processed/
        └── stage1/
            ├── kg.txt
            ├── document2entities.json
            ├── train.json
            └── test.json

To index the data, run the following command:

gfmrag/workflow/stage1_index_dataset.py

import logging
import os

import dotenv
import hydra
from hydra.core.hydra_config import HydraConfig
from omegaconf import DictConfig, OmegaConf

from gfmrag import KGIndexer
from gfmrag.kg_construction import KGConstructor, QAConstructor

logger = logging.getLogger(__name__)

dotenv.load_dotenv()


@hydra.main(config_path="config", config_name="stage1_index_dataset", version_base=None)
def main(cfg: DictConfig) -> None:
    output_dir = HydraConfig.get().runtime.output_dir
    logger.info(f"Config:\n {OmegaConf.to_yaml(cfg)}")
    logger.info(f"Current working directory: {os.getcwd()}")
    logger.info(f"Output directory: {output_dir}")

    kg_constructor = KGConstructor.from_config(cfg.kg_constructor)
    qa_constructor = QAConstructor.from_config(cfg.qa_constructor)

    kg_indexer = KGIndexer(kg_constructor, qa_constructor)
    kg_indexer.index_data(cfg.dataset)


if __name__ == "__main__":
    main()

Bash

python -m gfmrag.workflow.stage1_index_dataset

You can overwrite the configuration like this:

Bash

python -m gfmrag.workflow.stage1_index_dataset kg_constructor.num_processes=5

Output Files¶

`kg.txt`¶

The kg.txt file contains the knowledge graph triples in the following format:

Text Only

subject,relation,object

Example:

Text Only

fred gehrke,was,american football player
fred gehrke,was,executive
fred gehrke,played for,cleveland   los angeles rams

`document2entities.json`¶

The document2entities.json file contains the mapping of documents to their entities in the following format:

key: The title or unique id of the document.
value: A list of entities in the document.

Example:

JSON

href="#__codelineno-8-1">{ "Fred Gehrke": [ "1977", "1981", "american football player", "chicago cardinals", "cleveland los angeles rams", "executive", "first painted logo on helmets", "fred gehrke", "gehrke", "general manager of denver broncos", "los angeles rams", "los angeles rams logo", "miami marlin christian yelich", "san francisco 49ers" ], "Manny Machado": [ "2010 major league baseball draft", "american", "baltimore orioles", "brito high school", "july 6 1992", "major league baseball", "manny machado", "right handed" ], }

`train.json` and `test.json`¶

The train.json and test.json files contain the processed training and test data in the following format:

id: A unique identifier for the example.
question: The question or query.
supporting_facts: A list of supporting facts for the question. Each supporting fact is a list containing the title of the document that can be found in the dataset_corpus.json file.
question_entities: A list of entities in the question.
supporting_entities: A list of entities in the supporting facts.
Additional fields copied from the raw data

Examples

JSON

[
    {
        "id": "5abc553a554299700f9d7871",
        "question": "Kyle Ezell is a professor at what School of Architecture building at Ohio State?",
        "answer": "Knowlton Hall",
        "supporting_facts": [
            "Knowlton Hall",
            "Kyle Ezell"
        ],
        "question_entities": [
            "kyle ezell",
            "architectural association school of architecture",
            "ohio state"
        ],
        "supporting_entities": [
            "10 million donation",
            "2004",
            "architecture",
            "austin e  knowlton",
            "austin e  knowlton school of architecture",
            "bachelor s in architectural engineering",
            "city and regional planning",
            "columbus  ohio  united states",
            "ives hall",
            "july 2002",
            "knowlton hall",
            "ksa",
            "landscape architecture",
            "ohio",
            "replacement for ives hall",
            "the ohio state university",
            "the ohio state university in 1931",
            "american urban planning practitioner",
            "expressing local culture",
            "knowlton school",
            "kyle ezell",
            "lawrenceburg  tennessee",
            "professor",
            "the ohio state university",
            "theorist",
            "undergraduate planning program",
            "vibrant downtowns",
            "writer"
        ]
    },
    ...
]

KG-index Construction

Data Preparation¶

Config¶

Index dataset¶

Output Files¶

kg.txt¶

document2entities.json¶

train.json and test.json¶

`kg.txt`¶

`document2entities.json`¶

`train.json` and `test.json`¶