Data Format¶

This page defines the current dataset layouts consumed by the repository.

What This Step Does¶

This page specifies the files that must exist before indexing, retrieval, QA, or training can run.

When You Need It¶

Read this page before:

building a new dataset from raw documents
reusing pre-built stage1 graph files
preparing training and evaluation examples

Supported Layouts¶

The framework supports two input layouts for each dataset:

Raw input files under raw/
Pre-built stage1 files under processed/stage1/

If only raw/ is provided, the indexing workflow will construct the graph files and processed QA files under processed/stage1/. If processed/stage1/ is already provided, the framework can consume those files directly without rebuilding stage1.
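The choice between the two layouts can be sketched as a small check on one dataset directory. This is a minimal illustration, not the framework's own logic: `detect_layout` and `GRAPH_FILES` are hypothetical names, and the sketch assumes pre-built stage1 files take precedence when both layouts are present.

```python
from pathlib import Path

# The three graph files that make up a pre-built stage1 layout.
GRAPH_FILES = ("nodes.csv", "relations.csv", "edges.csv")


def detect_layout(dataset_dir: Path) -> str:
    """Return 'stage1', 'raw', or 'missing' for one dataset directory.

    Hypothetical helper: it only inspects which files exist on disk.
    """
    stage1 = dataset_dir / "processed" / "stage1"
    if all((stage1 / name).is_file() for name in GRAPH_FILES):
        return "stage1"  # graph files can be consumed without rebuilding
    if (dataset_dir / "raw" / "documents.json").is_file():
        return "raw"  # stage1 still has to be built from the raw documents
    return "missing"
```

For example, `detect_layout(Path("root") / "data_name")` reports which of the two layouts is available before you start indexing.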

Option 1: Raw Input Layout¶

Text Only

root/
└── data_name/
    └── raw/
        ├── documents.json
        ├── train.json (optional)
        └── test.json (optional)

raw/documents.json is the raw graph source used to build stage1 graph files.
raw/train.json and raw/test.json are optional raw QA files.
When these QA files are provided, the workflow will generate processed/stage1/train.json and processed/stage1/test.json.

Option 2: Pre-built Stage1 Layout¶

Text Only

root/
└── data_name/
    └── processed/
        └── stage1/
            ├── nodes.csv
            ├── relations.csv
            ├── edges.csv
            ├── train.json (optional)
            └── test.json (optional)

processed/stage1/nodes.csv, relations.csv, and edges.csv are the graph files consumed by the framework.
processed/stage1/train.json and processed/stage1/test.json are the processed QA files consumed by the framework directly.

Graph Index File Structure¶

Pre-built stage1 graph data consists of three CSV files:

nodes.csv: Defines nodes and their attributes.
relations.csv: Defines relationships and their attributes.
edges.csv: Defines edges between nodes and their attributes.

`nodes.csv` File Format¶

Field	Type	Description
name	str	Node name
type	str	Node type, e.g., `entity`, `document`, or `summary`
attributes	dict	(Optional) Additional node attributes, stored as a JSON string

The attributes field is a JSON-formatted string used to store arbitrary structured attributes.

Example Content (nodes.csv):

Text Only

name,type,attributes
"Barack Obama","entity","{}"
"White House","entity","{}"
"Obama Biography","document","{""title"": ""The Life of Barack Obama"", ""published_year"": 2020}"

Text attributes for a document node:

Text Only

name: Obama Biography
type: document
title: The Life of Barack Obama
published_year: 2020

Text attributes for an entity node:

Text Only

name: Barack Obama
type: entity

`relations.csv` File Format¶

Field	Type	Description
name	str	Relation name
attributes	dict	(Optional) Additional relation attributes, stored as a JSON string

Example Content (relations.csv):

Text Only

name,attributes
lived_in,"{""description"": ""A person has a habitual presence in a specific location.""}"
mentioned_in,"{""description"": ""An entity is mentioned in the document""}"

Text attributes:

Text Only

name: lived_in
description: A person has a habitual presence in a specific location.

`edges.csv` File Format¶

Field	Type	Description
source	str	The `name` field of the source node
relation	str	The `name` field of the relation
target	str	The `name` field of the target node
attributes	dict	(Optional) Additional edge attributes, stored as a JSON string

source and target must appear in the name column of nodes.csv. relation must appear in the name column of relations.csv.

Example Content (edges.csv):

Text Only

source,relation,target,attributes
"Barack Obama","lived_in","White House","{""start_year"": 2009, ""end_year"": 2017}"
"Barack Obama","mentioned_in","Obama Biography","{}"

Text attributes:

Text Only

start_year: 2009
end_year: 2017
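The requirement that every edge references an existing node and relation can be checked before handing pre-built files to the framework. The sketch below is a standalone illustration; `column` and `dangling_edges` are hypothetical names and not part of the framework.

```python
import csv
import io


def column(csv_text: str, field: str) -> set[str]:
    """Collect the distinct values of one column from CSV text."""
    return {row[field] for row in csv.DictReader(io.StringIO(csv_text))}


def dangling_edges(nodes_csv: str, relations_csv: str, edges_csv: str) -> list[str]:
    """Return one message per edge field that references an unknown name."""
    nodes = column(nodes_csv, "name")
    relations = column(relations_csv, "name")
    problems = []
    # Data rows are numbered from 1, not counting the header row.
    for row_no, edge in enumerate(csv.DictReader(io.StringIO(edges_csv)), start=1):
        if edge["source"] not in nodes:
            problems.append(f"row {row_no}: unknown source {edge['source']!r}")
        if edge["target"] not in nodes:
            problems.append(f"row {row_no}: unknown target {edge['target']!r}")
        if edge["relation"] not in relations:
            problems.append(f"row {row_no}: unknown relation {edge['relation']!r}")
    return problems
```

An empty result means every `source`, `target`, and `relation` value resolves to a `name` in the corresponding file.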

Complete Example Graph Structure¶

Nodes (nodes.csv):

name	type	attributes
Barack Obama	entity	`{"birth_date": "1961-08-04", "nationality": "USA"}`
White House	entity	`{"location": "Washington, D.C."}`
Obama Biography	document	`{"title": "The Life of Barack Obama", "published_year": 2020}`
Summary_node_1	summary	`{"summary": "...", "title": "..."}`

Relations (relations.csv):

name	attributes
lived_in	`{"description": "A person has a habitual presence in a specific location."}`
mentioned_in	`{"description": "An entity is mentioned in the document"}`

Edges (edges.csv):

source	relation	target	attributes
Barack Obama	lived_in	White House	`{"start_year": 2009, "end_year": 2017}`
Barack Obama	mentioned_in	Obama Biography	`{}`
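When generating these files from Python, letting the `csv` and `json` modules handle quoting avoids hand-escaping the attributes column. The sketch below writes the edges table from this example to an in-memory buffer; `write_rows` is a hypothetical helper, and treating a missing `attributes` key as `{}` is an assumption.

```python
import csv
import io
import json


def write_rows(handle, fields, rows):
    """Write rows to CSV, serializing each 'attributes' dict as a JSON string."""
    writer = csv.DictWriter(handle, fieldnames=fields)
    writer.writeheader()
    for row in rows:
        # Assumption: a row without attributes is written as an empty object.
        encoded = dict(row, attributes=json.dumps(row.get("attributes", {})))
        writer.writerow(encoded)


edges = [
    {"source": "Barack Obama", "relation": "lived_in", "target": "White House",
     "attributes": {"start_year": 2009, "end_year": 2017}},
    {"source": "Barack Obama", "relation": "mentioned_in", "target": "Obama Biography"},
]

# For a real file, open it with newline="" as the csv module recommends.
buffer = io.StringIO()
write_rows(buffer, ["source", "relation", "target", "attributes"], edges)
print(buffer.getvalue())
```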

Raw QA Files: `raw/train.json` and `raw/test.json`¶

Raw QA files are optional inputs under raw/. When provided, the workflow processes them into stage1 QA files.

Field	Type	Description
id	str	A unique identifier for the example
question	str	The question or query
answer	str	(Optional) The ground-truth answer; recommended for evaluation
answer_aliases	list[str]	(Optional) Alternative acceptable answers
supporting_documents	list[str]	(Optional) Document names supporting the answer; useful for supervision
Additional fields	Any	Extra fields are preserved in the processed outputs when possible

Example (raw/test.json):

JSON

[
  {
    "id": "toy-1",
    "question": "What is the capital of France?",
    "answer": "Paris",
    "answer_aliases": ["City of Paris"],
    "supporting_documents": ["France", "Paris"]
  }
]
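A raw QA file can be sanity-checked against the field table above before indexing. This sketch treats `id` and `question` as required because they are the only fields not marked optional, and it flags repeated ids because `id` is described as unique. `check_raw_qa` is a hypothetical helper, not a framework function.

```python
import json

REQUIRED = {"id": str, "question": str}
OPTIONAL = {"answer": str, "answer_aliases": list, "supporting_documents": list}


def check_raw_qa(examples: list[dict]) -> list[str]:
    """Return human-readable problems found in a list of raw QA examples."""
    problems = []
    seen_ids = set()
    for index, example in enumerate(examples):
        for field, expected in REQUIRED.items():
            if not isinstance(example.get(field), expected):
                problems.append(f"example {index}: {field!r} is missing or not {expected.__name__}")
        for field, expected in OPTIONAL.items():
            if field in example and not isinstance(example[field], expected):
                problems.append(f"example {index}: {field!r} is not {expected.__name__}")
        ident = example.get("id")
        if isinstance(ident, str):
            if ident in seen_ids:
                problems.append(f"example {index}: duplicate id {ident!r}")
            seen_ids.add(ident)
    return problems


sample = json.loads("""
[{"id": "toy-1", "question": "What is the capital of France?", "answer": "Paris",
  "answer_aliases": ["City of Paris"], "supporting_documents": ["France", "Paris"]}]
""")
print(check_raw_qa(sample))
```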

Processed QA Files: `processed/stage1/train.json` and `processed/stage1/test.json`¶

These are the stage1 QA files consumed by the framework directly. They can either:

be generated automatically from raw/train.json and raw/test.json, or
be provided directly under processed/stage1/

Field	Type	Description
id	str	A unique identifier for the example
question	str	The question or query
start_nodes	dict[str, list]	Starting nodes grouped by type. Key: node type, Value: list of node names
target_nodes	dict[str, list]	Target nodes grouped by type. Key: node type, Value: list of node names
Additional fields	Any	Any extra fields copied from the raw data

Example (processed/stage1/test.json):

JSON

[
  {
    "id": "5abc553a554299700f9d7871",
    "question": "Kyle Ezell is a professor at what School of Architecture building at Ohio State?",
    "answer": "Knowlton Hall",
    "start_nodes": {
      "entity": [
        "kyle ezell",
        "architectural association school of architecture",
        "ohio state"
      ]
    },
    "target_nodes": {
      "document": [
        "Knowlton Hall",
        "Kyle Ezell"
      ],
      "entity": [
        "10 million donation",
        "2004",
        "architecture",
        "austin e  knowlton",
        "austin e  knowlton school of architecture",
        "bachelor s in architectural engineering",
        "city and regional planning",
        "columbus  ohio  united states",
        "ives hall",
        "july 2002",
        "knowlton hall",
        "ksa",
        "landscape architecture",
        "ohio",
        "replacement for ives hall",
        "the ohio state university"
      ]
    }
  }
]
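Because `start_nodes` and `target_nodes` group node names by type, a processed example can be cross-checked against the node table. The sketch below assumes each listed name should exist in nodes.csv with the matching type and that node names are unique. This page does not say whether the framework enforces either assumption, and `unknown_nodes` is a hypothetical helper.

```python
def unknown_nodes(example: dict, node_types: dict[str, str]) -> list[tuple[str, str, str]]:
    """List (field, node_type, name) triples that do not match the node table.

    node_types maps each node name to its type, as read from nodes.csv.
    """
    missing = []
    for field in ("start_nodes", "target_nodes"):
        for node_type, names in example.get(field, {}).items():
            for name in names:
                if node_types.get(name) != node_type:
                    missing.append((field, node_type, name))
    return missing


# Toy data for illustration only.
node_types = {"France": "document", "Paris": "document", "capital": "entity"}
example = {
    "id": "toy-1",
    "start_nodes": {"entity": ["capital"]},
    "target_nodes": {"document": ["France", "Paris"]},
}
print(unknown_nodes(example, node_types))
```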

Minimal Raw Example¶

`raw/documents.json`¶

JSON

{
  "France": "France is a country in Western Europe. Paris is its capital.",
  "Paris": "Paris is the capital and most populous city of France."
}

`raw/test.json`¶

JSON

[
  {
    "id": "toy-1",
    "question": "What is the capital of France?",
    "answer": "Paris",
    "answer_aliases": ["City of Paris"],
    "supporting_documents": ["France", "Paris"]
  }
]
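The two files above can be written into the raw layout with a few lines of Python. This is a sketch: `write_raw_dataset` is a hypothetical helper, and the directory names follow the `root/data_name/raw/` layout described earlier on this page.

```python
import json
from pathlib import Path


def write_raw_dataset(root: Path, data_name: str, documents: dict, test_examples=None) -> Path:
    """Create root/data_name/raw/ with documents.json and, optionally, test.json."""
    raw_dir = root / data_name / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    (raw_dir / "documents.json").write_text(
        json.dumps(documents, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    if test_examples is not None:  # raw QA files are optional
        (raw_dir / "test.json").write_text(
            json.dumps(test_examples, indent=2, ensure_ascii=False), encoding="utf-8"
        )
    return raw_dir
```

Calling `write_raw_dataset(Path("root"), "data_name", documents, test_examples)` with the dictionaries shown above produces the minimal raw example on disk.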

Minimal Processed Stage1 Example¶

`processed/stage1/nodes.csv`¶

Text Only

name,type,attributes
France,document,"{}"
Paris,document,"{}"
capital,entity,"{}"

`processed/stage1/relations.csv`¶

Text Only

name,attributes
mentions,"{}"

`processed/stage1/edges.csv`¶

Text Only

source,relation,target,attributes
France,mentions,capital,"{}"
Paris,mentions,capital,"{}"

`processed/stage1/test.json`¶

JSON

[
  {
    "id": "toy-1",
    "question": "What is the capital of France?",
    "answer": "Paris",
    "answer_aliases": ["City of Paris"],
    "supporting_documents": ["France", "Paris"],
    "start_nodes": {
      "entity": ["capital"]
    },
    "target_nodes": {
      "document": ["France", "Paris"]
    }
  }
]

Outputs Used By Later Steps¶

Index consumes raw/ or processed/stage1/
Retrieval and QA consume processed/stage1/ plus retrieval outputs
Training consumes processed stage1 data and dataset lists

Common Pitfalls¶

raw/documents.json must exist if you expect GFMRetriever.from_index(...) or index_dataset to build stage1 automatically.
nodes.csv, relations.csv, and edges.csv must stay consistent with each other when you provide pre-built stage1 files.
source and target in edges.csv must match the name column in nodes.csv; relation must match the name column in relations.csv.
Downstream QA examples often assume answer_aliases and supporting_documents are present, even though they are not mandatory for plain retrieval.
