Data Format¶
This page defines the current dataset layouts consumed by the repository.
What This Step Does¶
It specifies the files you need before indexing, retrieval, QA, or training can run.
When You Need It¶
Read this page before:
- building a new dataset from raw documents
- reusing pre-built stage1 graph files
- preparing training and evaluation examples
Supported Layouts¶
The framework supports two input layouts for each dataset:
- Raw input files under
raw/ - Pre-built stage1 files under
processed/stage1/
If only raw/ is provided, the indexing workflow will construct the graph files and processed QA files under processed/stage1/.
If processed/stage1/ is already provided, the framework can consume those files directly without rebuilding stage1.
Option 1: Raw Input Layout¶
root/
└── data_name/
└── raw/
├── documents.json
├── train.json (optional)
└── test.json (optional)
raw/documents.jsonis the raw graph source used to build stage1 graph files.raw/train.jsonandraw/test.jsonare optional raw QA files.- When these QA files are provided, the workflow will generate
processed/stage1/train.jsonandprocessed/stage1/test.json.
Option 2: Pre-built Stage1 Layout¶
root/
└── data_name/
└── processed/
└── stage1/
├── nodes.csv
├── relations.csv
├── edges.csv
├── train.json (optional)
└── test.json (optional)
processed/stage1/nodes.csv,relations.csv, andedges.csvare the graph files consumed by the framework.processed/stage1/train.jsonandprocessed/stage1/test.jsonare the processed QA files consumed by the framework directly.
Graph Index File Structure¶
When using pre-built stage1 data, graph data consists of three CSV files:
nodes.csv: Defines nodes and their attributes.relations.csv: Defines relationships and their attributes.edges.csv: Defines edges between nodes and their attributes.
nodes.csv File Format¶
| Field | Type | Description |
|---|---|---|
| name | str | Node name |
| type | str | Node type, e.g., entity, document, or summary |
| attributes | dict | (Optional) Additional node attributes, stored as a JSON string |
The
attributesfield is a JSON-formatted string used to store arbitrary structured attributes.
Example Content (nodes.csv):
name,type,attributes
"Barack Obama","entity","{}"
"White House","entity","{}"
"Obama Biography","document","{'title': 'The Life of Barack Obama', 'published_year': 2020}"
Text attributes for a document node:
Text attributes for an entity node:
relations.csv File Format¶
| Field | Type | Description |
|---|---|---|
| name | str | Relation name |
| attributes | dict | (Optional) Additional relation attributes, stored as a JSON string |
Example Content (relations.csv):
name,attributes
lived_in,"{'description': 'A person has a habitual presence in a specific location.'}"
mentioned_in,"{'description': 'An entity is mentioned in the document'}"
Text attributes:
edges.csv File Format¶
| Field | Type | Description |
|---|---|---|
| source | str | The name field of the source node |
| relation | str | The name field of the relation |
| target | str | The name field of the target node |
| attributes | dict | (Optional) Additional edge attributes, stored as a JSON string |
sourceandtargetmust appear in thenamecolumn ofnodes.csv.relationmust appear in thenamecolumn ofrelations.csv.
Example Content (edges.csv):
source,relation,target,attributes
"Barack Obama","lived_in","White House","{'start_year': 2009, 'end_year': 2017}"
"Barack Obama","mentioned_in","Obama Biography",{}
Text attributes:
Complete Example Graph Structure¶
Nodes (nodes.csv):
| name | type | attributes |
|---|---|---|
| Barack Obama | entity | {"birth_date": "1961-08-04", "nationality": "USA"} |
| White House | entity | {"location": "Washington, D.C."} |
| Obama Biography | document | {"title": "The Life of Barack Obama", "published_year": 2020} |
| Summary_node_1 | summary | {"summary": "...", "title": "..."} |
Relations (relations.csv):
| name | attributes |
|---|---|
| lived_in | {"description": "A person has a habitual presence in a specific location."} |
| mentioned_in | {"description": "An entity is mentioned in the document"} |
Edges (edges.csv):
| source | relation | target | attributes |
|---|---|---|---|
| Barack Obama | lived_in | White House | {"start_year": 2009, "end_year": 2017} |
| Barack Obama | mentioned_in | Obama Biography | {} |
Raw QA Files: raw/train.json and raw/test.json¶
Raw QA files are optional inputs under raw/. When provided, the workflow processes them into stage1 QA files.
| Field | Type | Description |
|---|---|---|
| id | str | A unique identifier for the example |
| question | str | The question or query |
| answer | str | (Optional) The ground-truth answer; recommended for evaluation |
| answer_aliases | list[str] | (Optional) Alternative acceptable answers |
| supporting_documents | list[str] | (Optional) Document names supporting the answer; useful for supervision |
| Additional fields | Any | Any extra fields are preserved into processed outputs when possible |
Example (raw/test.json):
[
{
"id": "toy-1",
"question": "What is the capital of France?",
"answer": "Paris",
"answer_aliases": ["City of Paris"],
"supporting_documents": ["France", "Paris"]
}
]
Processed QA Files: processed/stage1/train.json and processed/stage1/test.json¶
These are the stage1 QA files consumed by the framework directly. They can either:
- be generated automatically from
raw/train.jsonandraw/test.json, or - be provided directly under
processed/stage1/
| Field | Type | Description |
|---|---|---|
| id | str | A unique identifier for the example |
| question | str | The question or query |
| start_nodes | dict[str, list] | Starting nodes grouped by type. Key: node type, Value: list of node names |
| target_nodes | dict[str, list] | Target nodes grouped by type. Key: node type, Value: list of node names |
| Additional fields | Any | Any extra fields copied from the raw data |
Example (processed/stage1/test.json):
[
{
"id": "5abc553a554299700f9d7871",
"question": "Kyle Ezell is a professor at what School of Architecture building at Ohio State?",
"answer": "Knowlton Hall",
"start_nodes": {
"entity": [
"kyle ezell",
"architectural association school of architecture",
"ohio state"
]
},
"target_nodes": {
"document": [
"Knowlton Hall",
"Kyle Ezell"
],
"entity": [
"10 million donation",
"2004",
"architecture",
"austin e knowlton",
"austin e knowlton school of architecture",
"bachelor s in architectural engineering",
"city and regional planning",
"columbus ohio united states",
"ives hall",
"july 2002",
"knowlton hall",
"ksa",
"landscape architecture",
"ohio",
"replacement for ives hall",
"the ohio state university"
]
}
}
]
Minimal Raw Example¶
raw/documents.json¶
{
"France": "France is a country in Western Europe. Paris is its capital.",
"Paris": "Paris is the capital and most populous city of France."
}
raw/test.json¶
[
{
"id": "toy-1",
"question": "What is the capital of France?",
"answer": "Paris",
"answer_aliases": ["City of Paris"],
"supporting_documents": ["France", "Paris"]
}
]
Minimal Processed Stage1 Example¶
processed/stage1/nodes.csv¶
processed/stage1/relations.csv¶
processed/stage1/edges.csv¶
processed/stage1/test.json¶
[
{
"id": "toy-1",
"question": "What is the capital of France?",
"answer": "Paris",
"answer_aliases": ["City of Paris"],
"supporting_documents": ["France", "Paris"],
"start_nodes": {
"entity": ["capital"]
},
"target_nodes": {
"document": ["France", "Paris"]
}
}
]
Outputs Used By Later Steps¶
- Index consumes
raw/orprocessed/stage1/ - Retrieval and QA consumes
processed/stage1/plus retrieval outputs - Training consumes processed stage1 data and dataset lists
Common Pitfalls¶
raw/documents.jsonmust exist if you expectGFMRetriever.from_index(...)orindex_datasetto build stage1 automatically.nodes.csv,relations.csv, andedges.csvmust stay consistent with each other when you provide pre-built stage1 files.sourceandtargetinedges.csvmust match thenamecolumn innodes.csv;relationmust match thenamecolumn inrelations.csv.- Downstream QA examples often assume
answer_aliasesandsupporting_documentsare present, even though they are not mandatory for plain retrieval.