Weights & Biases Integration

This document explains how to use the Weights & Biases (wandb) integration in GFM-RAG for tracking training experiments.

Overview

GFM-RAG includes a comprehensive wandb integration that automatically logs:

  • Training Loss: Real-time loss values during training
  • Evaluation Metrics: Validation and test results
  • Model Configuration: All hyperparameters and settings
  • Model Checkpoints: Best model and epoch-wise checkpoints as artifacts
  • System Information: Hardware and environment details

Quick Start

1. Install and Setup Wandb

Bash
# wandb is already included in the dependencies
# Log in to your wandb account
wandb login

2. Basic Usage

The integration is enabled by default. Simply run your training as usual:

Bash
# Pre-training with wandb logging
python -m gfmrag.workflow.kgc_training --config-path config/gfm_rag

# Fine-tuning with wandb logging
python -m gfmrag.workflow.sft_training --config-path config/gfm_rag

3. Configure Wandb Settings

You can customize wandb settings in the configuration files:

Using Command Line Overrides

Bash
# Set custom project name
python -m gfmrag.workflow.kgc_training --config-path config/gfm_rag wandb.project="my-experiment"

# Add custom tags (quote the list so the shell does not expand the brackets)
python -m gfmrag.workflow.kgc_training --config-path config/gfm_rag 'wandb.tags=["experiment1","baseline"]'

# Set run name
python -m gfmrag.workflow.kgc_training --config-path config/gfm_rag wandb.name="baseline-run-1"

# Disable wandb logging
python -m gfmrag.workflow.kgc_training --config-path config/gfm_rag wandb.enabled=false

Editing Configuration Files

Edit gfmrag/workflow/config/wandb/default.yaml:

YAML
enabled: true
project: "my-gfm-rag-project"
entity: "my-team"  # Your team name or username
name: null  # Will auto-generate
group: "experiment-1"  # Group related runs
tags: ["baseline", "v1"]
notes: "Baseline experiment with default settings"
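These fields map naturally onto keyword arguments of `wandb.init`. A minimal sketch of that mapping (a hypothetical helper, not the actual GFM-RAG code) that drops null values so wandb falls back to its defaults:

```python
# Hypothetical helper: turn a wandb config section into wandb.init() kwargs.
# This is a sketch of the idea, not the project's actual implementation.

WANDB_INIT_KEYS = ("project", "entity", "name", "group", "tags", "notes")

def build_init_kwargs(cfg: dict) -> dict:
    """Keep only recognized keys whose values are not null/None."""
    return {k: cfg[k] for k in WANDB_INIT_KEYS if cfg.get(k) is not None}

cfg = {
    "enabled": True,
    "project": "my-gfm-rag-project",
    "entity": "my-team",
    "name": None,            # null in YAML -> let wandb auto-generate a name
    "group": "experiment-1",
    "tags": ["baseline", "v1"],
    "notes": "Baseline experiment with default settings",
}

if cfg["enabled"]:
    kwargs = build_init_kwargs(cfg)
    # run = wandb.init(**kwargs)  # actual call; requires `wandb login`
```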

Configuration Options

Core Settings

| Parameter | Options | Note |
| --- | --- | --- |
| enabled | true, false | Whether to enable wandb logging. |
| project | Any string | Project name in wandb. Varies by script. |
| entity | Any string or null | Team or username for the project. null uses the default account. |
| name | Any string or null | Run name. null auto-generates a name. |
| group | Any string or null | Group name for organizing related runs. |
| tags | List of strings | Tags attached to the run. |
| notes | Any string | Description of the experiment. |

Advanced Settings

You can also configure additional wandb settings by editing the default configuration:

YAML
# Directory to save wandb files
dir: "./wandb_logs"

# Mode: "online", "offline", or "disabled"
mode: "online"

# Save source code with the run
save_code: true

# Log frequency for model watching
log_frequency: 100
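For reference, `dir`, `mode`, and `save_code` are standard `wandb.init` keyword arguments, and the log-frequency value corresponds to the `log_freq` parameter of `wandb.watch`. A hedged sketch of how these settings could be wired up (hypothetical variable names, not the project's actual code; the wandb calls are shown as comments because they require a logged-in session):

```python
# Hypothetical wiring of the advanced settings into wandb calls.
settings = {
    "dir": "./wandb_logs",   # where run files are written
    "mode": "online",        # "online", "offline", or "disabled"
    "save_code": True,       # snapshot the source code with the run
    "log_frequency": 100,    # how often wandb.watch logs gradients
}

# run = wandb.init(dir=settings["dir"], mode=settings["mode"],
#                  save_code=settings["save_code"])
# wandb.watch(model, log="gradients", log_freq=settings["log_frequency"])
```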

Logged Metrics

Pre-training (kgc_training.py)

Training Metrics:

  • train/loss - Binary cross-entropy loss (per batch)
  • train/epoch_loss - Average loss per epoch
  • train/epoch - Current epoch number

Evaluation Metrics:

  • eval/{dataset_name}/mr - Mean rank
  • eval/{dataset_name}/mrr - Mean reciprocal rank
  • eval/{dataset_name}/hits@{k} - Hits at k (k=1,3,10)
  • eval/mrr - Overall MRR
  • eval/best_mrr - Best MRR achieved

Model Artifacts:

  • Best model checkpoints with metadata
  • Epoch-wise checkpoints (if save_best_only=false)
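The slash-separated keys above are what wandb's UI uses to group metrics into panels. A sketch of how nested evaluation results could be flattened into that layout (hypothetical helper and dataset name, not the project's actual code):

```python
def flatten_metrics(results: dict, prefix: str = "eval") -> dict:
    """Flatten {dataset: {metric: value}} into 'eval/<dataset>/<metric>' keys,
    the layout wandb groups into panels by the '/' separator."""
    return {
        f"{prefix}/{dataset}/{metric}": value
        for dataset, metrics in results.items()
        for metric, value in metrics.items()
    }

# Hypothetical results for one validation dataset
results = {"fb15k237": {"mr": 120.4, "mrr": 0.35, "hits@10": 0.52}}
logged = flatten_metrics(results)
# wandb.log(logged)  # keys: eval/fb15k237/mr, eval/fb15k237/mrr, eval/fb15k237/hits@10
```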

Fine-tuning (sft_training.py)

Training Metrics:

  • train/{loss_name} - Individual loss components (per batch)
  • train/epoch_{loss_name} - Average losses per epoch
  • train/epoch - Current epoch number

Evaluation Metrics:

  • eval/{watched_metric} - Primary evaluation metric
  • eval/best_{watched_metric} - Best value achieved
  • test/{dataset_name}/{metric} - Final test results

Model Artifacts:

  • Best model checkpoints with metadata
  • Epoch-wise checkpoints (if save_best_only=false)
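Checkpoint uploads follow the standard wandb Artifact API; attaching metadata such as the epoch and the watched metric makes runs easy to compare later. A hedged sketch with hypothetical names and paths (the commented-out lines are the standard Artifact calls):

```python
# Hypothetical sketch of uploading a checkpoint as a wandb artifact.
def checkpoint_metadata(epoch: int, metric_name: str, value: float) -> dict:
    """Metadata stored alongside the checkpoint artifact."""
    return {"epoch": epoch, metric_name: value}

meta = checkpoint_metadata(epoch=7, metric_name="mrr", value=0.41)
# artifact = wandb.Artifact("model-best", type="model", metadata=meta)
# artifact.add_file("checkpoints/model_best.pth")
# wandb.log_artifact(artifact, aliases=["best"])
```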

Multi-GPU Training

The wandb integration is multi-GPU aware:

  • Only the rank 0 process logs to wandb (prevents duplicate logs)
  • All metrics are properly synchronized across processes
  • Model checkpoints are saved only on rank 0

Bash
# Multi-GPU training with wandb
torchrun --nproc_per_node=4 -m gfmrag.workflow.kgc_training --config-path config/gfm_rag
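The rank-0-only behavior typically relies on the RANK environment variable that torchrun sets for each process. A minimal sketch of the guard (hypothetical helper, not the project's actual code):

```python
import os

def is_main_process() -> bool:
    """True for rank 0 under torchrun, and for plain single-process runs
    where RANK is not set at all."""
    return int(os.environ.get("RANK", "0")) == 0

if is_main_process():
    pass  # wandb.init(...), wandb.log(...), and checkpoint saving go here only
```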

Best Practices

1. Organize Experiments

Use groups and tags to organize related experiments:

Bash
# Hyperparameter sweep
python -m gfmrag.workflow.sft_training --config-path config/gfm_rag wandb.group="hp_sweep" 'wandb.tags=["lr_0.001"]'
python -m gfmrag.workflow.sft_training --config-path config/gfm_rag wandb.group="hp_sweep" 'wandb.tags=["lr_0.0005"]'

2. Meaningful Run Names

Set descriptive run names for important experiments:

Bash
python -m gfmrag.workflow.sft_training --config-path config/gfm_rag wandb.name="baseline_hotpot_qa_v1"

3. Add Experiment Notes

Include context about your experiments:

Bash
python -m gfmrag.workflow.sft_training --config-path config/gfm_rag wandb.notes="Testing new loss function combination"

Troubleshooting

Common Issues

1. Login Required

Bash
wandb login

2. Disable Wandb for Debugging

Bash
python -m gfmrag.workflow.kgc_training --config-path config/gfm_rag wandb.enabled=false

3. Offline Mode

If you want to log locally without uploading (offline runs can be uploaded later with wandb sync):

Bash
export WANDB_MODE=offline
python -m gfmrag.workflow.kgc_training --config-path config/gfm_rag

4. Network Issues

For environments with restricted internet access:

Bash
export WANDB_MODE=disabled
python -m gfmrag.workflow.stage2_kg_pretrain

Verify Integration

Check that wandb is working correctly:

Python
import wandb

print(f"Wandb version: {wandb.__version__}")
print(f"Logged in: {wandb.api.api_key is not None}")

Example Workflow

Bash
# 1. Start pre-training with custom settings
python -m gfmrag.workflow.kgc_training --config-path config/gfm_rag \
    wandb.project="gfm-rag-experiments" \
    wandb.group="pretraining" \
    'wandb.tags=["baseline","hotpot"]' \
    wandb.name="pretrain_baseline_v1"

# 2. Fine-tune using the pre-trained model
python -m gfmrag.workflow.sft_training --config-path config/gfm_rag \
    wandb.project="gfm-rag-experiments" \
    wandb.group="finetuning" \
    'wandb.tags=["finetune","hotpot"]' \
    wandb.name="finetune_baseline_v1" \
    load_model_from_pretrained="path/to/pretrained/model"

This will create a complete experiment tracking workflow where you can easily compare different runs, track improvements, and share results with your team.