Week 06 — Q2

Dataset Preparation, Folder Structuring & Interpreting Logs

Theory is over — this is the hands-on week. You’ll take raw BraTS data, convert it into the exact format nnU-Net expects, launch your first real training runs, and learn to read training logs like a diagnostic tool. Most beginners get stuck here, so we’ll cover every common pitfall.

From DICOM to NIfTI: The First Step

If you’re working with BraTS challenge data, you can skip this section — BraTS data is already in NIfTI format. But if you ever work with clinical data from a hospital, it will arrive as DICOM files, and converting them correctly is the first critical step.

DICOM (Digital Imaging and Communications in Medicine) is the universal format used by MRI scanners, PACS systems, and hospitals. Each scan produces hundreds or thousands of individual DICOM files (one per slice). Converting to NIfTI combines these into a single 3D volume file with standardized metadata. The standard tool is dcm2niix.

⚠️
Conversion pitfalls that will silently break your model: Different MRI manufacturers (Siemens, GE, Philips) encode DICOM metadata differently, sometimes in private rather than public tags. Common errors include left-right flips (your model learns mirror-image anatomy), incorrect slice ordering (slices assembled in wrong order), and missing metadata (voxel spacing defaults to 1.0mm when the real spacing is different). A 2016 study documented these issues across manufacturers and modalities. Always verify orientation visually in 3D Slicer after conversion.
# Converting DICOM to NIfTI with dcm2niix
dcm2niix -z y -o /output/path /dicom/input/folder

# -z y : compress output as .nii.gz
# Always verify the output!
python -c "
import nibabel as nib
img = nib.load('output.nii.gz')
print('Shape:', img.shape)
print('Spacing:', img.header.get_zooms())
print('Orientation:', nib.aff2axcodes(img.affine))
"

Setting Up the nnU-Net Dataset Structure

This is where most beginners spend the most time debugging. nnU-Net is extremely strict about its folder structure and naming conventions. One wrong filename suffix, one misnamed folder, and the pipeline will either fail with a cryptic error or silently produce wrong results.

The Required Folder Layout

nnUNet_raw/
  Dataset001_BraTS/
    ├── imagesTr/                 # Training images
    │  ├── BraTS_001_0000.nii.gz  # Patient 001, T1 (channel 0)
    │  ├── BraTS_001_0001.nii.gz  # Patient 001, T1ce (channel 1)
    │  ├── BraTS_001_0002.nii.gz  # Patient 001, T2 (channel 2)
    │  ├── BraTS_001_0003.nii.gz  # Patient 001, FLAIR (channel 3)
    │  ├── BraTS_002_0000.nii.gz  # Patient 002, T1
    │  ├── BraTS_002_0001.nii.gz  # Patient 002, T1ce
    │  └── ...                    # 4 files per patient
    ├── labelsTr/                 # Training labels
    │  ├── BraTS_001.nii.gz      # Patient 001 segmentation
    │  ├── BraTS_002.nii.gz      # Patient 002 segmentation
    │  └── ...                    # NO channel suffix on labels
    ├── imagesTs/                 # (Optional) Test images
    └── dataset.json             # Metadata file

The Channel Suffix Convention

The _0000, _0001, _0002, _0003 suffixes tell nnU-Net which modality is which. This mapping must be consistent across all patients. If T1 is _0000 for patient 001, it must be _0000 for every single patient. Mixing this up is one of the most common and devastating errors.

BraTS Channel Mapping

_0000 → T1-weighted (T1)

_0001 → T1 contrast-enhanced (T1ce / T1Gd)

_0002 → T2-weighted (T2)

_0003 → Fluid-Attenuated Inversion Recovery (FLAIR)

The dataset.json File

{
  "channel_names": {
    "0": "T1",
    "1": "T1ce",
    "2": "T2",
    "3": "FLAIR"
  },
  "labels": {
    "background": 0,
    "NCR": 1,
    "ED": 2,
    "ET": 3
  },
  "numTraining": 1251,
  "file_ending": ".nii.gz"
}
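Rather than hand-editing this file (where a stray trailing comma silently breaks JSON parsing), you can generate it with `json.dump`. A minimal sketch; the output path is illustrative, and `channel_names`, `labels`, and `numTraining` must match your actual data:

```python
import json

# Illustrative values -- adjust to your own modality mapping,
# label encoding, and case count.
dataset_json = {
    "channel_names": {"0": "T1", "1": "T1ce", "2": "T2", "3": "FLAIR"},
    "labels": {"background": 0, "NCR": 1, "ED": 2, "ET": 3},
    "numTraining": 1251,
    "file_ending": ".nii.gz",
}

# Write this next to imagesTr/labelsTr in your Dataset001_BraTS folder.
with open("dataset.json", "w") as f:
    json.dump(dataset_json, f, indent=2)
```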
💡
BraTS label gotcha: Historically, BraTS labels use 0 (background), 1 (NCR/NET), 2 (edema), and 4 (enhancing tumor) — note that label 3 is skipped. nnU-Net v2 can handle non-contiguous labels, but if you’re remapping to 0/1/2/3, make sure the mapping is correct. The evaluation regions are: WT = labels {1,2,4}, TC = labels {1,4}, ET = label {4}.
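If you do remap, keep the remapping explicit and verifiable. A minimal numpy sketch (the helper name `remap_brats_labels` is mine); remember to save the result with an integer dtype:

```python
import numpy as np

def remap_brats_labels(label: np.ndarray) -> np.ndarray:
    """Remap historical BraTS labels {0, 1, 2, 4} to contiguous {0, 1, 2, 3}.

    Only label 4 (enhancing tumor) changes; 0/1/2 pass through untouched,
    so the function is safe to apply to already-remapped masks.
    """
    out = label.copy()
    out[label == 4] = 3
    return out

# Tiny synthetic example
mask = np.array([0, 1, 2, 4, 4, 0])
print(remap_brats_labels(mask))  # [0 1 2 3 3 0]
```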

Converting BraTS Data to nnU-Net Format

BraTS organizes data by patient folder with named modality files. nnU-Net needs a flat directory with channel suffixes. Here’s how to bridge the two:

import os, shutil, json

# BraTS structure: BraTS2021_00000/BraTS2021_00000_t1.nii.gz
# nnU-Net needs: imagesTr/BraTS2021_00000_0000.nii.gz

brats_dir = "/path/to/BraTS2021/training"
nnunet_raw = "/path/to/nnUNet_raw/Dataset001_BraTS"

os.makedirs(f"{nnunet_raw}/imagesTr", exist_ok=True)
os.makedirs(f"{nnunet_raw}/labelsTr", exist_ok=True)

# Modality-to-channel mapping
modality_map = {
  "t1": "0000",    # T1-weighted
  "t1ce": "0001",  # T1 contrast-enhanced
  "t2": "0002",    # T2-weighted
  "flair": "0003"# FLAIR
}

for patient in sorted(os.listdir(brats_dir)):
  patient_dir = os.path.join(brats_dir, patient)
  if not os.path.isdir(patient_dir): continue

  # Copy modality files with channel suffix
  for mod, suffix in modality_map.items():
    src = os.path.join(patient_dir, f"{patient}_{mod}.nii.gz")
    dst = os.path.join(nnunet_raw, "imagesTr",
          f"{patient}_{suffix}.nii.gz")
    shutil.copy2(src, dst)

  # Copy segmentation label (NO channel suffix)
  seg_src = os.path.join(patient_dir, f"{patient}_seg.nii.gz")
  seg_dst = os.path.join(nnunet_raw, "labelsTr",
            f"{patient}.nii.gz")
  shutil.copy2(seg_src, seg_dst)

print(f"Converted {len(os.listdir(nnunet_raw + '/labelsTr'))} patients")

After running this, always verify: count files in imagesTr (should be 4× the number of patients), count files in labelsTr (should equal number of patients), and spot-check a few by loading image + label in 3D Slicer to confirm alignment.

Label Verification: The Step Nobody Skips Twice

A study on synthetic data imperfections found that label quality is more important than image quality for segmentation model performance. Degraded labels caused substantial performance drops, while degraded images actually made models more robust. Another study found that automated label error detection improved model accuracy by up to 45%. Take label verification seriously.

1. Count unique label values in every segmentation mask. For BraTS, you should see {0, 1, 2, 4} (or {0, 1, 2, 3} if remapped). Any unexpected values (negative numbers, floats, label 5) indicate corruption.
2. Visual overlay check in 3D Slicer or ITK-SNAP. Load the T1ce image and overlay the segmentation. Scroll through the axial, sagittal, and coronal planes. Does the enhancing tumor label (brightest on T1ce) actually overlay the bright region?
3. Check spatial alignment. Image and label must have identical dimensions, voxel spacing, and orientation: img.shape == label.shape and np.allclose(img.affine, label.affine).
4. Check data type. Labels should be an integer type (int16, uint8). Float-type labels with values like 0.9999 instead of 1.0 will cause subtle errors.
5. Check for empty labels. Some patients may have all-zero segmentation masks (no visible tumor). These are valid but can cause problems if too numerous. Count how many.

import os

import nibabel as nib
import numpy as np

label_dir = "/path/to/nnUNet_raw/Dataset001_BraTS/labelsTr"

for f in sorted(os.listdir(label_dir)):
  img = nib.load(os.path.join(label_dir, f))
  label = img.get_fdata()  # note: get_fdata() always returns float64
  unique = np.unique(label).astype(int)
  has_tumor = np.any(label > 0)
  # get_data_dtype() reports the on-disk dtype -- this is what must be integer
  print(f"{f}: labels={unique}, shape={label.shape}, "
        f"dtype={img.get_data_dtype()}, tumor={has_tumor}")
  if not has_tumor:
    print("  ⚠️ WARNING: Empty label (no tumor voxels)")

Region-Based Training for BraTS

BraTS evaluates on three nested regions: Whole Tumor (WT), Tumor Core (TC), and Enhancing Tumor (ET). Standard multi-class training treats labels as independent classes, but these regions are hierarchical — ET is a subset of TC, which is a subset of WT. nnU-Net v2’s region-based training explicitly models this hierarchy and typically improves Dice scores by 1–2 points on the nested sub-regions.

To enable it, modify your dataset.json to define regions instead of classes:

{
  "channel_names": {"0": "T1", "1": "T1ce", "2": "T2", "3": "FLAIR"},
  "labels": {
    "background": 0,
    "whole_tumor": [1, 2, 3],
    "tumor_core": [1, 3],
    "enhancing_tumor": [3]
  },
  "regions_class_order": [1, 2, 3],
  "numTraining": 1251,
  "file_ending": ".nii.gz"
}

Here whole_tumor spans all tumor labels, tumor_core is NCR plus ET, and enhancing_tumor is ET alone (using the remapped 1/2/3 encoding). Note that JSON does not allow comments, so keep annotations like these out of the file itself.
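What the model learns to predict can be made concrete with numpy: three overlapping binary masks instead of one mutually exclusive label map. A sketch assuming the remapped 0/1/2/3 encoding (the helper name `brats_region_masks` is mine):

```python
import numpy as np

def brats_region_masks(label: np.ndarray) -> dict:
    """Build the three overlapping BraTS region masks from a label map
    encoded as 0=background, 1=NCR, 2=ED, 3=ET."""
    return {
        "WT": np.isin(label, [1, 2, 3]),  # whole tumor: all tumor labels
        "TC": np.isin(label, [1, 3]),     # tumor core: NCR + ET
        "ET": label == 3,                 # enhancing tumor only
    }

# One voxel of each class; note the nesting: every ET voxel is also TC,
# and every TC voxel is also WT.
regions = brats_region_masks(np.array([0, 1, 2, 3]))
```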
📚
Why region-based training matters: With standard training, the model predicts mutually exclusive classes (each voxel is exactly one class). With region-based training, the model predicts overlapping binary masks — a voxel can simultaneously be part of WT, TC, and ET. This matches BraTS’s evaluation scheme and prevents the model from making hierarchically inconsistent predictions (e.g., predicting ET without TC).

Patch Sizes, Batch Sizes & GPU Memory

You can’t fit a 240×240×155×4-channel brain MRI into GPU memory alongside a 3D U-Net. Instead, you train on patches — random sub-volumes cropped from the full image. The patch size is the single most impactful hyperparameter for GPU memory usage, and nnU-Net automatically selects it. But understanding the trade-offs helps you troubleshoot.

- 128³: typical nnU-Net patch size for BraTS on 12–24 GB GPUs
- 2: typical batch size for 3D brain tumor segmentation
- ~30–50%: memory savings from mixed-precision (FP16) training

Patch Size Trade-offs

Larger patches → more context (model sees more of the tumor and surrounding brain), better for large structures like whole tumor. But: more GPU memory, fewer patches per epoch, slower training. Smaller patches → less context (model may only see part of a tumor), but fits on cheaper GPUs and allows larger batch sizes. Research shows that patch-based methods can lose global context, reducing performance — one study found patch-free methods improved multi-organ Dice from 0.799 to 0.856.
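A quick back-of-envelope calculation shows why patch size dominates memory. The input tensor alone is only the floor; intermediate activations stored for backprop at every U-Net level typically multiply this by one to two orders of magnitude:

```python
# One training sample: a 128^3 patch with 4 modality channels.
patch_voxels = 128 ** 3
channels = 4

def mib(n_bytes: float) -> float:
    return n_bytes / 1024 ** 2

fp32 = patch_voxels * channels * 4  # 4 bytes per float32 value
fp16 = patch_voxels * channels * 2  # 2 bytes per float16 value

print(f"FP32 input: {mib(fp32):.1f} MiB")  # 32.0 MiB
print(f"FP16 input: {mib(fp16):.1f} MiB")  # 16.0 MiB -- the ~50% saving
```

Doubling the patch edge to 256 multiplies every one of these numbers by 8, which is why patch size, not network depth, is usually what out-of-memory errors trace back to.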

Batch Size & Learning Rate Coupling

With 3D medical images, batch sizes are typically just 1–4 due to memory constraints. A 2025 study found that optimizing batch size and learning rate separately is sub-optimal — they should be coupled as a ratio. Counterintuitively, smaller batches can actually improve performance on medical imaging data: latent spaces from smaller batches captured more biologically meaningful information. nnU-Net uses instance normalization (not batch normalization) precisely because batch sizes are so small that batch statistics would be unreliable.
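The normalization point is easy to see in code. A numpy sketch of instance normalization over an (N, C, D, H, W) batch: each sample and channel is normalized over its own spatial voxels, so the statistics never depend on how small the batch is:

```python
import numpy as np

def instance_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Instance normalization for a 5D batch (N, C, D, H, W).

    Mean and variance are computed per sample AND per channel over the
    spatial axes only -- unlike batch norm, which would pool statistics
    across the (tiny) batch dimension.
    """
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(2, 4, 8, 8, 8))
y = instance_norm(x)  # every (sample, channel) slice is now ~zero-mean, unit-std
```

With batch size 2, batch-norm statistics would be estimated from just two samples per channel; instance norm sidesteps the problem entirely.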

GPU Tier Guidelines

- 8–12 GB (RTX 3060/3080): the 2D config works well; 3D requires reduced patch sizes. Enable FP16.
- 16–24 GB (RTX 3090/4090): standard 3D BraTS training is comfortable. This is the sweet spot for students.
- 40+ GB (A100): larger patches, cascade configs, faster experimentation. What challenge teams use.
- Google Colab free tier (T4, 16 GB): can train 3D BraTS if you reduce the patch size slightly. Sessions may time out on long runs.

Running Your First Training Job

# Step 1: Verify environment variables are set
echo $nnUNet_raw $nnUNet_preprocessed $nnUNet_results

# Step 2: Plan and preprocess (do this ONCE per dataset)
nnUNetv2_plan_and_preprocess -d 001 --verify_dataset_integrity

# This will:
# 1. Extract dataset fingerprint
# 2. Plan all configurations (2D, 3D_fullres, etc.)
# 3. Preprocess all training cases
# 4. Save everything to nnUNet_preprocessed/

# Step 3: Train (start with fold 0 of 3D_fullres)
nnUNetv2_train 001 3d_fullres 0

# This trains for 1000 epochs (~12-24 hours on A100)
# Logs appear in nnUNet_results/Dataset001_BraTS/

# Step 4: (Later) Train remaining folds
nnUNetv2_train 001 3d_fullres 1
nnUNetv2_train 001 3d_fullres 2
nnUNetv2_train 001 3d_fullres 3
nnUNetv2_train 001 3d_fullres 4
💡
Pro tip: always include --verify_dataset_integrity on your first preprocessing run for a new dataset. It checks for mismatched image/label dimensions, missing files, and format errors, and will catch most of the conversion mistakes from Sections 6.2–6.3 before you waste hours on a broken training run.

Interpreting Training Logs

nnU-Net outputs training logs to the console and saves them in the results directory. Learning to read these logs is like learning a diagnostic language — they tell you whether your training is healthy or sick.

What to Watch

Training Loss (train_loss)

Should decrease steadily over epochs. Rapid decrease early (epochs 1–100), then slower improvements. If it’s not decreasing at all: check your data, learning rate, and label encoding. If it’s oscillating wildly: learning rate is too high. If it spikes suddenly: possible corrupted training sample or numerical instability.

Validation Loss (val_loss)

Computed on the held-out 20% of data each fold. Should follow training loss downward but will eventually plateau or slightly increase (the gap between train and val loss is the overfitting gap). In nnU-Net, this gap is typically small because of heavy augmentation. If val_loss increases sharply while train_loss keeps decreasing: overfitting.

Validation Dice per Class

The most informative metric. nnU-Net reports Dice for each class on the validation set. For BraTS, watch WT, TC, and ET separately. Healthy training: WT reaches 0.88–0.92 first (easiest region), TC follows at 0.82–0.88, ET last at 0.75–0.85 (hardest). If one region is stuck near zero: the model can’t find that class (check labels, check foreground oversampling).
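For reference, the metric itself is simple. nnU-Net computes it per class on the validation set; a minimal numpy sketch for a single binary region mask (the convention of returning 1.0 when both masks are empty is a common choice, not the only one):

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient between two binary masks (1.0 = perfect overlap)."""
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

# Toy example: 3 of 4 ground-truth voxels are predicted correctly,
# plus one false positive.
gt   = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
pred = np.array([1, 1, 1, 0, 1, 0], dtype=bool)
print(dice(pred, gt))  # 0.75
```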

Learning Rate

With polynomial decay from 0.01, the LR should decrease smoothly. At epoch 500 it should be roughly half of initial; by epoch 900 it should be near zero. If you see the LR isn’t changing: something is wrong with the scheduler configuration.
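You can sanity-check the LR values in your log against the schedule directly. A sketch of the polynomial decay used by nnU-Net (exponent 0.9, as described in the nnU-Net paper):

```python
def poly_lr(epoch: int, initial_lr: float = 0.01,
            max_epochs: int = 1000, exponent: float = 0.9) -> float:
    """Polynomial learning-rate decay: lr = lr0 * (1 - epoch/max)^0.9."""
    return initial_lr * (1 - epoch / max_epochs) ** exponent

for e in (0, 500, 900):
    print(e, round(poly_lr(e), 5))
# 0 -> 0.01, 500 -> 0.00536 (roughly half), 900 -> 0.00126 (near zero)
```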

Common Log Patterns and What They Mean

⚠️ Pattern: All Dice scores stuck at 0.0

Diagnosis: Model is predicting all background. Causes: Labels are all zeros (empty masks), labels use wrong encoding (float instead of int), channel mapping is wrong (model receives noise instead of meaningful modalities), or severe class imbalance without foreground oversampling. Fix: Run the label verification script from Section 6.4. Check your dataset.json labels mapping.

⚠️ Pattern: Loss is NaN

Diagnosis: Numerical instability. Causes: Learning rate too high, corrupted input data with extreme values (inf, NaN in the images), or division by zero in loss computation. Fix: Check for NaN/inf values in your input data. Try reducing learning rate. Disable mixed-precision training temporarily to isolate the issue.

⚠️ Pattern: WT Dice is good but ET Dice is very low

Diagnosis: Normal for early training (ET is the hardest region). If it persists past epoch 300–400: the model struggles with small enhancing tumor regions. Potential fixes: Enable region-based training (Section 6.5), verify T1ce channel mapping is correct (enhancing tumor is defined by T1ce contrast), check if many cases have very small or absent ET.

✓ Pattern: Healthy training

Train loss decreases smoothly. Val loss follows with a small gap. Dice scores improve rapidly in the first 200 epochs, then gradually improve through epoch 600–800, then plateau. Final scores for BraTS: WT > 0.88, TC > 0.82, ET > 0.75 on cross-validation. These are strong baselines — top teams push 2–5 points higher with the optimizations you’ll learn in Weeks 7–8.

Data Leakage: The Mistake That Ruins Everything

Data leakage occurs when information from the test set “leaks” into training, giving artificially inflated performance that won’t generalize. In medical imaging, the most common form is patient-level leakage: different scans from the same patient appearing in both training and validation sets.

- +30–55%: false accuracy boost from slice-level splitting (vs patient-level)
- ~96%: erroneous accuracy on randomly labeled data with a slice-level split
- 50%: accuracy on randomly labeled data with a correct patient-level split

A 2021 study quantified the impact: slice-level splitting (where 2D slices from the same patient can appear in both train and test) boosted accuracy by 30–55% across four datasets. Tests on randomly labeled data produced ~96% erroneous accuracy with slice-level splits but only 50% (chance level) with correct patient-level splits.

⚠️
nnU-Net handles this correctly by default — it splits at the patient level for cross-validation. But if you’re building a custom pipeline in MONAI or raw PyTorch, you must ensure patient-level splitting. If a patient has multiple scans (e.g., pre- and post-treatment), all scans from that patient must be in the same fold.
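If you do roll your own splits, group by patient before assigning folds. A minimal sketch in plain Python; the ID-parsing rule here (patient ID = first two underscore-separated fields) is an assumption you must adapt to your own naming scheme:

```python
import random
from collections import defaultdict

def patient_level_folds(scan_ids, n_folds=5, seed=42):
    """Assign scans to folds so all scans of a patient share a fold.

    Assumes IDs like 'BraTS_001_pre', where the patient ID is the first
    two underscore-separated fields -- adjust the parsing for your data.
    """
    by_patient = defaultdict(list)
    for sid in scan_ids:
        patient = "_".join(sid.split("_")[:2])
        by_patient[patient].append(sid)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)  # deterministic shuffle

    folds = [[] for _ in range(n_folds)]
    for i, p in enumerate(patients):
        folds[i % n_folds].extend(by_patient[p])  # whole patient -> one fold
    return folds
```

Splitting on the patient dictionary rather than the scan list is the entire trick: pre- and post-treatment scans of the same patient can never end up on opposite sides of a fold boundary.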

Dataset Size: When Do You Have Enough?

A practical question: how many training cases do you need before adding more doesn’t help? Research using exponential-plateau models gives concrete answers:

Performance Plateaus

For kidney tumor segmentation with nnU-Net: 2D reached a Dice plateau of 0.88 at ~177 cases. 3D reached 0.90 at ~440 cases. 3D models consistently need more data than 2D. A separate study found that performance gains plateau beyond approximately 80% of available data, with brain tumor segmentation reaching DSC 0.79 with residual U-Nets. For BraTS with ~1,200+ cases, you’re well past the plateau.

Another important finding: stratifying training data by tumor grade (separating high-grade and low-grade gliomas and training separate models) improved performance in 64.9% of cases (p<0.0001). This suggests that how you organize your data can matter as much as how much you have.

Quality Control Checklist Before Training

Run through this checklist before every training run. It takes 15 minutes and saves hours of wasted compute.

1. File counts: imagesTr has exactly N×4 files (4 modalities per patient). labelsTr has exactly N files. Every patient has all 4 modalities and 1 label.
2. Naming consistency: patient IDs match across imagesTr and labelsTr. Channel suffixes (_0000 through _0003) are correct and consistent.
3. Label values: every label file contains only expected integer values. No floats, no negative values, no unexpected classes.
4. Spatial consistency: all images and labels for the same patient have identical shape, spacing, and orientation.
5. Visual spot-check: load 3–5 random patients in 3D Slicer. Overlay the label on T1ce. Does enhancing tumor match the bright enhancement? Does edema match the FLAIR hyperintensity?
6. dataset.json validity: channel_names matches your modality mapping. labels matches your label encoding. numTraining matches the actual file count.
7. Environment variables: nnUNet_raw, nnUNet_preprocessed, nnUNet_results all point to valid, writable directories with sufficient disk space.
8. Integrity check: run nnUNetv2_plan_and_preprocess -d 001 --verify_dataset_integrity and fix any reported errors before training.

This Week’s Learning Resources

Hands-On (Do These This Week)

The definitive reference for folder structure, naming conventions, and dataset.json specification. Read this before converting a single file.
Step-by-step from plan_and_preprocess through training and inference. Follow this exactly for your first run.
The standard conversion tool. Not needed for BraTS data, but essential when you work with clinical DICOM data. Handles manufacturer-specific quirks.
Use this week to overlay labels on images for every QC check. The “Segment Editor” module lets you load NIfTI segmentations as overlays.
Includes specific BraTS-Metastases data conversion code. Helpful for seeing another example of the conversion process.

Key Papers

The dcm2niix paper. Documents manufacturer-specific conversion challenges across modalities. Essential if you ever work with clinical DICOM data.
J Neurosci Methods. 2016;264:47–56
The definitive study showing label quality matters more than image quality. Provides evidence for why the verification steps in this lesson aren’t optional.
Med Phys. 2023;50(6):3644–3656
Quantifies the devastating impact of slice-level vs patient-level splitting: 30–55% false accuracy boost. The paper that should convince you to never skip patient-level splitting.
Sci Rep. 2021;11:22544
Exponential-plateau models showing exactly how many training cases are needed for 2D vs 3D nnU-Net configurations. Practical guidance for planning data collection.
J Digit Imaging. 2023;36:1770–1781
Failed replication study showing that insufficient preprocessing description causes reproduction failures. Motivation for documenting every step of your pipeline.
BMJ Open. 2022;12:e059000

Deep Dives (Advanced)

Comprehensive review of solutions for scarce, sparse, noisy, and weak annotations. Your roadmap when label quality is a problem.
Review of methods for handling label noise. Critical when your annotations come from multiple annotators with varying expertise.
The standard for organizing neuroimaging datasets. Useful when managing large multi-site datasets before converting to nnU-Net format.
Evidence that training separate models for HGG and LGG improves performance in 64.9% of cases. An advanced data organization strategy.