Theory is over — this is the hands-on week. You’ll take raw BraTS data, convert it into the exact format nnU-Net expects, launch your first real training runs, and learn to read training logs like a diagnostic tool. Most beginners get stuck here, so we’ll cover every common pitfall.
If you’re working with BraTS challenge data, you can skip this section — BraTS data is already in NIfTI format. But if you ever work with clinical data from a hospital, it will arrive as DICOM files, and converting them correctly is the first critical step.
DICOM (Digital Imaging and Communications in Medicine) is the universal format used by MRI scanners, PACS systems, and hospitals. Each scan produces hundreds or thousands of individual DICOM files (one per slice). Converting to NIfTI combines these into a single 3D volume file with standardized metadata. The standard tool is dcm2niix.
# Converting DICOM to NIfTI with dcm2niix
dcm2niix -z y -o /output/path /dicom/input/folder
# -z y : compress output as .nii.gz
# Always verify the output!
python -c "
import nibabel as nib
img = nib.load('output.nii.gz')
print('Shape:', img.shape)
print('Spacing:', img.header.get_zooms())
print('Orientation:', nib.aff2axcodes(img.affine))
"
This is where most beginners spend the most time debugging. nnU-Net is extremely strict about its folder structure and naming conventions. One wrong filename suffix, one misnamed folder, and the pipeline will either fail with a cryptic error or silently produce wrong results.
nnUNet_raw/
Dataset001_BraTS/
├── imagesTr/ # Training images
│ ├── BraTS_001_0000.nii.gz # Patient 001, T1 (channel 0)
│ ├── BraTS_001_0001.nii.gz # Patient 001, T1ce (channel 1)
│ ├── BraTS_001_0002.nii.gz # Patient 001, T2 (channel 2)
│ ├── BraTS_001_0003.nii.gz # Patient 001, FLAIR (channel 3)
│ ├── BraTS_002_0000.nii.gz # Patient 002, T1
│ ├── BraTS_002_0001.nii.gz # Patient 002, T1ce
│ └── ... # 4 files per patient
├── labelsTr/ # Training labels
│ ├── BraTS_001.nii.gz # Patient 001 segmentation
│ ├── BraTS_002.nii.gz # Patient 002 segmentation
│ └── ... # NO channel suffix on labels
├── imagesTs/ # (Optional) Test images
└── dataset.json # Metadata file
The _0000, _0001, _0002, _0003 suffixes tell nnU-Net which modality is which. This mapping must be consistent across all patients. If T1 is _0000 for patient 001, it must be _0000 for every single patient. Mixing this up is one of the most common and devastating errors.
_0000 → T1-weighted (T1)
_0001 → T1 contrast-enhanced (T1ce / T1Gd)
_0002 → T2-weighted (T2)
_0003 → Fluid-Attenuated Inversion Recovery (FLAIR)
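The consistency requirement can be checked mechanically before training. Here is a small sketch (the path and the `find_channel_gaps` name are illustrative, not part of nnU-Net) that parses filenames like `BraTS_001_0000.nii.gz` and reports any case missing a channel:

```python
import os
from collections import defaultdict

EXPECTED_CHANNELS = {"0000", "0001", "0002", "0003"}

def find_channel_gaps(images_dir, expected=EXPECTED_CHANNELS):
    """Return {case_id: channels_found} for cases missing any expected channel."""
    channels = defaultdict(set)
    for f in os.listdir(images_dir):
        if not f.endswith(".nii.gz"):
            continue
        stem = f[: -len(".nii.gz")]          # "BraTS_001_0000"
        case, channel = stem.rsplit("_", 1)  # ("BraTS_001", "0000")
        channels[case].add(channel)
    return {c: ch for c, ch in channels.items() if ch != expected}

# Example (placeholder path — adjust to your setup):
# gaps = find_channel_gaps("/path/to/nnUNet_raw/Dataset001_BraTS/imagesTr")
# for case, chans in sorted(gaps.items()):
#     print(f"{case}: found {sorted(chans)}, expected {sorted(EXPECTED_CHANNELS)}")
```

This catches both missing modalities and inconsistent suffixes (a patient with two `_0000` files and no `_0001` shows up immediately).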
{
  "channel_names": {
    "0": "T1",
    "1": "T1ce",
    "2": "T2",
    "3": "FLAIR"
  },
  "labels": {
    "background": 0,
    "NCR": 1,
    "ED": 2,
    "ET": 3
  },
  "numTraining": 1251,
  "file_ending": ".nii.gz"
}
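Rather than writing dataset.json by hand, you can generate it by script so that numTraining can never drift out of sync with the actual file count. A minimal sketch (`write_dataset_json` is a made-up helper, and it assumes the label encoding shown above and an already-populated labelsTr):

```python
import json
import os

def write_dataset_json(dataset_dir):
    """Write a minimal dataset.json; numTraining is derived from labelsTr."""
    labels_tr = os.path.join(dataset_dir, "labelsTr")
    n_training = len([f for f in os.listdir(labels_tr) if f.endswith(".nii.gz")])
    meta = {
        "channel_names": {"0": "T1", "1": "T1ce", "2": "T2", "3": "FLAIR"},
        "labels": {"background": 0, "NCR": 1, "ED": 2, "ET": 3},
        "numTraining": n_training,
        "file_ending": ".nii.gz",
    }
    with open(os.path.join(dataset_dir, "dataset.json"), "w") as fh:
        json.dump(meta, fh, indent=2)
    return meta
```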
BraTS organizes data by patient folder with named modality files. nnU-Net needs a flat directory with channel suffixes. Here’s how to bridge the two:
import os, shutil
import nibabel as nib
import numpy as np

# BraTS structure: BraTS2021_00000/BraTS2021_00000_t1.nii.gz
# nnU-Net needs:   imagesTr/BraTS2021_00000_0000.nii.gz
brats_dir = "/path/to/BraTS2021/training"
nnunet_raw = "/path/to/nnUNet_raw/Dataset001_BraTS"
os.makedirs(f"{nnunet_raw}/imagesTr", exist_ok=True)
os.makedirs(f"{nnunet_raw}/labelsTr", exist_ok=True)

# Modality-to-channel mapping
modality_map = {
    "t1": "0000",     # T1-weighted
    "t1ce": "0001",   # T1 contrast-enhanced
    "t2": "0002",     # T2-weighted
    "flair": "0003",  # FLAIR
}

for patient in sorted(os.listdir(brats_dir)):
    patient_dir = os.path.join(brats_dir, patient)
    if not os.path.isdir(patient_dir):
        continue
    # Copy modality files with channel suffix
    for mod, suffix in modality_map.items():
        src = os.path.join(patient_dir, f"{patient}_{mod}.nii.gz")
        dst = os.path.join(nnunet_raw, "imagesTr", f"{patient}_{suffix}.nii.gz")
        shutil.copy2(src, dst)
    # Convert the segmentation label (NO channel suffix).
    # BraTS ships the enhancing tumor as label 4; nnU-Net requires
    # consecutive integers, so remap 4 -> 3 to match dataset.json.
    seg_img = nib.load(os.path.join(patient_dir, f"{patient}_seg.nii.gz"))
    seg = np.asarray(seg_img.get_fdata(), dtype=np.uint8)
    seg[seg == 4] = 3
    out = nib.Nifti1Image(seg, seg_img.affine)
    out.set_data_dtype(np.uint8)
    nib.save(out, os.path.join(nnunet_raw, "labelsTr", f"{patient}.nii.gz"))

print(f"Converted {len(os.listdir(nnunet_raw + '/labelsTr'))} patients")
After running this, always verify: count files in imagesTr (should be 4× the number of patients), count files in labelsTr (should equal number of patients), and spot-check a few by loading image + label in 3D Slicer to confirm alignment.
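The count check can be a throwaway script. A sketch (the `check_counts` name and paths are illustrative):

```python
import os

def check_counts(dataset_dir):
    """Verify the 4-images-per-label rule for an nnU-Net raw dataset."""
    n_img = len([f for f in os.listdir(os.path.join(dataset_dir, "imagesTr"))
                 if f.endswith(".nii.gz")])
    n_lbl = len([f for f in os.listdir(os.path.join(dataset_dir, "labelsTr"))
                 if f.endswith(".nii.gz")])
    ok = n_img == 4 * n_lbl
    print(f"imagesTr: {n_img}, labelsTr: {n_lbl}, 4x rule {'OK' if ok else 'VIOLATED'}")
    return ok
```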
A study on synthetic data imperfections found that label quality is more important than image quality for segmentation model performance. Degraded labels caused substantial performance drops, while degraded images actually made models more robust. Another study found that automated label error detection improved model accuracy by up to 45%. Take label verification seriously.
For every case, verify that img.shape == label.shape and that img.affine matches label.affine (approximately; tiny floating-point differences in the affine are fine).
import nibabel as nib
import numpy as np
import os

label_dir = "/path/to/nnUNet_raw/Dataset001_BraTS/labelsTr"
for f in sorted(os.listdir(label_dir)):
    img = nib.load(os.path.join(label_dir, f))
    label = img.get_fdata()  # note: get_fdata() always returns float64
    unique = np.unique(label).astype(int)
    has_tumor = np.any(label > 0)
    print(f"{f}: labels={unique}, shape={label.shape}, "
          f"dtype={img.get_data_dtype()}, tumor={has_tumor}")
    if not has_tumor:
        print("  ⚠️ WARNING: Empty label (no tumor voxels)")
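The shape-and-affine check can be scripted the same way. A minimal sketch (`check_geometry` is a made-up helper; it accepts anything exposing `.shape` and `.affine`, such as loaded nibabel images):

```python
import numpy as np

def check_geometry(img, lbl, atol=1e-3):
    """True when an image and its label occupy the same voxel grid.

    img and lbl are loaded volumes (e.g. nibabel images) exposing
    .shape and .affine.
    """
    same_shape = tuple(img.shape) == tuple(lbl.shape)
    same_affine = np.allclose(img.affine, lbl.affine, atol=atol)
    return same_shape and same_affine

# Usage with nibabel (assumed installed):
# import nibabel as nib
# ok = check_geometry(nib.load("BraTS_001_0000.nii.gz"),
#                     nib.load("BraTS_001.nii.gz"))
```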
BraTS evaluates on three nested regions: Whole Tumor (WT), Tumor Core (TC), and Enhancing Tumor (ET). Standard multi-class training treats labels as independent classes, but these regions are hierarchical — ET is a subset of TC, which is a subset of WT. nnU-Net v2’s region-based training explicitly models this hierarchy and typically improves Dice scores by 1–2 points on the nested sub-regions.
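The nesting is easiest to see in code. A sketch, assuming this chapter's label encoding (NCR=1, ED=2, ET=3); `region_masks` and `dice` are illustrative helpers, not nnU-Net APIs:

```python
import numpy as np

def region_masks(seg):
    """Derive the three nested BraTS evaluation regions from a label map."""
    return {
        "WT": np.isin(seg, [1, 2, 3]),  # whole tumor: all tumor labels
        "TC": np.isin(seg, [1, 3]),     # tumor core: NCR + ET
        "ET": seg == 3,                 # enhancing tumor only
    }

def dice(pred_mask, gt_mask, eps=1e-8):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return 2.0 * inter / (pred_mask.sum() + gt_mask.sum() + eps)
```

Because every ET voxel is also a TC voxel and every TC voxel a WT voxel, predicting the three regions as independent binary maps (rather than mutually exclusive classes) matches the evaluation exactly.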
To enable it, modify your dataset.json to define regions instead of classes. The regions_class_order field tells nnU-Net which label value to write for each region (in the order the regions are listed) when converting region predictions back to a label map, with later regions overwriting earlier ones:
{
  "channel_names": {"0": "T1", "1": "T1ce", "2": "T2", "3": "FLAIR"},
  "labels": {
    "background": 0,
    "whole_tumor": [1, 2, 3],   // WT = all tumor labels
    "tumor_core": [1, 3],       // TC = NCR + ET
    "enhancing_tumor": [3]      // ET only
  },
  "regions_class_order": [2, 1, 3],  // WT voxels -> ED (2), then TC -> NCR (1), then ET -> 3
  "numTraining": 1251,
  "file_ending": ".nii.gz"
}
You can’t fit a 240×240×155×4-channel brain MRI into GPU memory alongside a 3D U-Net. Instead, you train on patches — random sub-volumes cropped from the full image. The patch size is the single most impactful hyperparameter for GPU memory usage, and nnU-Net automatically selects it. But understanding the trade-offs helps you troubleshoot.
Larger patches → more context (model sees more of the tumor and surrounding brain), better for large structures like whole tumor. But: more GPU memory, fewer patches per epoch, slower training. Smaller patches → less context (model may only see part of a tumor), but fits on cheaper GPUs and allows larger batch sizes. Research shows that patch-based methods can lose global context, reducing performance — one study found patch-free methods improved multi-organ Dice from 0.799 to 0.856.
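The trade-off is easy to quantify with back-of-envelope arithmetic; the numbers below are illustrative only:

```python
# How much of a BraTS volume does one patch actually see?
full = 240 * 240 * 155    # one BraTS volume: ~8.9M voxels per channel
patch = 128 * 128 * 128   # a common 3D patch size: ~2.1M voxels

print(f"Full volume: {full:,} voxels/channel")
print(f"128^3 patch: {patch:,} voxels/channel ({patch / full:.0%} of the volume)")

# Input tensor for one patch, 4 channels, float32:
print(f"Patch input tensor: {patch * 4 * 4 / 1e6:.0f} MB")
# Activations through a 3D U-Net are typically 1-2 orders of magnitude
# larger than the input, which is why patch size dominates GPU memory use.
```

A 128³ patch covers under a quarter of the volume, so a large whole-tumor region can easily extend past the patch boundary; this is the context loss the research above refers to.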
With 3D medical images, batch sizes are typically just 1–4 due to memory constraints. A 2025 study found that optimizing batch size and learning rate separately is sub-optimal — they should be coupled as a ratio. Counterintuitively, smaller batches can actually improve performance on medical imaging data: latent spaces from smaller batches captured more biologically meaningful information. nnU-Net uses instance normalization (not batch normalization) precisely because batch sizes are so small that batch statistics would be unreliable.
- 8–12 GB (RTX 3060/3080): 2D config works well; 3D requires reduced patches. Enable FP16.
- 16–24 GB (RTX 3090/4090): Standard 3D BraTS training comfortable. This is the sweet spot for students.
- 40+ GB (A100): Larger patches, cascade configs, faster experimentation. What challenge teams use.
- Google Colab free tier (T4, 16 GB): Can train 3D BraTS if you reduce patch size slightly. Sessions may time out on long runs.
# Step 1: Verify environment variables are set
echo $nnUNet_raw $nnUNet_preprocessed $nnUNet_results
# Step 2: Plan and preprocess (do this ONCE per dataset)
nnUNetv2_plan_and_preprocess -d 001 --verify_dataset_integrity
# This will:
# 1. Extract dataset fingerprint
# 2. Plan all configurations (2D, 3D_fullres, etc.)
# 3. Preprocess all training cases
# 4. Save everything to nnUNet_preprocessed/
# Step 3: Train (start with fold 0 of 3D_fullres)
nnUNetv2_train 001 3d_fullres 0
# This trains for 1000 epochs (~12-24 hours on A100)
# Logs appear in nnUNet_results/Dataset001_BraTS/
# Step 4: (Later) Train remaining folds
nnUNetv2_train 001 3d_fullres 1
nnUNetv2_train 001 3d_fullres 2
nnUNetv2_train 001 3d_fullres 3
nnUNetv2_train 001 3d_fullres 4
Always include --verify_dataset_integrity on your first preprocessing run. It checks for mismatched image/label dimensions, missing files, and format errors, and will catch most of the conversion mistakes from Sections 6.2–6.3 before you waste hours on a broken training run.

nnU-Net outputs training logs to the console and saves them in the results directory. Learning to read these logs is like learning a diagnostic language — they tell you whether your training is healthy or sick.
Training loss: should decrease steadily over epochs, with a rapid decrease early (epochs 1–100), then slower improvements. If it’s not decreasing at all: check your data, learning rate, and label encoding. If it’s oscillating wildly: the learning rate is too high. If it spikes suddenly: possible corrupted training sample or numerical instability.

Validation loss: computed on the held-out 20% of data each fold. It should follow training loss downward but will eventually plateau or slightly increase (the gap between train and val loss is the overfitting gap). In nnU-Net, this gap is typically small because of heavy augmentation. If val_loss increases sharply while train_loss keeps decreasing: overfitting.

Validation Dice: the most informative metric. nnU-Net reports Dice for each class on the validation set. For BraTS, watch WT, TC, and ET separately. Healthy training: WT reaches 0.88–0.92 first (easiest region), TC follows at 0.82–0.88, ET last at 0.75–0.85 (hardest). If one region is stuck near zero: the model can’t find that class (check labels, check foreground oversampling).

Learning rate: with polynomial decay from 0.01, the LR should decrease smoothly. At epoch 500 it should be roughly half of the initial value; by epoch 900 it should be near zero. If the LR isn’t changing: something is wrong with the scheduler configuration.
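The schedule is easy to sanity-check by hand. A sketch of polynomial decay with exponent 0.9 (the value nnU-Net's default trainer uses; `poly_lr` is an illustrative function, not the library's API):

```python
def poly_lr(epoch, initial_lr=0.01, max_epochs=1000, exponent=0.9):
    """Polynomial learning-rate decay: lr = lr0 * (1 - t/T)^0.9."""
    return initial_lr * (1 - epoch / max_epochs) ** exponent

for e in (0, 500, 900, 999):
    print(f"epoch {e:4d}: lr = {poly_lr(e):.5f}")
# Halfway through training the LR is ~0.0054 (roughly half of 0.01,
# since 0.5^0.9 ≈ 0.54); by epoch 900 it is ~0.0013.
```

Comparing a few of these hand-computed values against the logged LR is a quick way to confirm the scheduler is actually running.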
Symptom: Dice stays at zero. Diagnosis: the model is predicting all background. Causes: labels are all zeros (empty masks), labels use the wrong encoding (float instead of int), the channel mapping is wrong (the model receives noise instead of meaningful modalities), or severe class imbalance without foreground oversampling. Fix: run the label verification script from Section 6.4 and check your dataset.json labels mapping.

Symptom: the loss becomes NaN. Diagnosis: numerical instability. Causes: learning rate too high, corrupted input data with extreme values (inf, NaN in the images), or division by zero in the loss computation. Fix: check for NaN/inf values in your input data, try reducing the learning rate, and disable mixed-precision training temporarily to isolate the issue.

Symptom: ET Dice lags far behind WT and TC. Diagnosis: normal for early training (ET is the hardest region). If it persists past epoch 300–400: the model struggles with small enhancing-tumor regions. Potential fixes: enable region-based training (Section 6.5), verify the T1ce channel mapping is correct (enhancing tumor is defined by T1ce contrast), and check whether many cases have very small or absent ET.
Train loss decreases smoothly. Val loss follows with a small gap. Dice scores improve rapidly in the first 200 epochs, then gradually improve through epoch 600–800, then plateau. Final scores for BraTS: WT > 0.88, TC > 0.82, ET > 0.75 on cross-validation. These are strong baselines — top teams push 2–5 points higher with the optimizations you’ll learn in Weeks 7–8.
Data leakage occurs when information from the test set “leaks” into training, giving artificially inflated performance that won’t generalize. In medical imaging, the most common form is patient-level leakage: different scans from the same patient appearing in both training and validation sets.
A 2021 study quantified the impact: slice-level splitting (where 2D slices from the same patient can appear in both train and test) boosted accuracy by 30–55% across four datasets. Tests on randomly labeled data produced ~96% erroneous accuracy with slice-level splits but only 50% (chance level) with correct patient-level splits.
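Patient-level splitting is a few lines of code. A self-contained sketch using only the standard library (scikit-learn's GroupKFold does the same job at scale; `patient_level_split` and the `patientID_scanN` naming scheme are illustrative assumptions — adjust the key extraction to your filenames):

```python
import random

def patient_level_split(scan_ids, val_fraction=0.2, seed=42):
    """Split scan IDs so all scans of a patient land on the same side.

    Assumes IDs like 'patient07_scan2', where the prefix before the
    first '_' identifies the patient.
    """
    patients = sorted({s.split("_")[0] for s in scan_ids})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_val = max(1, int(len(patients) * val_fraction))
    val_patients = set(patients[:n_val])
    train = [s for s in scan_ids if s.split("_")[0] not in val_patients]
    val = [s for s in scan_ids if s.split("_")[0] in val_patients]
    return train, val
```

The key property: the shuffle happens over patients, not scans, so no patient can straddle the split. nnU-Net's built-in cross-validation splits at the case level, which is safe as long as each case ID corresponds to exactly one patient.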
A practical question: how many training cases do you need before adding more doesn’t help? Research using exponential-plateau models gives concrete answers:
For kidney tumor segmentation with nnU-Net: 2D reached a Dice plateau of 0.88 at ~177 cases. 3D reached 0.90 at ~440 cases. 3D models consistently need more data than 2D. A separate study found that performance gains plateau beyond approximately 80% of available data, with brain tumor segmentation reaching DSC 0.79 with residual U-Nets. For BraTS with ~1,200+ cases, you’re well past the plateau.
Another important finding: stratifying training data by tumor grade (separating high-grade and low-grade gliomas and training separate models) improved performance in 64.9% of cases (p<0.0001). This suggests that how you organize your data can matter as much as how much you have.
Run through this checklist before every training run. It takes 15 minutes and saves hours of wasted compute.
- File counts: imagesTr has exactly N×4 files (4 modalities per patient); labelsTr has exactly N files; every patient has all 4 modalities and 1 label.
- Naming: case identifiers match between imagesTr and labelsTr, and channel suffixes (_0000 through _0003) are correct and consistent.
- dataset.json: channel_names matches your modality mapping, labels matches your label encoding, and numTraining matches the actual file count.
- Environment: nnUNet_raw, nnUNet_preprocessed, nnUNet_results all point to valid, writable directories with sufficient disk space.
- Integrity: run nnUNetv2_plan_and_preprocess -d 001 --verify_dataset_integrity and fix any reported errors before training.