Week 05 — Q2

In-Depth Understanding of nnU-Net

Last week you learned what nnU-Net does. This week you learn how it does it — every internal decision, from how the dataset fingerprint drives architecture choices to exactly what happens during inference. By the end, you’ll be able to read an nnUNetPlans.json file and understand every line.

Dataset Fingerprinting: How nnU-Net Reads Your Data

When you run nnUNetv2_plan_and_preprocess, the first thing nnU-Net does is create a dataset fingerprint — a JSON file containing every property of your data that matters for pipeline design. This fingerprint drives every subsequent decision. Understanding it is the key to understanding nnU-Net.

What Gets Extracted

Image Properties

Median image shape: The typical volume dimensions across your dataset (e.g., 240×240×155 for BraTS). Determines the upper bound on patch size and whether a cascade is needed.

Voxel spacing distribution: The physical size of each voxel in mm (e.g., 1.0×1.0×1.0 for BraTS). Determines the target resampling resolution. If one axis has much larger spacing than the others (anisotropic data), nnU-Net uses special handling.

Intensity statistics per modality: Mean, standard deviation, and percentiles of voxel intensities. Determines whether CT-style (global) or MRI-style (per-image) normalization is used.

Label Properties

Class frequencies: How many voxels belong to each class. For BraTS, the background vastly outnumbers all tumor classes. This influences post-processing decisions (whether to remove small connected components).

Number of classes: Determines the output channels of the network.

Dataset size: Number of training cases. Affects batch size and whether all configurations are worth training.

After fingerprinting, examine the output:

# After running plan_and_preprocess, look at these files:
nnUNet_preprocessed/Dataset001_BraTS/
  ├── dataset_fingerprint.json   # Raw dataset properties
  ├── nnUNetPlans.json          # All pipeline decisions
  ├── nnUNetPlans_2d/           # 2D config preprocessed data
  └── nnUNetPlans_3d_fullres/   # 3D config preprocessed data
💡
Hands-on exercise: After running nnUNetv2_plan_and_preprocess -d 001 --verify_dataset_integrity, open dataset_fingerprint.json and nnUNetPlans.json in a text editor. Trace each value in the fingerprint to the decisions in the plans file. This exercise is worth more than reading ten tutorials.
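To make that tracing exercise concrete, here is a small sketch that reduces a fingerprint-style dict to the two values that matter most for planning. The key names (`spacings`, `shapes_after_crop`) follow the v2 fingerprint format but should be checked against your own file, and the three-case data is invented:

```python
import numpy as np

def summarize_fingerprint(fp: dict) -> dict:
    """Reduce a fingerprint-style dict to the values that drive planning."""
    spacings = np.asarray(fp["spacings"], dtype=float)         # (z, y, x) per case
    shapes = np.asarray(fp["shapes_after_crop"], dtype=float)  # shape per case
    return {
        "median_spacing": np.median(spacings, axis=0).tolist(),
        "median_shape": np.median(shapes, axis=0).tolist(),
    }

# Tiny hypothetical fingerprint with three cases:
fp = {
    "spacings": [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.2, 1.0, 1.0]],
    "shapes_after_crop": [[138, 169, 138], [140, 171, 137], [135, 160, 140]],
}
summary = summarize_fingerprint(fp)
```

The real fingerprint also carries per-channel intensity statistics; the same median-over-cases logic applies throughout the planner.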

How the Fingerprint Drives Decisions

Every property maps to specific pipeline choices. Here’s the chain of logic:

SPACING
Voxel spacing → Target resolution

nnU-Net sets the target resampling resolution to the median voxel spacing across the dataset. For anisotropic data (e.g., thick-slice CT with 0.7×0.7×5.0mm), it uses a lower percentile for the coarse axis to avoid excessive upsampling. BraTS is already 1mm isotropic, so no resampling is needed.
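The spacing rule can be sketched directly. The 3× anisotropy threshold and the 10th-percentile fallback mirror nnU-Net's published heuristic, but treat the exact constants as assumptions:

```python
import numpy as np

def target_spacing(spacings, anisotropy_threshold=3.0):
    """Median spacing per axis; for a strongly anisotropic axis, fall back to
    the 10th percentile so thick-slice cases are not upsampled excessively."""
    spacings = np.asarray(spacings, dtype=float)
    target = np.median(spacings, axis=0)
    worst = np.argmax(target)                  # coarsest axis
    others = np.delete(target, worst)
    if target[worst] > anisotropy_threshold * others.min():
        target[worst] = np.percentile(spacings[:, worst], 10)
    return target

iso = target_spacing([[1.0, 1.0, 1.0]] * 5)   # BraTS-like: stays at 1 mm
aniso = target_spacing([[5.0, 0.7, 0.7], [6.0, 0.7, 0.7], [4.0, 0.7, 0.7]])
```

For the thick-slice example, the z target drops toward the finest cases (4.2 mm instead of the 5.0 mm median), reducing how much those cases must be upsampled.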

SIZE
Image shape → Patch size & cascade decision

If the median image (after resampling) is small enough to cover with a single 3D patch within GPU memory, no cascade is needed. If images are very large, the 3D low-resolution + cascade configuration is generated. For BraTS (240×240×155), the full-resolution 3D config can use patches of ~128×128×128, which is sufficient — no cascade needed.

DEPTH
Patch size → Network depth

Downsampling continues until the feature map drops below 4–8 voxels per axis. A 128×128×128 patch with stride-2 downsampling yields feature maps of 128→64→32→16→8→4, giving 5 downsampling stages. Each stage doubles the feature channels: 32→64→128→256→320 (capped at 320 for 3D).
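The depth rule is a simple loop. A sketch, assuming stride-2 everywhere; note the returned channel list includes the bottleneck stage, so it has one more entry than the number of downsamplings:

```python
def plan_stages(patch_size, base_channels=32, max_channels=320, min_edge=4):
    """Count stride-2 downsamplings until the smallest feature-map edge would
    fall below min_edge, doubling channels each stage (capped for 3D)."""
    shape = list(patch_size)
    channels = [base_channels]
    while min(shape) // 2 >= min_edge:
        shape = [s // 2 for s in shape]
        channels.append(min(channels[-1] * 2, max_channels))
    return shape, channels

final_shape, channels = plan_stages((128, 128, 128))
# 128 -> 64 -> 32 -> 16 -> 8 -> 4: five downsampling stages
```

nnU-Net additionally handles anisotropic patches by downsampling each axis independently, which this isotropic sketch omits.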

MEMORY
GPU memory → Batch size

Given the network topology and patch size, nnU-Net maximizes batch size to fill available GPU memory. For BraTS on a typical GPU (8–12GB), this usually means batch size 2. The patch size, network depth, and batch size are jointly optimized — they can’t be set independently.

INTENSITY
Modality type → Normalization scheme

CT images get global Z-score normalization (mean and std computed across the entire dataset). MRI images get per-image Z-score normalization (each scan normalized using its own foreground statistics), because MRI intensities are arbitrary and vary across scanners. nnU-Net detects CT vs MRI from the dataset.json channel names.

Preprocessing: What Happens to Your Data

Once the plan is made, nnU-Net preprocesses every training case and saves the result. Understanding these steps helps you debug when things go wrong.

Step 1: Cropping to Non-Zero Region

The image is cropped to the bounding box of non-zero voxels, removing empty background. This reduces the volume the network needs to process. For skull-stripped brain MRI, this removes the empty space around the brain. The crop coordinates are saved so they can be reversed during inference.
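The crop itself is a few lines of numpy. A minimal sketch that returns the crop slices so the operation can be undone at inference time:

```python
import numpy as np

def crop_to_nonzero(image):
    """Crop to the bounding box of non-zero voxels; return the slices too."""
    coords = np.argwhere(image != 0)
    lo = coords.min(axis=0)
    hi = coords.max(axis=0) + 1
    slices = tuple(slice(a, b) for a, b in zip(lo, hi))
    return image[slices], slices

vol = np.zeros((8, 8, 8))
vol[2:5, 3:6, 1:4] = 1.0      # a small non-zero region
cropped, bbox = crop_to_nonzero(vol)
```

At inference, pasting the prediction back into a zero volume at `bbox` restores the original dimensions.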

Step 2: Resampling

Images are resampled to the target spacing using third-order spline interpolation (preserves smooth intensity gradients). Labels are resampled using nearest-neighbor interpolation (preserves discrete class boundaries without creating invalid intermediate values). For anisotropic data, the low-resolution axis may use first-order interpolation to avoid ringing artifacts.
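The image/label split in interpolation orders can be sketched with `scipy.ndimage.zoom`; nnU-Net's actual resampling code handles anisotropy and memory more carefully, so this is illustrative only:

```python
import numpy as np
from scipy.ndimage import zoom

def resample(image, labels, spacing, target_spacing):
    """Third-order spline for intensities, nearest-neighbor for labels."""
    factors = [s / t for s, t in zip(spacing, target_spacing)]
    return (zoom(image, factors, order=3),   # smooth intensity interpolation
            zoom(labels, factors, order=0))  # no invalid intermediate labels

img = np.random.rand(10, 10, 4)
lab = (img > 0.5).astype(np.uint8)
img_r, lab_r = resample(img, lab, spacing=(1.0, 1.0, 2.5),
                        target_spacing=(1.0, 1.0, 1.0))
```

Resampling labels with `order=3` is a classic bug: it produces fractional "classes" like 1.7 at boundaries, which is exactly what `order=0` prevents.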

Step 3: Intensity Normalization

For MRI (including BraTS): each image is independently Z-score normalized using its foreground mask statistics. The foreground is defined as all non-zero voxels. This means: normalized = (image - foreground_mean) / foreground_std. Voxels outside the foreground remain zero. This per-image approach handles the arbitrary intensity scale of MRI across different scanners and protocols.
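The formula above translates directly to code. A minimal sketch of the per-image foreground z-score:

```python
import numpy as np

def zscore_foreground(image):
    """MRI-style normalization: z-score over non-zero (foreground) voxels
    only, leaving background voxels at zero."""
    mask = image != 0
    mean, std = image[mask].mean(), image[mask].std()
    out = np.zeros_like(image, dtype=np.float64)
    out[mask] = (image[mask] - mean) / max(std, 1e-8)
    return out

scan = np.zeros((4, 4, 4))
scan[1:3, 1:3, 1:3] = np.arange(8, dtype=float).reshape(2, 2, 2) + 100.0
norm = zscore_foreground(scan)
```

After normalization the foreground has mean 0 and standard deviation 1, regardless of the scanner's arbitrary intensity scale.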

⚠️
Critical: preprocessing must match between training and inference. nnU-Net handles this automatically — the same preprocessing is applied during both phases. But if you’re building a custom pipeline that feeds into nnU-Net, any mismatch (different normalization, different resampling) will silently degrade performance. One study showed that using N4 bias correction alone without intensity normalization caused AUC to drop from 0.85 to 0.19 on external data.

2D, 3D Full-Res, and 3D Cascade

nnU-Net doesn’t just train one model — it trains up to three fundamentally different configurations and then selects the best. Understanding when each configuration is generated and when it excels is essential for interpreting your results.

2D Configuration

Always generated. Processes individual 2D slices extracted along the axis with the highest in-plane resolution. Uses 2D convolutions with up to 512 feature channels (more than 3D because memory is cheaper in 2D). The 2D config serves as a baseline and sometimes wins on datasets with extreme anisotropy (thick slices) or very high in-plane resolution where 3D patches would be too small to capture context.

3D Full-Resolution Configuration

Generated when images fit in GPU memory. Processes 3D patches at the dataset’s native (post-resampling) resolution. Feature channels capped at 320 to fit memory constraints. This is the configuration that usually wins for brain tumor segmentation because full 3D context matters for distinguishing enhancing tumor, tumor core, and edema. On BraTS, it typically achieves Dice scores 2–4 points higher than the 2D config.

3D Low-Resolution + Cascade

Generated only when images are too large for full-resolution 3D processing. Stage 1 trains a 3D U-Net at reduced resolution to capture global context. Stage 2 trains another 3D U-Net at full resolution, receiving the Stage 1 prediction as an additional input channel. The cascade is omitted for BraTS because the volumes (240×240×155) are manageable at full resolution. It’s more relevant for tasks like full-body CT segmentation.

nnU-Net by the numbers:

0.924: nnU-Net Dice on the AMOS challenge, significantly outperforming Auto3DSeg (0.902).
0.965: Dice on neuroblastic tumors, matching inter-observer variability (0.969).
92.8%: Time savings vs. manual segmentation in a clinical validation study.

nnU-Net’s Augmentation Pipeline

Data augmentation is applied on-the-fly during training — each batch sees a different random transformation of the same training data, effectively creating infinite training variety without storage overhead. Here’s exactly what nnU-Net applies and why:

Spatial Augmentations (Applied First)

Rotation, Scaling, Elastic Deformation

Rotation: Random rotations of ±15° to ±30° around each axis. Simulates different head positions in the scanner.

Scaling: Random zoom between 0.85× and 1.25×. Accounts for natural variation in brain and tumor size.

Elastic deformation: Controlled by α (deformation magnitude, ~1000) and σ (smoothness, ~10). Produces anatomically plausible shape variations.

Mirroring: Random flipping along all applicable axes. Near-free augmentation since brains are roughly symmetric.

Intensity Augmentations (Applied Second)

Gamma, Noise, Brightness

Gamma correction: Random gamma values (typically 0.7–1.5) to simulate scanner brightness/contrast variations.

Gaussian noise: Small random noise added to improve robustness to image quality differences.

Brightness/contrast shifts: Simulate the inter-scanner intensity variability that is the bane of multi-institutional MRI studies.
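A minimal sketch of the two most common intensity augmentations, applied on the fly to a normalized patch. The gamma range follows the values quoted above; the noise standard deviation is an assumption:

```python
import numpy as np

def augment_intensity(patch, rng, noise_std=0.05):
    """Random gamma correction plus additive Gaussian noise."""
    # Gamma is applied on intensities rescaled to [0, 1], then mapped back
    lo, hi = patch.min(), patch.max()
    gamma = rng.uniform(0.7, 1.5)
    patch = ((patch - lo) / (hi - lo + 1e-8)) ** gamma * (hi - lo) + lo
    # Additive Gaussian noise for robustness to image-quality differences
    return patch + rng.normal(0.0, noise_std, size=patch.shape)

rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32, 32))
aug = augment_intensity(patch.copy(), rng)
```

Because both transforms are drawn fresh per batch, the model never sees the same patch twice, which is the "infinite variety" the section describes.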

📚
Why this specific combination? These augmentation parameters were determined empirically across nnU-Net’s evaluation on 23 diverse datasets. No single augmentation is responsible for nnU-Net’s success — it’s the full combination that works. A systematic review of 300+ articles confirmed that data augmentation is effective across organs, modalities, and dataset sizes, but the specific strategies must produce plausible medical images.

Training: The 1000-Epoch Grind

5-Fold Cross-Validation

nnU-Net splits training data into 5 folds. Critical rule: all scans from the same patient stay in the same fold to prevent data leakage. Each fold uses 80% for training and 20% for validation. This produces 5 independently trained models, each seeing a different validation set.

Cross-validation serves three purposes: (1) robust performance estimation — averaging across 5 validation sets is more reliable than a single split; (2) configuration selection — comparing 2D vs 3D vs cascade performance; (3) ensemble at inference — averaging predictions from all 5 models.
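The patient-grouping rule can be demonstrated with scikit-learn's `GroupKFold`. nnU-Net implements its own splitting internally; this sketch just shows the leakage-prevention principle:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Ten scans from five patients (two scans each); grouping by patient ID
# guarantees no patient straddles the train/validation boundary of a fold.
scan_ids = np.arange(10)
patient_ids = np.repeat(np.arange(5), 2)

folds = list(GroupKFold(n_splits=5).split(scan_ids, groups=patient_ids))
for train_idx, val_idx in folds:
    # Leakage check: no patient appears on both sides of the split
    assert not set(patient_ids[train_idx]) & set(patient_ids[val_idx])
```

If you split on scans instead of patients, two scans of the same tumor can land in train and validation, inflating your Dice estimate.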

Fixed Training Parameters

The Recipe That Never Changes

Loss: Dice + cross-entropy (equally weighted). The Dice component handles class imbalance; the CE component provides stable gradients.

Optimizer: SGD with Nesterov momentum (momentum = 0.99, weight decay = 3×10⁻⁵).

Learning rate: Initial LR = 0.01, polynomial decay to near zero over 1000 epochs. Formula: lr = initial_lr × (1 - epoch/max_epoch)^0.9.

Epochs: 1000 (each epoch = 250 training iterations). No early stopping — the final model is used, not the best checkpoint.

Normalization: Instance normalization (not batch norm).

Activation: LeakyReLU (slope = 0.01).

Deep supervision: Loss computed at multiple decoder resolutions during training; only the full-resolution output is used at inference.
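The learning-rate formula in the recipe is short enough to write out directly:

```python
def poly_lr(epoch, initial_lr=0.01, max_epochs=1000, exponent=0.9):
    """nnU-Net's polynomial learning-rate decay schedule."""
    return initial_lr * (1 - epoch / max_epochs) ** exponent

# Monotonically decaying from 0.01 toward zero over 1000 epochs
lrs = [poly_lr(e) for e in (0, 500, 999)]
```

Unlike step schedules, this decays smoothly every epoch, which pairs well with the fixed 1000-epoch budget and the no-early-stopping policy.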

Foreground Oversampling

One-third of patches in each batch are guaranteed to contain at least one foreground (tumor) voxel. This prevents the model from only seeing background patches, which would teach it to predict “no tumor everywhere.” This is nnU-Net’s solution to the class imbalance problem at the data level (complementing the Dice loss solution at the objective level).
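A simplified sketch of the oversampling rule: when a patch is "forced", its center is drawn from the foreground voxels, so the patch is guaranteed to contain tumor. The batch-level bookkeeping is an assumption; nnU-Net's real sampler also handles patch boundaries and multiple classes:

```python
import numpy as np

def sample_patch_center(label_volume, force_foreground, rng):
    """Pick a patch center; when forced, center on a random foreground voxel."""
    if force_foreground:
        fg = np.argwhere(label_volume > 0)
        return tuple(fg[rng.integers(len(fg))])
    return tuple(rng.integers(0, s) for s in label_volume.shape)

rng = np.random.default_rng(42)
labels = np.zeros((64, 64, 64), dtype=np.uint8)
labels[30:34, 30:34, 30:34] = 1   # a small "tumor"

# Batch of 3: the first patch is forced to contain foreground (1 in 3)
centers = [sample_patch_center(labels, i < 1, rng) for i in range(3)]
```

Without the forced third, a tiny lesion in a large volume would almost never appear in a randomly sampled patch.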

Training Time Expectations

For BraTS on a single NVIDIA A100 or V100: expect roughly 12–24 hours per fold, so 2.5–5 days for all 5 folds of one configuration. Training both 2D and 3D configs across all folds can take over a week on a single GPU. A study on dataset size requirements found that performance plateaus at roughly 80% of the available data, with the BraTS task reaching a Dice plateau around 0.79 with a 3D config.

Inference: Sliding Window & Ensembling

Training processes patches. But inference must produce a prediction for the entire volume. Here’s exactly how nnU-Net does it:

Step 1: Preprocessing (Identical to Training)

The input image is cropped, resampled, and normalized using the exact same parameters determined during planning. Any deviation here silently breaks everything.

Step 2: Sliding Window Prediction

A patch-sized window slides across the full volume with 50% overlap (default). At each position, the model produces softmax probabilities for every voxel in the patch. The 50% overlap means most voxels are predicted by multiple patches.

Step 3: Gaussian Weighting

Each patch prediction is multiplied by a Gaussian kernel — center voxels get full weight, edge voxels get lower weight. This prevents boundary artifacts where patches meet. The weighted predictions are summed across all overlapping patches and then normalized.
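The weighting-and-normalization step can be demonstrated in one dimension: with a constant "prediction" of 1.0 everywhere, the Gaussian-weighted accumulation followed by normalization returns exactly 1.0 at every voxel, showing that overlap seams average out. The 1/8 sigma scale matches nnU-Net's default; the rest is a simplified sketch:

```python
import numpy as np

def gaussian_map(patch_size, sigma_scale=1 / 8):
    """Importance map: full weight at the patch center, low weight at edges."""
    coords = [np.arange(s) - (s - 1) / 2 for s in patch_size]
    grid = np.meshgrid(*coords, indexing="ij")
    sigmas = [s * sigma_scale for s in patch_size]
    dist = sum((g / sg) ** 2 for g, sg in zip(grid, sigmas))
    return np.exp(-0.5 * dist)

def sliding_window_1d(volume_len, patch_len, step):
    """Accumulate Gaussian-weighted constant predictions, then normalize."""
    acc = np.zeros(volume_len)
    weight = np.zeros(volume_len)
    g = gaussian_map((patch_len,))
    for z in range(0, volume_len - patch_len + 1, step):
        acc[z:z + patch_len] += 1.0 * g   # pretend the model output 1.0
        weight[z:z + patch_len] += g
    return acc / np.maximum(weight, 1e-8)

out = sliding_window_1d(16, 8, step=4)    # 50% overlap along one axis
```

In 3D the same accumulation runs per class over softmax probabilities, with one accumulator and one weight volume each.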

Step 4: Fold Ensembling

The sliding window process is repeated for each of the 5 fold models. Softmax probabilities from all 5 models are averaged before taking the final argmax. This ensemble typically adds 1–2 Dice points over any single model.
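Ensembling is a single averaging step over the fold dimension, done before the argmax. A sketch using random stand-in probabilities (the real inputs would be five sliding-window outputs):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical softmax maps from 5 fold models over a 4-class, 8^3 volume.
# Dirichlet samples sum to 1 over the last axis, like softmax outputs.
fold_probs = rng.dirichlet(np.ones(4), size=(5, 8, 8, 8))
fold_probs = np.moveaxis(fold_probs, -1, 1)   # -> (fold, class, z, y, x)

ensembled = fold_probs.mean(axis=0)           # average BEFORE the argmax
segmentation = ensembled.argmax(axis=0)       # discrete labels per voxel
```

Averaging probabilities rather than label maps lets a confident minority model overrule an uncertain majority, which is where the 1–2 Dice-point gain comes from.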

Step 5: Post-Processing

nnU-Net automatically determines (during cross-validation) whether to apply connected component analysis — removing small isolated predictions that are likely false positives. The argmax converts probabilities to discrete labels, and the result is resampled back to the original image resolution and un-cropped to restore the original spatial dimensions.

Inference is fast: a benchmarking study found nnU-Net achieved best-in-class segmentation in 1.456 seconds for midbrain structures. For BraTS-sized volumes with 5-fold ensembling, expect roughly 30–60 seconds per case on a modern GPU.

nnU-Net v2: What’s New and Why It Matters

nnU-Net v2 is a significant rewrite that maintains the same philosophy while adding capabilities specifically relevant to BraTS-style challenges.

Residual Encoder (ResEnc)

v2 adds an optional residual encoder that adds skip connections within encoder blocks (in addition to the U-Net skip connections between encoder and decoder). This improves gradient flow in deeper networks and can improve performance on challenging tasks. Studies confirm that residual connections within encoder blocks enhance segmentation, particularly with limited training data.

Region-Based Training (Critical for BraTS)

BraTS uses hierarchical labels: enhancing tumor (ET) is a subset of tumor core (TC), which is a subset of whole tumor (WT). Standard multi-class segmentation treats these as independent classes, but they’re nested. v2 supports region-based training where the model predicts regions (WT, TC, ET) as separate binary tasks, naturally encoding the hierarchy. This is how top BraTS teams configure their nnU-Net submissions.

New Dataset Format

v2 uses DatasetXXX_Name instead of v1’s TaskXXX_Name. The dataset.json format is also updated, with more explicit channel and label definitions. If you’re following older tutorials, watch for this naming difference — it’s the most common source of “dataset not found” errors.

Modular Configuration System

v2 is far more extensible than v1. Custom trainers, planners, and architectures can be plugged in without modifying core framework code. This makes it practical to use nnU-Net as a framework (modifying components) rather than just a tool (running as-is). The nnUnetFormer paper demonstrated this by fusing transformer modules into nnU-Net’s deeper layers, achieving 0.936/0.921/0.872 Dice for WT/TC/ET on BraTS 2021.

Federated Learning Extension

FednnU-Net (2025) adds federated learning to nnU-Net, allowing multiple institutions to collaboratively train a model without sharing patient data. It introduces Federated Fingerprint Extraction and Asymmetric Federated Averaging, demonstrating consistent performance across 18 institutions. This is the future of multi-institutional brain tumor segmentation research.

Customizing nnU-Net for BraTS

nnU-Net works remarkably well out of the box, but top challenge teams always customize. Here are the modifications that matter most for brain tumor segmentation:

Enable Region-Based Training

Configure dataset.json to define regions instead of classes. For BraTS: Whole Tumor = labels {1, 2, 3}, Tumor Core = labels {1, 3}, Enhancing Tumor = label {3}. This aligns the training objective with BraTS’s evaluation metrics and typically improves Dice by 1–2 points on the nested sub-regions.
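Converting a BraTS label map into the three nested region channels is mechanical. A minimal sketch using the label convention stated above (1 = necrotic core, 2 = edema, 3 = enhancing tumor):

```python
import numpy as np

REGIONS = {"WT": (1, 2, 3), "TC": (1, 3), "ET": (3,)}

def labels_to_regions(label_map):
    """Turn a multi-class label map into nested binary region channels."""
    return np.stack(
        [np.isin(label_map, r) for r in REGIONS.values()]
    ).astype(np.uint8)

seg = np.array([[0, 1, 2],
                [3, 3, 0]])
regions = labels_to_regions(seg)   # shape (3, 2, 3): WT, TC, ET channels
```

Each region becomes an independent binary task with a sigmoid output, so the WT ⊇ TC ⊇ ET hierarchy is encoded in the targets rather than left for the model to discover.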

Custom Trainer for Learning Rate or Loss

Create a custom trainer class that inherits from nnU-Net’s default trainer. Override the loss function (e.g., add focal loss or boundary loss), change the optimizer (switch to Adam), or modify the learning rate schedule. Challenge-winning modifications are often surprisingly small — a different loss weighting or an additional augmentation can be the difference.
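The most common override target is the loss. As a reference point for what you would be replacing or reweighting, here is a numpy sketch of the default equally weighted Dice + cross-entropy objective; real training uses the PyTorch equivalent inside the trainer class:

```python
import numpy as np

def dice_ce_loss(probs, onehot, eps=1e-5):
    """Soft Dice + cross-entropy over a (class, z, y, x) probability map.
    Numpy sketch of nnU-Net's default objective, equally weighted."""
    # Soft Dice, averaged over classes
    inter = (probs * onehot).sum(axis=(1, 2, 3))
    denom = probs.sum(axis=(1, 2, 3)) + onehot.sum(axis=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)
    # Cross-entropy averaged over voxels
    ce = -(onehot * np.log(probs + eps)).sum(axis=0).mean()
    return dice.mean() + ce

# Two-class toy volume: near-perfect vs. uninformative predictions
gt = (np.arange(64).reshape(4, 4, 4) % 2).astype(float)
onehot = np.stack([1.0 - gt, gt])
good = dice_ce_loss(onehot * 0.98 + 0.01, onehot)
bad = dice_ce_loss(np.full_like(onehot, 0.5), onehot)
```

A focal or boundary term would typically be added to (or swapped for) the CE component while keeping the Dice term, since Dice is what carries the class-imbalance handling.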

Handling Missing MRI Sequences

Real clinical data often has missing modalities (e.g., no T2 or no FLAIR). A multi-center study on 1,731 cases from 12 hospitals showed that sparsified training — randomly dropping input channels during training — allows nnU-Net to handle missing sequences at inference without significant performance loss. This is invaluable for clinical deployment.
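Sparsified training reduces to one extra augmentation: randomly zeroing input channels. A sketch, with the drop probability and the keep-at-least-one guard as assumptions about the cited study's setup:

```python
import numpy as np

def drop_channels(x, rng, p_drop=0.25):
    """Zero out random input modalities so the model learns to cope with
    missing sequences; always keep at least one channel."""
    keep = rng.random(x.shape[0]) >= p_drop
    if not keep.any():
        keep[rng.integers(x.shape[0])] = True
    return x * keep[:, None, None, None]

rng = np.random.default_rng(3)
scan = np.ones((4, 8, 8, 8))   # T1, T1ce, T2, FLAIR stacked as channels
sparse = drop_channels(scan, rng)
```

At inference, a case with a genuinely missing FLAIR is then just another sample the model has already seen thousands of analogues of during training.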

💡
The competition strategy: Start with nnU-Net out of the box as your baseline. Record the Dice scores. Then make one modification at a time: enable region-based training, try the residual encoder, add test-time augmentation, experiment with post-processing. Keep every change that improves validation performance, revert those that don’t. This disciplined approach is how top teams iterate — not by rewriting everything from scratch.

nnU-Net vs. Other Frameworks: The Evidence

How does nnU-Net compare when tested head-to-head against other automated frameworks on identical benchmarks?

nnU-Net vs. MONAI Auto3DSeg (AMOS Challenge)

A 2025 study evaluated both on the AMOS abdominal segmentation challenge. nnU-Net: Dice 0.924. Auto3DSeg: Dice 0.902. The difference was statistically significant, and physicians qualitatively preferred nnU-Net outputs (P=0.0027). On breast MRI, the gap was much smaller (0.946 vs 0.940), suggesting Auto3DSeg is competitive for some tasks.

nnU-Net vs. DeepMedic & NVIDIA-net (Brain Tumors)

Tested on 1,251 BraTS cases plus 480 real-world clinical cases from 12 hospitals. nnU-Net achieved the highest Dice (0.86 internal, 0.93 external) and lowest Hausdorff distances for all tumor classes (P<0.001). DeepMedic and NVIDIA-net were competitive but consistently lower.

nnU-Net vs. SwinUNETR (AMOS)

In the same AMOS study: nnU-Net 0.924 vs SwinUNETR 0.837. The transformer-based architecture lagged significantly behind, reinforcing the pattern from Week 4 — systematic optimization of the full pipeline beats architectural novelty.

⚠️
An important caveat: A study on stroke lesion segmentation found that nnU-Net achieved excellent segmentation metrics but failed to detect therapy-induced volume reductions, leading to false-negative study outcomes. High Dice scores don’t always mean clinical utility. Always validate against the actual clinical question, not just overlap metrics.

This Week’s Learning Resources

Hands-On (Do These This Week)

The official step-by-step guide. This week, work through it end-to-end on BraTS data: plan, preprocess, train one fold, run inference, examine outputs. Read the Plans JSON file.
Official documentation explaining every field in nnUNetPlans.json. Cross-reference this with your own Plans file to understand every decision nnU-Net made about your data.
Make sure your three environment variables (nnUNet_raw, nnUNet_preprocessed, nnUNet_results) are correctly set. Misconfigured paths are the #1 source of beginner errors.
The most comprehensive written tutorial. Covers both theory and practice: fingerprinting, the three configurations, and running on custom datasets. Written during a Cambridge research internship.
Includes a Google Colab notebook. Run nnU-Net without local GPU setup. Great for a first end-to-end experiment.

Key Papers

The foundational paper. This week, read beyond the abstract — study the Methods section, Supplementary Tables, and especially the ablation experiments. This is the primary source for understanding every design decision.
Nat Methods. 2021;18(2):203–211
Demonstrates how to extend nnU-Net by fusing transformer modules into deeper layers. Achieved 0.936/0.921/0.872 Dice on BraTS 2021 for WT/TC/ET. A model for how top teams customize the framework.
Phys Med Biol. 2023;68(23):235009
The definitive real-world clinical validation: 12 hospitals, 480 clinical cases, missing sequences handled via sparsified training. Proves nnU-Net works outside of curated challenge datasets.
Sci Rep. 2023;13:19474
Head-to-head comparison on identical benchmarks with physician evaluation. nnU-Net significantly outperformed both alternatives. The quantitative evidence for nnU-Net’s dominance.
Intell Oncol. 2025;4(1):100119
Multi-institutional pediatric study proving nnU-Net’s robustness across sites (Dice 0.80–0.86). Shows how 5-fold cross-validation handles site-specific variation.
Radiol Artif Intell. 2024;6(3):e230115

Deep Dives (Advanced)

Federated fingerprint extraction and asymmetric federated averaging across 18 institutions. The future of privacy-preserving multi-institutional segmentation.
Continual learning framework built on nnU-Net. How to train models that learn new tasks without forgetting old ones. Relevant for clinical deployment where new data arrives continuously.
How to train nnU-Net when different subsets of your data have different labels annotated. Relevant when combining data from multiple sources with inconsistent annotations.
An nnU-Net model trained to segment 80 anatomical structures across any MRI sequence. Demonstrates the scale of what’s possible with nnU-Net as a foundation.