Last week you learned what nnU-Net does. This week you learn how it does it — every internal decision, from how the dataset fingerprint drives architecture choices to exactly what happens during inference. By the end, you’ll be able to read an nnUNetPlans.json file and understand every line.
When you run nnUNetv2_plan_and_preprocess, the first thing nnU-Net does is create a dataset fingerprint — a JSON file containing every property of your data that matters for pipeline design. This fingerprint drives every subsequent decision. Understanding it is the key to understanding nnU-Net.
- Median image shape: The typical volume dimensions across your dataset (e.g., 240×240×155 for BraTS). Determines the upper bound on patch size and whether a cascade is needed.
- Voxel spacing distribution: The physical size of each voxel in mm (e.g., 1.0×1.0×1.0 for BraTS). Determines the target resampling resolution. If one axis has much larger spacing than the others (anisotropic data), nnU-Net uses special handling.
- Intensity statistics per modality: Mean, standard deviation, and percentiles of voxel intensities. Determines whether CT-style (global) or MRI-style (per-image) normalization is used.
- Class frequencies: How many voxels belong to each class. For BraTS, the background vastly outnumbers all tumor classes. This influences post-processing decisions (whether to remove small connected components).
- Number of classes: Determines the output channels of the network.
- Dataset size: Number of training cases. Affects batch size and whether all configurations are worth training.
After fingerprinting, examine the output:
# After running plan_and_preprocess, look at these files:
nnUNet_preprocessed/Dataset001_BraTS/
├── dataset_fingerprint.json # Raw dataset properties
├── nnUNetPlans.json # All pipeline decisions
├── nnUNetPlans_2d/ # 2D config preprocessed data
└── nnUNetPlans_3d_fullres/ # 3D config preprocessed data
After running nnUNetv2_plan_and_preprocess -d 001 --verify_dataset_integrity, open dataset_fingerprint.json and nnUNetPlans.json in a text editor. Trace each value in the fingerprint to the decisions in the plans file — this exercise is worth more than reading ten tutorials.

Every property maps to specific pipeline choices. Here’s the chain of logic:
nnU-Net sets the target resampling resolution to the median voxel spacing across the dataset. For anisotropic data (e.g., thick-slice CT with 0.7×0.7×5.0mm), it uses a lower percentile for the coarse axis to avoid excessive upsampling. BraTS is already 1mm isotropic, so no resampling is needed.
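The spacing rule above can be sketched in a few lines. This is a simplified illustration, not nnU-Net's actual planner code — the function name, the anisotropy threshold of 3×, and the 10th percentile are assumptions chosen to match the behavior described:

```python
import numpy as np

def compute_target_spacing(spacings, anisotropy_threshold=3.0):
    # Simplified sketch of the rule described above; nnU-Net's real
    # ExperimentPlanner differs in detail (threshold, percentile choice).
    spacings = np.asarray(spacings, dtype=float)
    target = np.median(spacings, axis=0)              # default: per-axis median
    coarse = int(np.argmax(target))
    finest_other = min(target[a] for a in range(3) if a != coarse)
    # Anisotropic case: use a lower percentile for the coarse axis so that
    # axis is not upsampled excessively during resampling.
    if target[coarse] > anisotropy_threshold * finest_other:
        target[coarse] = np.percentile(spacings[:, coarse], 10)
    return target

# BraTS-like isotropic data: the median spacing is kept as-is.
iso = [[1.0, 1.0, 1.0]] * 10
print(compute_target_spacing(iso))                    # [1. 1. 1.]

# Thick-slice CT-like data: the coarse z axis gets a lower percentile.
ct = [[0.7, 0.7, z] for z in [2.0, 3.0, 4.0] + [5.0] * 7]
print(compute_target_spacing(ct)[2] < 5.0)            # True
```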
If the median image (after resampling) is small enough to cover with a single 3D patch within GPU memory, no cascade is needed. If images are very large, the 3D low-resolution + cascade configuration is generated. For BraTS (240×240×155), the full-resolution 3D config can use patches of ~128×128×128, which is sufficient — no cascade needed.
Downsampling continues until the feature map drops below 4–8 voxels per axis. A 128×128×128 patch with stride-2 downsampling yields 128→64→32→16→8→4, giving 5 downsampling stages. Each stage doubles the feature channels: 32→64→128→256→320 (capped at 320 for 3D).
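The stage-counting rule can be replayed directly. A minimal sketch (not nnU-Net's planner code; the function name and the stopping rule of "any axis would drop below 4 voxels" are assumptions consistent with the description above):

```python
def plan_topology(patch_size=(128, 128, 128), base_channels=32,
                  max_channels=320, min_edge=4):
    """Halve the feature map until any axis would fall below min_edge,
    doubling channel counts but capping them at max_channels."""
    sizes, channels = [list(patch_size)], [base_channels]
    while all(s // 2 >= min_edge for s in sizes[-1]):
        sizes.append([s // 2 for s in sizes[-1]])
        channels.append(min(channels[-1] * 2, max_channels))
    return sizes, channels

sizes, channels = plan_topology()
print(len(sizes) - 1)  # 5 downsampling stages
print(channels)        # [32, 64, 128, 256, 320, 320]
```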
Given the network topology and patch size, nnU-Net maximizes batch size to fill available GPU memory. For BraTS on a typical GPU (8–12GB), this usually means batch size 2. The patch size, network depth, and batch size are jointly optimized — they can’t be set independently.
CT images get global Z-score normalization (mean and std computed across the entire dataset). MRI images get per-image Z-score normalization (each scan normalized using its own foreground statistics), because MRI intensities are arbitrary and vary across scanners. nnU-Net detects CT vs MRI from the dataset.json channel names.
Once the plan is made, nnU-Net preprocesses every training case and saves the result. Understanding these steps helps you debug when things go wrong.
The image is cropped to the bounding box of non-zero voxels, removing empty background. This reduces the volume the network needs to process. For skull-stripped brain MRI, this removes the empty space around the brain. The crop coordinates are saved so they can be reversed during inference.
Images are resampled to the target spacing using third-order spline interpolation (preserves smooth intensity gradients). Labels are resampled using nearest-neighbor interpolation (preserves discrete class boundaries without creating invalid intermediate values). For anisotropic data, the low-resolution axis may use first-order interpolation to avoid ringing artifacts.
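The interpolation-order distinction can be demonstrated with `scipy.ndimage.zoom` (nnU-Net's actual resampling code is more involved, handling anisotropic axes separately; this is only a sketch of the order=3 vs order=0 rule):

```python
import numpy as np
from scipy.ndimage import zoom

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8, 8))                 # continuous intensities
labels = rng.integers(0, 4, size=(8, 8, 8))        # discrete classes 0..3

factor = 2.0  # e.g. resampling 2.0 mm spacing down to 1.0 mm
resampled_image = zoom(image, factor, order=3)     # third-order spline
resampled_labels = zoom(labels, factor, order=0)   # nearest neighbour

print(resampled_image.shape)                       # (16, 16, 16)
# Nearest neighbour never invents labels that were not there before:
print(set(np.unique(resampled_labels)) <= {0, 1, 2, 3})  # True
```

Had the labels been resampled with a spline, interpolation between class 1 and class 3 could produce the invalid value 2 — exactly the failure mode the nearest-neighbour rule prevents.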
For MRI (including BraTS): each image is independently Z-score normalized using its foreground mask statistics. The foreground is defined as all non-zero voxels. This means: normalized = (image - foreground_mean) / foreground_std. Voxels outside the foreground remain zero. This per-image approach handles the arbitrary intensity scale of MRI across different scanners and protocols.
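The per-image normalization formula translates directly to code. A minimal sketch (nnU-Net derives the foreground mask during cropping; here it is simply the non-zero voxels, as described above):

```python
import numpy as np

def normalize_mri(image):
    """Per-image Z-score over the non-zero foreground; background stays zero."""
    mask = image != 0
    mean, std = image[mask].mean(), image[mask].std()
    out = np.zeros_like(image, dtype=float)
    out[mask] = (image[mask] - mean) / max(std, 1e-8)
    return out

scan = np.zeros((4, 4, 4))
scan[1:3, 1:3, 1:3] = [[[100, 120], [110, 130]], [[105, 125], [115, 135]]]
normed = normalize_mri(scan)
print(abs(normed[scan != 0].mean()) < 1e-6)   # True: foreground mean ~0
print(normed[scan == 0].max() == 0.0)         # True: background untouched
```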
nnU-Net doesn’t just train one model — it trains up to three fundamentally different configurations and then selects the best. Understanding when each configuration is generated and when it excels is essential for interpreting your results.
Always generated. Processes individual 2D slices extracted along the axis with the highest in-plane resolution. Uses 2D convolutions with up to 512 feature channels (more than 3D because memory is cheaper in 2D). The 2D config serves as a baseline and sometimes wins on datasets with extreme anisotropy (thick slices) or very high in-plane resolution where 3D patches would be too small to capture context.
Generated when images fit in GPU memory. Processes 3D patches at the dataset’s native (post-resampling) resolution. Feature channels capped at 320 to fit memory constraints. This is the configuration that usually wins for brain tumor segmentation because full 3D context matters for distinguishing enhancing tumor, tumor core, and edema. On BraTS, it typically achieves Dice scores 2–4 points higher than the 2D config.
Generated only when images are too large for full-resolution 3D processing. Stage 1 trains a 3D U-Net at reduced resolution to capture global context. Stage 2 trains another 3D U-Net at full resolution, receiving the Stage 1 prediction as an additional input channel. The cascade is omitted for BraTS because the volumes (240×240×155) are manageable at full resolution. It’s more relevant for tasks like full-body CT segmentation.
Data augmentation is applied on-the-fly during training — each batch sees a different random transformation of the same training data, effectively creating infinite training variety without storage overhead. Here’s exactly what nnU-Net applies and why:
- Rotation: Random rotations of ±15° to ±30° around each axis. Simulates different head positions in the scanner.
- Scaling: Random zoom between 0.85× and 1.25×. Accounts for natural variation in brain and tumor size.
- Elastic deformation: Controlled by α (deformation magnitude, ~1000) and σ (smoothness, ~10). Produces anatomically plausible shape variations.
- Mirroring: Random flipping along all applicable axes. Near-free augmentation since brains are roughly symmetric.
- Gamma correction: Random gamma values (typically 0.7–1.5) to simulate scanner brightness/contrast variations.
- Gaussian noise: Small random noise added to improve robustness to image quality differences.
- Brightness/contrast shifts: Simulate the inter-scanner intensity variability that is the bane of multi-institutional MRI studies.
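A few of these transforms, sketched in miniature. This is not nnU-Net's augmentation pipeline (which is built on batchgenerators and is far richer); it only illustrates the mirroring, gamma, and noise steps with assumed parameter ranges taken from the text:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(image, p_mirror=0.5, gamma_range=(0.7, 1.5)):
    """Apply random mirroring, gamma correction, and Gaussian noise."""
    # Random mirroring along each spatial axis.
    for axis in range(image.ndim):
        if rng.random() < p_mirror:
            image = np.flip(image, axis=axis)
    # Gamma correction: rescale intensities to [0, 1], apply the power,
    # then map back to the original range.
    gamma = rng.uniform(*gamma_range)
    lo, hi = image.min(), image.max()
    image = ((image - lo) / (hi - lo + 1e-8)) ** gamma * (hi - lo) + lo
    # Small additive Gaussian noise for robustness.
    return image + rng.normal(0, 0.01, size=image.shape)

vol = augment(np.random.default_rng(0).normal(size=(32, 32, 32)))
print(vol.shape)  # (32, 32, 32) -- shape is always preserved
```

Because the transforms are drawn fresh for every batch, the same case is never seen twice in exactly the same form.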
nnU-Net splits training data into 5 folds. Critical rule: all scans from the same patient stay in the same fold to prevent data leakage. Each fold uses 80% for training and 20% for validation. This produces 5 independently trained models, each seeing a different validation set.
Cross-validation serves three purposes: (1) robust performance estimation — averaging across 5 validation sets is more reliable than a single split; (2) configuration selection — comparing 2D vs 3D vs cascade performance; (3) ensemble at inference — averaging predictions from all 5 models.
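The patient-grouping rule can be sketched with plain Python (nnU-Net itself stores its split in a splits_final.json file; the function here is a hypothetical illustration of group-aware folding, not nnU-Net's implementation):

```python
import numpy as np

def grouped_kfold(case_ids, patient_of, k=5, seed=0):
    """Assign cases to k folds so all scans of one patient share a fold."""
    patients = sorted(set(patient_of[c] for c in case_ids))
    rng = np.random.default_rng(seed)
    rng.shuffle(patients)
    fold_of_patient = {p: i % k for i, p in enumerate(patients)}
    folds = [[] for _ in range(k)]
    for c in case_ids:
        folds[fold_of_patient[patient_of[c]]].append(c)
    return folds

# Two scans per patient: both must always land in the same fold.
cases = [f"case_{i:03d}" for i in range(20)]
patient = {c: f"pat_{int(c.split('_')[1]) // 2}" for c in cases}
folds = grouped_kfold(cases, patient)
fold_index = {c: f for f, fold in enumerate(folds) for c in fold}
print(all(fold_index[f"case_{2*i:03d}"] == fold_index[f"case_{2*i+1:03d}"]
          for i in range(10)))  # True -- no patient is split across folds
```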
- Loss: Dice + cross-entropy (equally weighted). The Dice component handles class imbalance; the CE component provides stable gradients.
- Optimizer: SGD with Nesterov momentum (momentum = 0.99, weight decay = 3×10⁻⁵).
- Learning rate: Initial LR = 0.01, polynomial decay to near zero over 1000 epochs. Formula: lr = initial_lr × (1 − epoch/max_epochs)^0.9.
- Epochs: 1000 (each epoch = 250 training iterations). No early stopping — the final model is used, not the best checkpoint.
- Normalization: Instance normalization (not batch norm).
- Activation: LeakyReLU (slope = 0.01).
- Deep supervision: Loss computed at multiple decoder resolutions during training; only the full-resolution output is used at inference.
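The polynomial decay formula quoted above, made concrete:

```python
def poly_lr(epoch, initial_lr=0.01, max_epochs=1000, exponent=0.9):
    """lr = initial_lr * (1 - epoch/max_epochs)^0.9 -- the schedule above."""
    return initial_lr * (1 - epoch / max_epochs) ** exponent

print(poly_lr(0))               # 0.01 at the start of training
print(round(poly_lr(500), 5))   # ~0.00536 at the halfway point
print(poly_lr(999) < 1e-4)      # True: essentially zero at the end
```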
One-third of patches in each batch are guaranteed to contain at least one foreground (tumor) voxel. This prevents the model from only seeing background patches, which would teach it to predict “no tumor everywhere.” This is nnU-Net’s solution to the class imbalance problem at the data level (complementing the Dice loss solution at the objective level).
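A sketch of the oversampling idea (hypothetical function, not nnU-Net's sampler: it forces roughly a third of each batch's patch centres onto a foreground voxel and draws the rest uniformly):

```python
import numpy as np

def sample_patch_centers(label_volume, batch_size=2, fg_fraction=1/3, seed=0):
    """Pick patch centres; the first ~fg_fraction of them hit foreground."""
    rng = np.random.default_rng(seed)
    fg = np.argwhere(label_volume > 0)
    n_fg = max(1, round(batch_size * fg_fraction))
    centers = []
    for i in range(batch_size):
        if i < n_fg and len(fg) > 0:
            centers.append(tuple(fg[rng.integers(len(fg))]))  # forced foreground
        else:
            centers.append(tuple(rng.integers(s) for s in label_volume.shape))
    return centers

labels = np.zeros((64, 64, 64), dtype=int)
labels[30:34, 30:34, 30:34] = 1                    # a tiny "tumor"
centers = sample_patch_centers(labels, batch_size=3)
print(labels[centers[0]] > 0)  # True: first patch centre is inside the tumor
```

Without the forced samples, a 64-voxel tumor in a 64³ volume would appear in a vanishingly small fraction of uniformly drawn patches.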
For BraTS on a single NVIDIA A100 or V100: expect roughly 12–24 hours per fold, so 2.5–5 days for all 5 folds of one configuration. Training both 2D and 3D configs across all folds can take over a week on a single GPU. A study on dataset size requirements found that performance plateaus at roughly 80% of the available data, with the BraTS task reaching a Dice plateau around 0.79 with a 3D config.
Training processes patches. But inference must produce a prediction for the entire volume. Here’s exactly how nnU-Net does it:
The input image is cropped, resampled, and normalized using the exact same parameters determined during planning. Any deviation here silently breaks everything.
A patch-sized window slides across the full volume with 50% overlap (default). At each position, the model produces softmax probabilities for every voxel in the patch. The 50% overlap means most voxels are predicted by multiple patches.
Each patch prediction is multiplied by a Gaussian kernel — center voxels get full weight, edge voxels get lower weight. This prevents boundary artifacts where patches meet. The weighted predictions are summed across all overlapping patches and then normalized.
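The overlap-and-weight scheme is easiest to see in one dimension. A simplified 1-D sketch (nnU-Net does this in 3-D; the sigma_scale of 1/8 and the toy "model" are assumptions for illustration):

```python
import numpy as np

def gaussian_weight(patch_size, sigma_scale=1/8):
    """Importance map: full weight at the patch centre, less at the edges."""
    x = np.arange(patch_size)
    center = (patch_size - 1) / 2
    sigma = patch_size * sigma_scale
    return np.exp(-((x - center) ** 2) / (2 * sigma ** 2))

def sliding_window_1d(signal, model, patch=8):
    step = patch // 2                                  # 50% overlap
    probs = np.zeros_like(signal, dtype=float)
    weight = np.zeros_like(signal, dtype=float)
    g = gaussian_weight(patch)
    for start in range(0, len(signal) - patch + 1, step):
        pred = model(signal[start:start + patch])      # per-voxel probabilities
        probs[start:start + patch] += pred * g         # weighted accumulation
        weight[start:start + patch] += g
    return probs / np.maximum(weight, 1e-8)            # normalise the overlap

# Toy "model": probability is just a squashed copy of the input signal.
model = lambda x: 1 / (1 + np.exp(-x))
out = sliding_window_1d(np.linspace(-3, 3, 32), model)
print(out.shape)  # (32,) -- one probability per voxel
```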
The sliding window process is repeated for each of the 5 fold models. Softmax probabilities from all 5 models are averaged before taking the final argmax. This ensemble typically adds 1–2 Dice points over any single model.
nnU-Net automatically determines (during cross-validation) whether to apply connected component analysis — removing small isolated predictions that are likely false positives. The argmax converts probabilities to discrete labels, and the result is resampled back to the original image resolution and un-cropped to restore the original spatial dimensions.
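A sketch of connected-component filtering with `scipy.ndimage.label`. Note the hedge in the docstring: nnU-Net does not use a fixed voxel threshold like this one — it decides empirically on the cross-validation results whether removing components helps:

```python
import numpy as np
from scipy import ndimage

def remove_small_components(segmentation, min_voxels=10):
    """Drop connected foreground components smaller than min_voxels.
    (Illustrative only; nnU-Net's post-processing is chosen empirically.)"""
    out = segmentation.copy()
    components, n = ndimage.label(segmentation > 0)
    for comp_id in range(1, n + 1):
        mask = components == comp_id
        if mask.sum() < min_voxels:
            out[mask] = 0                 # likely a false positive
    return out

seg = np.zeros((20, 20, 20), dtype=int)
seg[5:12, 5:12, 5:12] = 1                 # large, plausible tumor
seg[0, 0, 0] = 1                          # isolated single-voxel prediction
cleaned = remove_small_components(seg)
print(cleaned[0, 0, 0], cleaned[8, 8, 8])  # 0 1 -- speck removed, tumor kept
```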
Inference speed is fast: a benchmarking study found nnU-Net achieved best-in-class segmentation in 1.456 seconds for midbrain structures. For BraTS-sized volumes with 5-fold ensembling, expect roughly 30–60 seconds per case on a modern GPU.
nnU-Net v2 is a significant rewrite that maintains the same philosophy while adding capabilities specifically relevant to BraTS-style challenges.
v2 adds an optional residual encoder, which introduces skip connections within each encoder block (in addition to the U-Net skip connections between encoder and decoder). This improves gradient flow in deeper networks and can improve performance on challenging tasks. Studies confirm that residual connections within encoder blocks enhance segmentation, particularly with limited training data.
BraTS uses hierarchical labels: enhancing tumor (ET) is a subset of tumor core (TC), which is a subset of whole tumor (WT). Standard multi-class segmentation treats these as independent classes, but they’re nested. v2 supports region-based training where the model predicts regions (WT, TC, ET) as separate binary tasks, naturally encoding the hierarchy. This is how top BraTS teams configure their nnU-Net submissions.
v2 uses DatasetXXX_Name instead of v1’s TaskXXX_Name. The dataset.json format is also updated, with more explicit channel and label definitions. If you’re following older tutorials, watch for this naming difference — it’s the most common source of “dataset not found” errors.
v2 is far more extensible than v1. Custom trainers, planners, and architectures can be plugged in without modifying core framework code. This makes it practical to use nnU-Net as a framework (modifying components) rather than just a tool (running as-is). The nnUnetFormer paper demonstrated this by fusing transformer modules into nnU-Net’s deeper layers, achieving 0.936/0.921/0.872 Dice for WT/TC/ET on BraTS 2021.
FednnU-Net (2025) adds federated learning to nnU-Net, allowing multiple institutions to collaboratively train a model without sharing patient data. It introduces Federated Fingerprint Extraction and Asymmetric Federated Averaging, demonstrating consistent performance across 18 institutions. This is the future of multi-institutional brain tumor segmentation research.
nnU-Net works remarkably well out of the box, but top challenge teams always customize. Here are the modifications that matter most for brain tumor segmentation:
Configure dataset.json to define regions instead of classes. For BraTS: Whole Tumor = labels {1, 2, 3}, Tumor Core = labels {1, 3}, Enhancing Tumor = label {3}. This aligns the training objective with BraTS’s evaluation metrics and typically improves Dice by 1–2 points on the nested sub-regions.
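Converting a BraTS label map into the three nested region masks looks like this (the label-to-region mapping follows the text above; exact label numbering has varied between BraTS editions):

```python
import numpy as np

def brats_regions(label_map):
    """Map BraTS labels to nested binary regions: WT = {1,2,3},
    TC = {1,3}, ET = {3}, stacked as three channels."""
    wt = np.isin(label_map, [1, 2, 3])   # whole tumor
    tc = np.isin(label_map, [1, 3])      # tumor core
    et = label_map == 3                  # enhancing tumor
    return np.stack([wt, tc, et]).astype(np.float32)

labels = np.array([[0, 1], [2, 3]])      # tiny 2D example for clarity
regions = brats_regions(labels)
print(regions.shape)                     # (3, 2, 2): WT, TC, ET channels
```

Each region then becomes an independent binary (sigmoid) target, so the model can never predict an enhancing-tumor voxel that lies outside the whole tumor by construction of the training signal.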
Create a custom trainer class that inherits from nnU-Net’s default trainer. Override the loss function (e.g., add focal loss or boundary loss), change the optimizer (switch to Adam), or modify the learning rate schedule. Challenge-winning modifications are often surprisingly small — a different loss weighting or an additional augmentation can be the difference.
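The inheritance pattern is simple to illustrate. This is a purely hypothetical sketch: the stub class below stands in for nnU-Net's real trainer (nnUNetTrainer in nnunetv2), whose actual method names and signatures differ and change between versions — consult the source before subclassing:

```python
class DefaultTrainer:
    """Stand-in for nnU-Net's trainer base class (hypothetical)."""
    initial_lr = 1e-2

    def build_loss(self):
        return "dice+ce"              # placeholder for the default compound loss

class BoundaryLossTrainer(DefaultTrainer):
    """Override only what changes; everything else is inherited."""
    initial_lr = 3e-3                 # hypothetical: a gentler learning rate

    def build_loss(self):
        # Hypothetical: augment the default loss with a boundary term.
        return super().build_loss() + "+boundary"

print(BoundaryLossTrainer().build_loss())  # dice+ce+boundary
```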
Real clinical data often has missing modalities (e.g., no T2 or no FLAIR). A multi-center study on 1,731 cases from 12 hospitals showed that sparsified training — randomly dropping input channels during training — allows nnU-Net to handle missing sequences at inference without significant performance loss. This is invaluable for clinical deployment.
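The channel-dropping idea can be sketched as follows (a hypothetical function; the cited study's exact scheme may differ — the drop probability and the keep-at-least-one rule are assumptions):

```python
import numpy as np

def drop_modalities(batch, p_drop=0.25, seed=None):
    """Randomly zero whole input channels (modalities) during training,
    always keeping at least one, so the model learns to cope with
    missing sequences at inference time."""
    rng = np.random.default_rng(seed)
    n_channels = batch.shape[1]          # batch shape: (B, C, X, Y, Z)
    keep = rng.random(n_channels) > p_drop
    if not keep.any():
        keep[rng.integers(n_channels)] = True
    out = batch.copy()
    out[:, ~keep] = 0.0
    return out

batch = np.ones((2, 4, 8, 8, 8))         # 4 modalities, e.g. T1/T1ce/T2/FLAIR
sparse = drop_modalities(batch, p_drop=0.5, seed=3)
print([bool(sparse[0, c].any()) for c in range(4)])  # which channels survived
```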
How does nnU-Net compare when tested head-to-head against other automated frameworks on identical benchmarks?
A 2025 study evaluated nnU-Net and Auto3DSeg on the AMOS abdominal segmentation challenge. nnU-Net: Dice 0.924. Auto3DSeg: Dice 0.902. The difference was statistically significant, and physicians qualitatively preferred nnU-Net outputs (P=0.0027). On breast MRI, the gap was much smaller (0.946 vs 0.940), suggesting Auto3DSeg is competitive for some tasks.
Another comparison tested nnU-Net against DeepMedic and NVIDIA-net on 1,251 BraTS cases plus 480 real-world clinical cases from 12 hospitals. nnU-Net achieved the highest Dice (0.86 internal, 0.93 external) and lowest Hausdorff distances for all tumor classes (P<0.001). DeepMedic and NVIDIA-net were competitive but consistently lower.
In the same AMOS study: nnU-Net 0.924 vs SwinUNETR 0.837. The transformer-based architecture lagged significantly behind, reinforcing the pattern from Week 4 — systematic optimization of the full pipeline beats architectural novelty.
Before moving on, revisit your own nnUNetPlans.json and cross-reference it with this week's material to understand every decision nnU-Net made about your data. Also verify that the three environment variables (nnUNet_raw, nnUNet_preprocessed, nnUNet_results) are correctly set — misconfigured paths are the #1 source of beginner errors.