Your baseline model is trained. Now the real work begins — squeezing out every Dice point through ensembling, test-time augmentation, advanced loss functions, error analysis, and the specific modifications that separate a solid submission from a challenge-winning one.
After training a baseline nnU-Net model (Week 5–6) and evaluating it (Week 7), you’ll have scores that are good but not competitive. The difference between a baseline Dice of 0.87 and a winning score of 0.93 comes from stacking small improvements systematically. The sections below break down where those points come from, based on published ablation studies and challenge reports.
Ensembling is the single most reliable way to improve segmentation performance. The core idea: multiple independent models make different errors, and averaging their predictions cancels out individual mistakes. Studies consistently show 2–7% Dice improvements and substantial HD95 reductions from ensembling.
nnU-Net’s 5-fold cross-validation produces 5 independently trained models. At inference, their softmax outputs are averaged before argmax. This is free — you already have the models. It typically adds 1–2 Dice points over any single fold and strongly reduces catastrophic failures. One study found ensembles eliminated outlier predictions in 68–100% of high-risk images.
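The fold-averaging step is simple enough to sketch directly. This is a minimal NumPy version, assuming each fold's softmax volume is already in memory as a `(C, X, Y, Z)` array (nnU-Net performs the equivalent internally when you predict with all folds):

```python
import numpy as np

def ensemble_softmax(fold_probs):
    """Average per-fold softmax volumes, then take the argmax.

    fold_probs: list of arrays shaped (C, X, Y, Z), one per trained fold.
    Averaging happens in probability space, BEFORE the argmax,
    so folds that disagree are resolved by their combined confidence.
    """
    mean_prob = np.mean(np.stack(fold_probs, axis=0), axis=0)
    return np.argmax(mean_prob, axis=0)
```

Note that averaging softmax maps (not hard labels) is what makes this work: a fold that is only weakly wrong gets outvoted by folds that are strongly right.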
Average predictions from different nnU-Net configurations (2D + 3D full-res). The 2D model may better capture sharp in-plane boundaries while the 3D model captures inter-slice continuity. nnU-Net automatically evaluates whether cross-configuration ensembling improves cross-validation scores.
The biggest gains come from ensembling different architectures that make different kinds of errors. Combining nnU-Net with a transformer-based model (SwinUNETR, nnUNetFormer) captures complementary features. The EnsembleUNets approach achieved Dice 0.93 on BraTS 2021, outperforming all individual models. One BraTS strategy used weighted ensembles of the top-3 models per region with customized loss functions, requiring 67% less memory and 92% less training time than training all architectures independently.
Simple averaging works surprisingly well, but learnable ensemble weights (optimized on the validation set) consistently outperform static averaging. You can optimize per-region weights: maybe the 3D model gets more weight for WT while the transformer model gets more weight for ET. A study found optimized weights yielded 2–7% DSC improvements and up to 49% reduction in average surface distance.
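As a sketch of the idea, learnable weights can be as simple as a one-dimensional grid search on the validation set. This toy version (function names and the 21-point grid are illustrative, not from any particular paper) fits a single convex weight between two models' foreground probability maps; per-region weights are the same loop run once per region:

```python
import numpy as np

def soft_dice(prob, target, eps=1e-6):
    """Soft Dice between a probability map and a binary target."""
    inter = (prob * target).sum()
    return (2 * inter + eps) / (prob.sum() + target.sum() + eps)

def fit_ensemble_weight(probs_a, probs_b, targets, grid=np.linspace(0, 1, 21)):
    """Pick the convex weight w maximizing mean soft Dice of
    w*probs_a + (1-w)*probs_b over a held-out validation set."""
    best_w, best_d = 0.5, -1.0
    for w in grid:
        d = np.mean([soft_dice(w * a + (1 - w) * b, t)
                     for a, b, t in zip(probs_a, probs_b, targets)])
        if d > best_d:
            best_w, best_d = w, d
    return best_w
```

The key detail is fitting on validation data, never training data: weights fit on training folds just learn which model overfits harder.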
TTA is like asking your model to look at the same scan from multiple angles and averaging its opinions. During inference, apply augmentations (flipping, rotation) to the input, run the model on each augmented version, reverse the augmentations on the predictions, and average the results.
Flipping along each axis is the most common and cheapest TTA. For a 3D volume, flipping along x, y, z, and all combinations gives 8 versions (including the original). This alone provides 0.5–1.5 Dice points on typical tasks. Rotation (small angles) and scaling (0.95–1.05×) can add marginal improvements but cost more compute. A study on 9 medical segmentation datasets showed TTA provided 0.1–2.3% Dice improvement and 1.1–29.0% improvement in error estimation.
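The 8-way flipping scheme above is mechanical enough to write out. A minimal sketch, assuming `predict` is any callable mapping an `(X, Y, Z)` volume to a `(C, X, Y, Z)` softmax (nnU-Net's built-in mirroring TTA does the equivalent):

```python
import itertools
import numpy as np

def flip_tta(predict, volume):
    """8-way mirror TTA: flip the input along every subset of the
    three spatial axes, predict, un-flip the prediction, average.
    predict: maps an (X, Y, Z) volume to a (C, X, Y, Z) softmax."""
    acc = None
    for axes in itertools.chain.from_iterable(
            itertools.combinations((0, 1, 2), r) for r in range(4)):
        flipped = np.flip(volume, axes) if axes else volume
        prob = predict(flipped)
        if axes:
            # spatial axes shift by 1 because of the leading class dim
            prob = np.flip(prob, tuple(a + 1 for a in axes))
        acc = prob if acc is None else acc + prob
    return acc / 8.0
```

Reversing the flip on the prediction (not just the input) is the step people most often get wrong: without it you average misaligned probability maps.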
Each TTA variant requires one additional forward pass. With 8-way flipping TTA and 5-fold ensembling, you’re running 40 forward passes per image — inference takes 40× longer. For BraTS volumes, this means ~10–20 minutes per case instead of ~30 seconds. Acceptable for challenge submissions, but too slow for real-time clinical use. Jointly optimizing training-time and test-time augmentation can improve efficiency.
The variance across TTA predictions provides a voxel-wise uncertainty estimate. High variance = the model is unsure. This is clinically valuable: you can flag uncertain regions for radiologist review. Studies show TTA-based aleatoric uncertainty outperforms Monte Carlo dropout for uncertainty estimation and reduces overconfident predictions.
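The uncertainty map falls out of the TTA predictions you already computed, essentially for free. A minimal sketch, assuming you kept the per-augmentation foreground probabilities instead of averaging them away:

```python
import numpy as np

def tta_uncertainty(tta_probs):
    """Voxel-wise uncertainty from TTA: the standard deviation of
    the foreground probability across augmented forward passes.
    High values mark regions worth flagging for expert review.

    tta_probs: array (n_augmentations, X, Y, Z) of foreground probs."""
    return np.std(np.asarray(tta_probs), axis=0)
```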
Before trying to improve your model, you need to understand where and how it fails. Random tweaking is less effective than targeted fixes. Load predictions alongside ground truth in 3D Slicer and systematically look for patterns.
Compute Dice and HD95 separately for WT, TC, and ET. Which region is weakest? For most models, ET is the bottleneck — it’s the smallest region with the most irregular boundaries. If your ET Dice is 0.75 but WT is 0.91, focus improvements on ET.
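Computing the three composite regions from a label map is a common stumbling point, since WT, TC, and ET are nested groupings rather than raw labels. A sketch, assuming the post-2021 BraTS convention (1 = necrotic core, 2 = edema, 3 = enhancing tumor; older data uses 4 for ET):

```python
import numpy as np

def dice(a, b, eps=1e-6):
    """Binary Dice with a small epsilon for empty-region stability."""
    a, b = a.astype(bool), b.astype(bool)
    return (2 * np.logical_and(a, b).sum() + eps) / (a.sum() + b.sum() + eps)

# Composite BraTS regions, assuming labels 1/2/3 = NCR/ED/ET
REGIONS = {"WT": (1, 2, 3), "TC": (1, 3), "ET": (3,)}

def region_dice(pred, gt):
    """Per-region Dice (WT, TC, ET) from two integer label maps."""
    return {name: dice(np.isin(pred, labels), np.isin(gt, labels))
            for name, labels in REGIONS.items()}
```

Run this over your validation set and average per region; the weakest column tells you where to spend your effort.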
Split cases by tumor size (small / medium / large). Small enhancing tumors are disproportionately hard — a few missed voxels in a 200-voxel ET can drop Dice from 0.85 to 0.40. If small tumors are your main failure mode, consider a region-related focal loss that upweights hard voxels.
Sort patients by Dice score and inspect the worst 10–20 cases in 3D Slicer. Common patterns: false positives near surgical cavities, under-segmentation of diffuse edema, missed satellite lesions, and enhancing tumor confused with blood vessels. Each pattern suggests a different fix.
If your data comes from multiple centers, compute per-center metrics. A model trained mainly on 3T MRI may fail on 1.5T data. One study showed performance dropped 8–17% Dice when training and testing institutions differed. This finding motivates domain adaptation or institution-aware augmentation.
The default Dice + CE loss is robust, but specialized losses can target specific weaknesses revealed by error analysis. The key insight from a comprehensive analysis of 20 loss functions: compound losses are always more robust than any single loss.
Standard Dice loss measures volumetric overlap — it doesn’t directly penalize boundary errors. Boundary loss formulates the optimization as a distance metric on the contour space, using the distance transform of the ground truth. It’s particularly effective for highly imbalanced structures (small ET regions). A boundary-sensitive variant improved DSC by 4.17% and HD95 by 73% for challenging boundary regions. Important: boundary loss should be added gradually during training (loss scheduling), not from the start — the model needs Dice/CE to establish rough segmentation first.
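The core ingredient is the signed distance map of the ground truth; the loss itself is then just a weighted mean. A NumPy sketch of the Kervadec-style formulation (SciPy's `distance_transform_edt` assumed available; in training you would precompute the map per case and combine this term with Dice/CE under a ramped weight):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(gt_mask):
    """Level-set map of the GT mask: negative inside the object,
    positive outside, zero on the boundary."""
    gt = gt_mask.astype(bool)
    if not gt.any() or gt.all():
        return np.zeros(gt.shape)
    return distance_transform_edt(~gt) - distance_transform_edt(gt)

def boundary_loss(prob_fg, sdm):
    """Mean foreground probability weighted by the signed distance
    map: minimized when probability mass sits inside the object,
    penalized in proportion to how far outside it strays."""
    return float(np.mean(prob_fg * sdm))
```

For the scheduling mentioned above, a common recipe is `total = (1 - alpha) * dice_ce + alpha * boundary` with `alpha` ramped from 0 toward ~0.5 over training.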
Directly optimizes the metric you’re evaluated on. Three approaches exist: distance-transform based, morphological erosion based, and convolution-kernel based. Training with HD loss achieved 18–45% reduction in Hausdorff Distance without degrading Dice scores. This is particularly useful when your Dice is good but HD95 is poor — a common pattern where the model makes occasional outlier predictions far from the true boundary.
Standard loss treats all voxels equally, but misclassifying voxels near the boundary of a tiny enhancing tumor matters far more than misclassifying a voxel deep inside a large edema region. Region-related focal loss dynamically upweights hard-to-classify voxels with region-specific focus, improving ET Dice by 3% and average Dice by 1% on BraTS 2020. This is one of the highest-impact single changes for brain tumor segmentation.
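The upweighting mechanism is the standard focal term; a minimal NumPy sketch of the binary case (the region-specific variant from the paper additionally tunes `gamma` per tumor subregion, which is not shown here):

```python
import numpy as np

def focal_loss(prob_fg, target, gamma=2.0, eps=1e-6):
    """Voxel-wise focal loss: (1 - p_t)^gamma * -log(p_t).

    p_t is the probability assigned to the true class, so confident
    correct voxels (p_t near 1) contribute almost nothing and hard
    voxels dominate the gradient. gamma=2 is the usual default."""
    pt = np.where(target.astype(bool), prob_fg, 1.0 - prob_fg)
    pt = np.clip(pt, eps, 1.0)
    return float(np.mean((1.0 - pt) ** gamma * -np.log(pt)))
```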
Dice and HD don’t capture whether your prediction has the right topology — a prediction with a hole in the middle of a tumor might have high Dice but be clinically wrong. Persistent homology loss (using Betti numbers) can enforce topological constraints. A fast Euler characteristic method provides efficient topological optimization for 3D data, with significant improvements in structural correctness. Most useful for tasks where connectivity is critical (vascular segmentation, cortical parcellation).
Instead of training from random initialization, you can start from weights pre-trained on a related task. This is especially powerful when your target dataset is small.
A key finding: pre-training on medical images outperforms ImageNet pre-training, even when the medical dataset is 10× smaller. The closer the pre-training domain and task are to your target, the better the transfer. For brain tumor segmentation, pre-training on other brain MRI tasks (skull stripping, brain parcellation) provides better features than pre-training on natural photographs. RadiologyNET (1.9M medical images) showed particular advantages in resource-limited settings.
For U-Net architectures, a counterintuitive finding: fine-tuning shallow layers (early encoder) often works better than fine-tuning deep layers, because shallow layers learn low-level features critical for segmentation that differ between domains. The recommended approach: fine-tune the entire network with a lower learning rate (1/10th to 1/100th of the from-scratch rate), with optional warm-up over the first 5–10 epochs. Fine-tuned models consistently match or outperform from-scratch training and are more robust to small training set sizes.
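The learning-rate recipe above can be captured in a few lines. A sketch with illustrative numbers (the specific base rate, scale, and warm-up length are placeholders to tune, not values from any paper):

```python
def finetune_lr(step, base_lr=1e-2, scale=0.1, warmup_steps=500):
    """Fine-tuning schedule: a fraction (here 1/10th) of the
    from-scratch rate, with linear warm-up over the first steps
    so pre-trained weights are not destroyed early on."""
    lr = base_lr * scale
    if step < warmup_steps:
        lr *= (step + 1) / warmup_steps
    return lr
```

Plug the returned value into your optimizer each step; in PyTorch this is typically done via `LambdaLR` or by setting `param_group["lr"]` directly.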
With foundation models like SAM, you can fine-tune a tiny fraction of parameters while freezing the bulk of the pre-trained model. PEFT methods achieve competitive performance with as few as 1–5 labeled samples. SemiSAM+ demonstrated full accuracy using only 22–49% of available labeled data through specialist-generalist collaborative learning. This is the future direction for low-data scenarios.
nnU-Net’s default augmentation (rotation, scaling, elastic deformation, gamma, noise, mirroring) is strong, but specialized augmentations can target specific weaknesses.
Instead of mixing random image regions, CarveMix carves out regions based on lesion location and geometry, then transplants them into other images. This creates realistic training samples with novel lesion-context combinations, improving brain lesion segmentation accuracy. It’s particularly valuable for rare tumor appearances.
AdvChain generates randomly chained geometric and photometric transformations that resemble realistic but challenging imaging variations. It can alleviate the need for labeled data while improving model generalization, applicable for both low-shot supervised and semi-supervised learning. Demonstrated for cardiac and prostate MRI segmentation.
Multi-Channel Fusion Diffusion Models can generate synthetic brain tumor MRI with all four modalities. Adding synthetic data improved classification accuracy by ~3% and segmentation Dice by 1.5–2.5%. CycleGAN-based domain translation also improved out-of-distribution performance dramatically (Dice from 0.09 to 0.66 for kidneys going from contrast to non-contrast CT). This is an emerging area with high potential.
Raw model predictions often contain artifacts that simple post-processing can fix. These are “free” improvements that don’t require retraining.
BraTS regions are nested: ET ⊂ TC ⊂ WT. But the model can produce predictions that violate this hierarchy (e.g., ET voxels outside TC). Post-processing enforces the hierarchy: any ET voxel must also be TC, any TC voxel must also be WT. If you’re using nnU-Net v2’s region-based training, this is handled automatically. Otherwise, apply it as a post-processing step.
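If you do need the post-processing step yourself, enforcing the nesting is a pair of logical ORs. A sketch operating on three binary region masks (propagating each child region into its parent, per the rule above):

```python
import numpy as np

def enforce_hierarchy(et, tc, wt):
    """Force ET ⊆ TC ⊆ WT on binary masks: any ET voxel becomes
    TC as well, and any TC voxel becomes WT as well."""
    tc = np.logical_or(tc, et)
    wt = np.logical_or(wt, tc)
    return et, tc, wt
```

The opposite repair (suppressing ET voxels that fall outside TC) is also valid; which direction helps more is an empirical question to settle on cross-validation.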
Small isolated predictions (a few voxels of “tumor” floating in normal brain) are almost always false positives. Remove connected components below a size threshold. nnU-Net automatically determines whether this helps based on cross-validation. For brain tumors, keeping only the largest connected component per class is often effective.
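A minimal implementation of the size-threshold variant, using SciPy's connected-component labelling (the threshold is a hyperparameter to tune on cross-validation, as nnU-Net does automatically):

```python
import numpy as np
from scipy.ndimage import label

def remove_small_components(mask, min_voxels):
    """Drop connected components smaller than min_voxels from a
    binary mask; tiny floating islands are almost always false
    positives. Returns a boolean mask."""
    labeled, n_components = label(mask)
    if n_components == 0:
        return mask.astype(bool)
    sizes = np.bincount(labeled.ravel())
    keep = sizes >= min_voxels
    keep[0] = False  # background stays background
    return keep[labeled]
```

Keeping only the largest component is the special case where you sort `sizes[1:]` and keep the single biggest label instead of thresholding.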
CRFs refine boundaries by incorporating image intensity information into the segmentation. Posterior-CRF (an end-to-end method using CNN features) outperformed other CNN-CRF approaches for multiple anatomies. For brain tumors, multi-level CRFs have been used to detect small lesions with 90% detection rate and very few false positives. CRFs add inference time but can meaningfully improve boundary quality.
An emerging approach: adapting the model’s parameters to each test case at inference time using self-supervised objectives (no labels needed). Information-geometric approaches improved generalization for brain tumor segmentation across domains. Uncertainty-guided test-time optimization dynamically adjusts per-patient, improving Dice, HD95, and mean surface distance across multiple datasets.
Labeled medical data is scarce and expensive. But unlabeled data is abundant — every MRI scan that doesn’t have expert segmentation is potential training data. Semi-supervised methods leverage this.
Train a model on your labeled data, use it to predict segmentations on unlabeled data (pseudo-labels), then retrain on both real labels and pseudo-labels. The key challenge: noisy pseudo-labels can hurt more than they help. Self-aware confidence estimation for selecting only reliable pseudo-labels improved performance on cardiac, pancreas, and nerve segmentation tasks. This simple loop is surprisingly effective when done carefully.
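The "done carefully" part usually means filtering by confidence. A minimal sketch of one common filter, masking out low-confidence voxels so the loss ignores them (the 0.9 threshold is illustrative; the cited work estimates confidence in a more principled, self-aware way):

```python
import numpy as np

def confident_pseudo_labels(prob, threshold=0.9):
    """Pseudo-label only voxels where the max softmax probability
    exceeds a threshold; the rest get the ignore index -1 so the
    retraining loss skips them.

    prob: softmax array (C, ...); returns an integer label map."""
    conf = prob.max(axis=0)
    labels = prob.argmax(axis=0)
    return np.where(conf >= threshold, labels, -1)
```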
Encourage the model to produce the same prediction under different perturbations of the same input. If the model predicts differently when you flip or add noise to the image, it’s not confident, and this inconsistency provides a training signal without labels. The Mean Teacher approach and its variants have demonstrated effectiveness on skin lesion, optic disc, and liver segmentation.
If you have a limited annotation budget, which images should you label? Active learning selects the most informative samples. One framework achieved full accuracy using only 22.7–48.9% of available data by selecting samples that maximally reduce model uncertainty. Fisher information-based selection outperformed competing methods after labeling less than 0.25% of the target data. This is directly relevant if you have clinical data without annotations.
A 5-fold ensemble with TTA might be great for a challenge submission, but it’s too slow for clinical deployment. Knowledge distillation trains a small, fast “student” model to mimic the large, slow “teacher” ensemble.
Train a large ensemble (the teacher) to generate soft labels (probability maps). Then train a smaller network (the student) on these soft labels. The student learns the teacher’s “knowledge” without needing the teacher’s size. One framework achieved up to 32.6% Dice improvement in the student network by transferring semantic region information. LCOV-Net outperformed nnU-Net with only one-fifth the parameters.
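The core of soft-label training is a cross-entropy between the teacher's probability maps and the student's softmax. A minimal NumPy sketch (real distillation setups additionally use a temperature on the logits and mix in a supervised Dice+CE term on the hard labels):

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the class axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_probs, eps=1e-8):
    """Voxel-wise cross-entropy of the student's softmax against
    the teacher ensemble's soft labels: minimized when the student
    reproduces the teacher's probability maps, not just its argmax."""
    s = softmax(student_logits, axis=0)
    return float(-np.mean(np.sum(teacher_probs * np.log(s + eps), axis=0)))
```

The soft labels are what carry the "dark knowledge": the teacher's near-boundary hedging (0.6 vs 0.4) tells the student which voxels are genuinely ambiguous, which a hard 0/1 label cannot.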
Growing Teacher Assistant Networks enabled a student with only 8% of the teacher’s parameters (100k total) to achieve comparable performance, with inference in just 13ms. With only 2% of parameters (30k), the model still maintained 95%+ of teacher performance. Graph flow distillation improved ET and TC Dice by 5% and 3% respectively in the student. This makes real-time clinical deployment feasible.