Week 08 — Q3

Fine-Tuning & Improving Your Model

Your baseline model is trained. Now the real work begins — squeezing out every Dice point through ensembling, test-time augmentation, advanced loss functions, error analysis, and the specific modifications that separate a solid submission from a challenge-winning one.

The Improvement Playbook

After training a baseline nnU-Net model (Week 5–6) and evaluating it (Week 7), you’ll have scores that are good but not competitive. The difference between a baseline Dice of 0.87 and a winning score of 0.93 comes from stacking small improvements systematically. Here’s the rough breakdown of where those points come from, based on published ablation studies and challenge reports:

+1–2 Dice points from 5-fold ensembling (built into nnU-Net)
+0.5–2 Dice points from test-time augmentation
+1–3 Dice points from advanced losses, post-processing, & architecture changes
💡
The golden rule of model improvement: Change one thing at a time. Record the validation Dice for each sub-region (WT, TC, ET) after every modification. Keep changes that improve performance, revert those that don’t. Top BraTS teams follow this disciplined ablation approach — not heroic architectural overhauls.

Model Ensembling

Ensembling is the single most reliable way to improve segmentation performance. The core idea: multiple independent models make different errors, and averaging their predictions cancels out individual mistakes. Studies consistently show 2–7% Dice improvements and substantial HD95 reductions from ensembling.

Fold Ensembling (Already Built In)

nnU-Net’s 5-fold cross-validation produces 5 independently trained models. At inference, their softmax outputs are averaged before argmax. This is free — you already have the models. It typically adds 1–2 Dice points over any single fold and strongly reduces catastrophic failures. One study found ensembles eliminated outlier predictions in 68–100% of high-risk images.
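As a minimal sketch, averaging fold softmaxes before the argmax looks like this (`ensemble_argmax` is an illustrative helper, not nnU-Net's actual API; it assumes each fold's output is a `(C, …)` probability array with the class channel first):

```python
import numpy as np

def ensemble_argmax(fold_softmaxes):
    """Average per-fold softmax maps, then take the argmax over classes."""
    mean_prob = np.mean(fold_softmaxes, axis=0)  # (n_folds, C, ...) -> (C, ...)
    return np.argmax(mean_prob, axis=0)          # (C, ...) -> (...) label map
```

The key detail is that averaging happens in probability space, before the argmax: averaging hard label maps would discard each fold's confidence information.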

Cross-Configuration Ensembling

Average predictions from different nnU-Net configurations (2D + 3D full-res). The 2D model may better capture sharp in-plane boundaries while the 3D model captures inter-slice continuity. nnU-Net automatically evaluates whether cross-configuration ensembling improves cross-validation scores.

Cross-Architecture Ensembling

The biggest gains come from ensembling different architectures that make different kinds of errors. Combining nnU-Net with a transformer-based model (SwinUNETR, nnUNetFormer) captures complementary features. The EnsembleUNets approach achieved Dice 0.93 on BraTS 2021, outperforming all individual models. One BraTS strategy used weighted ensembles of the top-3 models per region with customized loss functions, requiring 67% less memory and 92% less training time than training all architectures independently.

Weighted vs Simple Averaging

Simple averaging works surprisingly well, but learnable ensemble weights (optimized on the validation set) consistently outperform static averaging. You can optimize per-region weights: maybe the 3D model gets more weight for WT while the transformer model gets more weight for ET. A study found optimized weights yielded 2–7% DSC improvements and up to 49% reduction in average surface distance.
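A minimal sketch of per-region weight optimization: grid-search the mixing weight between two models' foreground probabilities against validation ground truth (the function name, the grid resolution, and the 0.5 binarization threshold are all illustrative choices, not from any specific paper):

```python
import numpy as np

def best_region_weight(probs_a, probs_b, gt_mask, steps=11):
    """Grid-search the mixing weight w maximizing validation Dice for one region."""
    def dice(pred, gt):
        denom = pred.sum() + gt.sum()
        return 2.0 * np.logical_and(pred, gt).sum() / denom if denom else 1.0

    best_w, best_d = 0.0, -1.0
    for w in np.linspace(0.0, 1.0, steps):
        mixed = (w * probs_a + (1 - w) * probs_b) > 0.5  # binarize the blend
        d = dice(mixed, gt_mask)
        if d > best_d:
            best_w, best_d = float(w), d
    return best_w, best_d
```

Running this once per region (WT, TC, ET) gives the per-region weights described above.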

📚
Practical tip: Start by training nnU-Net (3d_fullres, all 5 folds) as your backbone. Then train one alternative — either nnU-Net with the residual encoder, or a SegResNet/SwinUNETR via MONAI. Average the two. This two-model ensemble often captures 80% of the gain you’d get from a five-model ensemble, at a fraction of the compute cost.

Test-Time Augmentation (TTA)

TTA is like asking your model to look at the same scan from multiple angles and averaging its opinions. During inference, apply augmentations (flipping, rotation) to the input, run the model on each augmented version, reverse the augmentations on the predictions, and average the results.

What to Augment at Test Time

Flipping along each axis is the most common and cheapest TTA. For a 3D volume, flipping along x, y, z, and all combinations gives 8 versions (including the original). This alone provides 0.5–1.5 Dice points on typical tasks. Rotation (small angles) and scaling (0.95–1.05×) can add marginal improvements but cost more compute. A study on 9 medical segmentation datasets showed TTA provided 0.1–2.3% Dice improvement and 1.1–29.0% improvement in error estimation.
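A minimal sketch of 8-way flip TTA, assuming `model` maps a 3D volume to a `(C, X, Y, Z)` softmax array (the function name and interface are hypothetical, not nnU-Net's API):

```python
import itertools
import numpy as np

def tta_flip_predict(model, volume):
    """Predict on every axis-flip combination of a 3D volume,
    undo each flip on the prediction, and average the softmax maps."""
    preds = []
    axis_sets = itertools.chain.from_iterable(
        itertools.combinations((0, 1, 2), r) for r in range(4))  # 8 subsets
    for axes in axis_sets:
        flipped = np.flip(volume, axis=axes) if axes else volume
        prob = model(flipped)                    # (C, X, Y, Z)
        spatial = tuple(a + 1 for a in axes)     # shift past the channel axis
        preds.append(np.flip(prob, axis=spatial) if axes else prob)
    return np.mean(preds, axis=0)
```

The empty subset is the original volume, so the 8 predictions include the un-augmented one.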

Computational Cost

Each TTA variant requires one additional forward pass. With 8-way flipping TTA and 5-fold ensembling, you're running 40 forward passes per image, roughly 40× the cost of a single model without TTA. For BraTS volumes, this means ~10–20 minutes per case instead of ~30 seconds. Acceptable for challenge submissions, but too slow for real-time clinical use. Jointly optimizing training-time and test-time augmentation can improve efficiency.

Free Bonus: Uncertainty Estimation

The variance across TTA predictions provides a voxel-wise uncertainty estimate. High variance = the model is unsure. This is clinically valuable: you can flag uncertain regions for radiologist review. Studies show TTA-based aleatoric uncertainty outperforms Monte Carlo dropout for uncertainty estimation and reduces overconfident predictions.
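Given the per-run foreground-probability maps collected during TTA, the uncertainty map is just their voxel-wise variance (a sketch; the helper name is illustrative):

```python
import numpy as np

def tta_uncertainty(tta_probs):
    """Voxel-wise variance of foreground probability across TTA runs.
    High-variance voxels are candidates for radiologist review."""
    stacked = np.stack(tta_probs)  # (n_runs, X, Y, Z)
    return stacked.var(axis=0)
```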

Error Analysis: Where Is Your Model Failing?

Before trying to improve your model, you need to understand where and how it fails. Random tweaking is less effective than targeted fixes. Load predictions alongside ground truth in 3D Slicer and systematically look for patterns.

STEP 1
Per-Region Breakdown

Compute Dice and HD95 separately for WT, TC, and ET. Which region is weakest? For most models, ET is the bottleneck — it’s the smallest region with the most irregular boundaries. If your ET Dice is 0.75 but WT is 0.91, focus improvements on ET.
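The nested-region masks can be derived from the label map before computing Dice. A sketch assuming the classic BraTS label convention (1 = necrotic core, 2 = edema, 4 = enhancing tumor; check your dataset's convention before reusing it):

```python
import numpy as np

# Assumed BraTS labels: 1 = necrotic core, 2 = edema, 4 = enhancing tumor.
REGIONS = {"WT": (1, 2, 4), "TC": (1, 4), "ET": (4,)}

def region_dice(pred, gt, labels):
    """Dice over the binary mask formed by the given label set."""
    p = np.isin(pred, labels)
    g = np.isin(gt, labels)
    denom = p.sum() + g.sum()
    return 2.0 * np.logical_and(p, g).sum() / denom if denom else 1.0

def per_region_dice(pred, gt):
    return {name: region_dice(pred, gt, labs) for name, labs in REGIONS.items()}
```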

STEP 2
Size-Stratified Analysis

Split cases by tumor size (small / medium / large). Small enhancing tumors are disproportionately hard — a few missed voxels in a 200-voxel ET can drop Dice from 0.85 to 0.40. If small tumors are your main failure mode, consider a region-related focal loss that upweights hard voxels.

STEP 3
Failure Case Visualization

Sort patients by Dice score and inspect the worst 10–20 cases in 3D Slicer. Common patterns: false positives near surgical cavities, under-segmentation of diffuse edema, missed satellite lesions, and enhancing tumor confused with blood vessels. Each pattern suggests a different fix.

STEP 4
Institutional Bias Check

If your data comes from multiple centers, compute per-center metrics. A model trained mainly on 3T MRI may fail on 1.5T data. One study showed performance dropped 8–17% Dice when training and testing institutions differed. This finding motivates domain adaptation or institution-aware augmentation.

💡
Automated error detection: SegQC is a framework that computes segmentation error probabilities per voxel and identifies likely failures in individual slices, achieving 74–77% recall for error detection. Integrating this into your evaluation pipeline can scale error analysis beyond manual inspection.

Advanced Loss Functions

The default Dice + CE loss is robust, but specialized losses can target specific weaknesses revealed by error analysis. The key insight from a comprehensive analysis of 20 loss functions: compound losses are always more robust than any single loss.

Boundary Loss (for Improving HD95)

Standard Dice loss measures volumetric overlap — it doesn’t directly penalize boundary errors. Boundary loss formulates the optimization as a distance metric on the contour space, using the distance transform of the ground truth. It’s particularly effective for highly imbalanced structures (small ET regions). A boundary-sensitive variant improved DSC by 4.17% and HD95 by 73% for challenging boundary regions. Important: boundary loss should be added gradually during training (loss scheduling), not from the start — the model needs Dice/CE to establish rough segmentation first.
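A sketch of the distance-transform formulation: weight the predicted foreground softmax by the signed distance map of the ground truth and average (lower is better, since the map is negative inside the object). The brute-force `signed_distance` here is for illustration on tiny arrays only; real pipelines use `scipy.ndimage.distance_transform_edt`:

```python
import numpy as np

def signed_distance(gt):
    """Brute-force signed distance map for a small boolean mask:
    negative inside the object, positive outside (illustration only)."""
    gt = np.asarray(gt, dtype=bool)
    fg = np.argwhere(gt)
    bg = np.argwhere(~gt)
    out = np.zeros(gt.shape, dtype=float)
    for idx in np.ndindex(gt.shape):
        targets = bg if gt[idx] else fg  # nearest voxel of the other class
        d = np.min(np.linalg.norm(targets - np.array(idx), axis=1))
        out[idx] = -d if gt[idx] else d
    return out

def boundary_loss(softmax_fg, gt):
    """Boundary loss: foreground probability weighted by the signed distance map."""
    return float(np.mean(softmax_fg * signed_distance(gt)))
```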

Hausdorff Distance Loss

Directly optimizes the metric you’re evaluated on. Three approaches exist: distance-transform based, morphological erosion based, and convolution-kernel based. Training with HD loss achieved 18–45% reduction in Hausdorff Distance without degrading Dice scores. This is particularly useful when your Dice is good but HD95 is poor — a common pattern where the model makes occasional outlier predictions far from the true boundary.

Region-Related Focal Loss (for Small ET)

Standard loss treats all voxels equally, but misclassifying voxels near the boundary of a tiny enhancing tumor matters far more than misclassifying a voxel deep inside a large edema region. Region-related focal loss dynamically upweights hard-to-classify voxels with region-specific focus, improving ET Dice by 3% and average Dice by 1% on BraTS 2020. This is one of the highest-impact single changes for brain tumor segmentation.
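The focal modulation itself is simple: scale each voxel's cross-entropy by `(1 - p)^gamma`, so confidently correct voxels nearly vanish from the loss while hard voxels dominate. This is a generic focal-loss sketch, not the exact region-related formulation from the paper:

```python
import numpy as np

def focal_term(p_true, gamma=2.0, eps=1e-7):
    """Per-voxel focal loss given the predicted probability of the true class:
    (1 - p)^gamma * -log(p). Easy voxels (p near 1) contribute almost nothing."""
    p = np.clip(p_true, eps, 1.0 - eps)
    return ((1.0 - p) ** gamma) * -np.log(p)
```

The region-related variant additionally scales the weighting by which sub-region a voxel belongs to, which is how the ET-specific gains are obtained.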

Topological Losses (for Preserving Connectivity)

Dice and HD don’t capture whether your prediction has the right topology — a prediction with a hole in the middle of a tumor might have high Dice but be clinically wrong. Persistent homology loss (using Betti numbers) can enforce topological constraints. A fast Euler characteristic method provides efficient topological optimization for 3D data, with significant improvements in structural correctness. Most useful for tasks where connectivity is critical (vascular segmentation, cortical parcellation).

⚠️
Loss scheduling matters: Don't add boundary loss or HD loss from epoch 1. The model needs to learn basic segmentation first (epochs 1–300 with Dice+CE) before you introduce the refinement loss (epochs 300–1000). One practical approach: start with 100% Dice+CE, then linearly ramp the boundary-loss weight from 0% to 50% over epochs 300–600. Whether fixed or adaptive weighting works better can vary by architecture.
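The ramp described in this tip can be sketched as a simple weight schedule (the epoch boundaries and 50% cap are the illustrative values from the tip, not universal constants):

```python
def boundary_weight(epoch, start=300, end=600, max_w=0.5):
    """Linear ramp for the boundary-loss weight:
    0 before `start`, max_w after `end`, linear in between."""
    if epoch <= start:
        return 0.0
    if epoch >= end:
        return max_w
    return max_w * (epoch - start) / (end - start)
```

At each training step the total loss would then be `dice_ce + boundary_weight(epoch) * boundary`.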

Transfer Learning & Fine-Tuning

Instead of training from random initialization, you can start from weights pre-trained on a related task. This is especially powerful when your target dataset is small.

Domain-Specific Pre-training Beats ImageNet

A key finding: pre-training on medical images outperforms ImageNet pre-training, even when the medical dataset is 10× smaller. The closer the pre-training domain and task are to your target, the better the transfer. For brain tumor segmentation, pre-training on other brain MRI tasks (skull stripping, brain parcellation) provides better features than pre-training on natural photographs. RadiologyNET (1.9M medical images) showed particular advantages in resource-limited settings.

Fine-Tuning Strategy

For U-Net architectures, a counterintuitive finding: fine-tuning shallow layers (early encoder) often works better than fine-tuning deep layers, because shallow layers learn low-level features critical for segmentation that differ between domains. The recommended approach: fine-tune the entire network with a lower learning rate (1/10th to 1/100th of the from-scratch rate), with optional warm-up over the first 5–10 epochs. Fine-tuned models consistently match or outperform from-scratch training and are more robust to small training set sizes.
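The learning-rate recipe can be sketched as a small schedule function (the 1/10th factor and 5-epoch warm-up are the illustrative values suggested above):

```python
def finetune_lr(epoch, base_lr=1e-2, factor=0.1, warmup_epochs=5):
    """Fine-tuning LR: a fraction of the from-scratch rate,
    linearly warmed up over the first few epochs."""
    target = base_lr * factor
    if epoch < warmup_epochs:
        return target * (epoch + 1) / warmup_epochs
    return target
```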

Parameter-Efficient Fine-Tuning (PEFT)

With foundation models like SAM, you can fine-tune a tiny fraction of parameters while freezing the bulk of the pre-trained model. PEFT methods achieve competitive performance with as few as 1–5 labeled samples. SemiSAM+ demonstrated full accuracy using only 22–49% of available labeled data through specialist-generalist collaborative learning. This is the future direction for low-data scenarios.

Advanced Data Augmentation

nnU-Net’s default augmentation (rotation, scaling, elastic deformation, gamma, noise, mirroring) is strong, but specialized augmentations can target specific weaknesses.

CarveMix: Lesion-Aware Mixing

Instead of mixing random image regions, CarveMix carves out regions based on lesion location and geometry, then transplants them into other images. This creates realistic training samples with novel lesion-context combinations, improving brain lesion segmentation accuracy. It’s particularly valuable for rare tumor appearances.

Adversarial Data Augmentation

AdvChain generates randomly chained geometric and photometric transformations that resemble realistic but challenging imaging variations. It can alleviate the need for labeled data while improving model generalization, applicable for both low-shot supervised and semi-supervised learning. Demonstrated for cardiac and prostate MRI segmentation.

Synthetic Data via Diffusion Models

Multi-Channel Fusion Diffusion Models can generate synthetic brain tumor MRI with all four modalities. Adding synthetic data improved classification accuracy by ~3% and segmentation Dice by 1.5–2.5%. CycleGAN-based domain translation also improved out-of-distribution performance dramatically (Dice from 0.09 to 0.66 for kidneys going from contrast to non-contrast CT). This is an emerging area with high potential.

Post-Processing Refinement

Raw model predictions often contain artifacts that simple post-processing can fix. These are “free” improvements that don’t require retraining.

Hierarchical Enforcement for BraTS

BraTS regions are nested: ET ⊂ TC ⊂ WT. But the model can produce predictions that violate this hierarchy (e.g., ET voxels outside TC). Post-processing enforces the hierarchy: any ET voxel must also be TC, any TC voxel must also be WT. If you’re using nnU-Net v2’s region-based training, this is handled automatically. Otherwise, apply it as a post-processing step.
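On binary region masks, enforcing the hierarchy is a two-line clamp (a sketch of one reasonable policy: voxels violating the hierarchy are removed from the inner region rather than added to the outer one):

```python
import numpy as np

def enforce_hierarchy(wt, tc, et):
    """Clamp nested boolean masks so that ET is a subset of TC and TC of WT."""
    tc = np.logical_and(tc, wt)  # TC voxels outside WT are dropped
    et = np.logical_and(et, tc)  # ET voxels outside (clamped) TC are dropped
    return wt, tc, et
```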

Connected Component Filtering

Small isolated predictions (a few voxels of “tumor” floating in normal brain) are almost always false positives. Remove connected components below a size threshold. nnU-Net automatically determines whether this helps based on cross-validation. For brain tumors, keeping only the largest connected component per class is often effective.
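A self-contained sketch of keeping only the largest 6-connected component of a 3D mask (pure-NumPy BFS for illustration; in practice you would use `scipy.ndimage.label` or `cc3d`, which are far faster):

```python
from collections import deque
import numpy as np

def largest_component(mask):
    """Keep only the largest 6-connected component of a 3D boolean mask."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    best, best_size = np.zeros_like(mask), 0
    offs = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for start in zip(*np.nonzero(mask)):
        if seen[start]:
            continue
        comp, q = [start], deque([start])  # BFS over one component
        seen[start] = True
        while q:
            x, y, z = q.popleft()
            for dx, dy, dz in offs:
                n = (x + dx, y + dy, z + dz)
                if all(0 <= n[i] < mask.shape[i] for i in range(3)) \
                        and mask[n] and not seen[n]:
                    seen[n] = True
                    comp.append(n)
                    q.append(n)
        if len(comp) > best_size:
            best_size = len(comp)
            best = np.zeros_like(mask)
            for v in comp:
                best[v] = True
    return best
```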

Conditional Random Fields (CRFs)

CRFs refine boundaries by incorporating image intensity information into the segmentation. Posterior-CRF (an end-to-end method using CNN features) outperformed other CNN-CRF approaches for multiple anatomies. For brain tumors, multi-level CRFs have been used to detect small lesions with 90% detection rate and very few false positives. CRFs add inference time but can meaningfully improve boundary quality.

Test-Time Adaptation

An emerging approach: adapting the model’s parameters to each test case at inference time using self-supervised objectives (no labels needed). Information-geometric approaches improved generalization for brain tumor segmentation across domains. Uncertainty-guided test-time optimization dynamically adjusts per-patient, improving Dice, HD95, and mean surface distance across multiple datasets.

Semi-Supervised & Self-Supervised Learning

Labeled medical data is scarce and expensive. But unlabeled data is abundant — every MRI scan that doesn’t have expert segmentation is potential training data. Semi-supervised methods leverage this.

Pseudo-Labeling

Train a model on your labeled data, use it to predict segmentations on unlabeled data (pseudo-labels), then retrain on both real labels and pseudo-labels. The key challenge: noisy pseudo-labels can hurt more than they help. Self-aware confidence estimation for selecting only reliable pseudo-labels improved performance on cardiac, pancreas, and nerve segmentation tasks. This simple loop is surprisingly effective when done carefully.
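A crude confidence filter for pseudo-labels, keeping only cases whose mean top-class probability clears a threshold (a simple stand-in for the self-aware confidence estimation cited above; the function name and 0.9 threshold are illustrative):

```python
import numpy as np

def select_pseudo_labels(softmax_maps, threshold=0.9):
    """Return (index, label_map) only for cases whose mean top-class
    softmax confidence is at least `threshold`."""
    kept = []
    for idx, sm in enumerate(softmax_maps):      # sm: (C, ...) softmax per case
        confidence = sm.max(axis=0).mean()       # mean top-class probability
        if confidence >= threshold:
            kept.append((idx, sm.argmax(axis=0)))
    return kept
```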

Consistency Regularization

Encourage the model to produce the same prediction under different perturbations of the same input. If the model predicts differently when you flip or add noise to the image, it’s not confident, and this inconsistency provides a training signal without labels. The Mean Teacher approach and its variants have demonstrated effectiveness on skin lesion, optic disc, and liver segmentation.

Active Learning: Choosing What to Label

If you have a limited annotation budget, which images should you label? Active learning selects the most informative samples. One framework achieved full accuracy using only 22.7–48.9% of available data by selecting samples that maximally reduce model uncertainty. Fisher information-based selection outperformed competing methods after labeling less than 0.25% of the target data. This is directly relevant if you have clinical data without annotations.

Knowledge Distillation & Model Compression

A 5-fold ensemble with TTA might be great for a challenge submission, but it’s too slow for clinical deployment. Knowledge distillation trains a small, fast “student” model to mimic the large, slow “teacher” ensemble.

Teacher-Student Framework

Train a large ensemble (the teacher) to generate soft labels (probability maps). Then train a smaller network (the student) on these soft labels. The student learns the teacher’s “knowledge” without needing the teacher’s size. One framework achieved up to 32.6% Dice improvement in the student network by transferring semantic region information. LCOV-Net outperformed nnU-Net with only one-fifth the parameters.
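The core distillation objective can be sketched as soft-label cross-entropy between the teacher's probabilities and the student's temperature-softened softmax (a generic formulation; the cited frameworks add region- and graph-level terms on top of this):

```python
import numpy as np

def distill_loss(student_logits, teacher_probs, T=2.0, eps=1e-8):
    """Cross-entropy of teacher soft labels against the student's
    temperature-scaled softmax (class axis first)."""
    z = student_logits / T
    z = z - z.max(axis=0, keepdims=True)  # numerical stability
    s = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)
    return float(-np.mean(np.sum(teacher_probs * np.log(s + eps), axis=0)))
```

The loss is smallest when the student's softened distribution matches the teacher's, which is what lets the small network absorb the ensemble's probability structure rather than just its hard labels.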

Extreme Compression

Growing Teacher Assistant Networks enabled a student with only 8% of the teacher’s parameters (100k total) to achieve comparable performance, with inference in just 13ms. With only 2% of parameters (30k), the model still maintained 95%+ of teacher performance. Graph flow distillation improved ET and TC Dice by 5% and 3% respectively in the student. This makes real-time clinical deployment feasible.

💡
The BraTS competition strategy in full: (1) Train nnU-Net 3d_fullres as baseline. (2) Enable region-based training. (3) Try the residual encoder. (4) Train one alternative architecture (transformer-based). (5) Ensemble the top performers with optimized weights. (6) Add TTA at inference. (7) Apply hierarchical post-processing. Each step adds a fraction of a Dice point, but together they compound to competitive performance.

This Week’s Learning Resources

Hands-On Practice

Instructions for running cross-configuration ensembling and enabling test-time augmentation in nnU-Net. Start by comparing results with and without TTA on your validation set.
Guide for creating custom trainer classes. This week, try creating a trainer that adds boundary loss to the default Dice+CE. Test it on one fold and compare HD95.
Load your model’s predictions and ground truth side by side. Use the “Compare Volumes” module to overlay differences. Identify the failure patterns from Section 8.4.

Key Papers

The foundational boundary loss paper. Formulates segmentation as a distance metric on contour space. Essential reading for improving HD95 scores.
Med Image Anal. 2021;67:101851
Three methods for directly optimizing HD as a loss function, achieving 18–45% HD reduction without sacrificing Dice. The go-to paper when your HD95 is poor.
IEEE Trans Med Imaging. 2020;39(2):499–513
Demonstrates the highest-impact nnU-Net modification for BraTS: fusing transformer modules into deeper layers. Achieved 0.936/0.921/0.872 for WT/TC/ET with ablation showing transformers improved TC most.
Phys Med Biol. 2023;68(23):235009
+3% ET Dice and +1% average Dice through dynamic hard-voxel weighting. The highest-impact single loss function change for brain tumor segmentation.
Med Phys. 2023;50(4):2203–2215
The definitive benchmark of 20 loss functions across 6 datasets. Compound losses win. Essential for choosing your loss function strategy.
Med Image Anal. 2021;71:102035

Deep Dives (Advanced)

Proves medical pre-training beats ImageNet for segmentation. The evidence base for choosing your pre-training strategy.
Cross-layer graph distillation improving ET Dice by 5% in student networks. Key paper for deploying compressed models.
Combining pixel entropy, regional consistency, and image diversity for smart sample selection. Reduce your annotation budget while maintaining accuracy.
Realistic adversarial augmentation chains for medical imaging. Improves generalization for low-shot and semi-supervised settings.