Week 07 — Q3

Understanding Evaluation Metrics for Image Segmentation

You’ve trained your first models. Now you need to know if they’re actually good — and “good” is more complicated than a single number. This week you’ll learn every metric used to evaluate brain tumor segmentation, what each one rewards and penalizes, and the pitfalls that make experienced researchers misinterpret their own results.

Why One Number Is Never Enough

Imagine two brain tumor segmentation models. Model A achieves a Dice score of 0.88. Model B achieves 0.87. Model A is better, right? Not necessarily. Model A might have excellent overlap with the tumor but terrible boundary accuracy — its predictions are blobby and imprecise. Model B might have slightly less overlap but much sharper, more clinically useful boundaries. A radiation oncologist planning treatment would prefer Model B every time.

This is why the BraTS challenge uses multiple metrics, and why understanding what each one measures is essential. A landmark 2024 Nature Methods paper by the Metrics Reloaded consortium documented dozens of pitfalls in how researchers choose and interpret metrics, and a 2018 Nature Communications study showed that different metrics produce different challenge winners — meaning the “best” algorithm depends entirely on how you define “best.”

Key numbers this week:

- 8–25%: departure from linearity between Dice score and clinical acceptability
- 0.74–0.85: human inter-rater Dice on BraTS — the baseline for “good enough”
- 100–200: test samples needed for a reliable 1% confidence interval on Dice

Dice Similarity Coefficient (DSC)

Dice is the most widely used segmentation metric and the primary ranking metric in BraTS. It measures the overlap between your predicted segmentation and the ground truth.

DSC = 2|A ∩ B| / (|A| + |B|)
Where A = predicted segmentation, B = ground truth. Range: 0 (no overlap) to 1 (perfect overlap). Equivalent to the F1-score at the voxel level.
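The formula translates directly into a few lines of NumPy. A minimal sketch on binary masks (the both-empty convention of Dice = 1.0 is one common choice, not a universal rule):

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Hard Dice: 2|A ∩ B| / (|A| + |B|) on binary masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: score as perfect agreement
    return 2.0 * intersection / denom

# Toy 1D example: 3 of 4 predicted voxels overlap 4 true voxels
pred = np.array([1, 1, 1, 1, 0, 0])
truth = np.array([0, 1, 1, 1, 1, 0])
print(dice_coefficient(pred, truth))  # 2*3 / (4+4) = 0.75
```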

What Does a Dice Score Actually Mean?

A Dice of 0.85 means the overlap region is 85% of the average of the predicted and true volumes. For BraTS, state-of-the-art models achieve roughly: WT: 0.90–0.93 (whole tumor is large and well-defined), TC: 0.87–0.92 (tumor core is medium-sized), ET: 0.82–0.87 (enhancing tumor is smallest and hardest). Human inter-rater agreement is 0.74–0.85, so any model above ~0.85 is performing at or above expert-level agreement.

⚠️
The Dice trap: A study demonstrated that the same Dice value represents different levels of clinical acceptability for different structures depending on their size, shape, and complexity. The departure from linearity reached 8–25%. A Dice of 0.85 on a large whole tumor means something very different than 0.85 on a tiny enhancing tumor. Additionally, Dice has a theoretical region-size bias — it’s inherently easier to get a high Dice on large structures and harder on small ones.

Soft Dice vs Hard Dice

During training, soft Dice operates on continuous probability maps (0.0–1.0) so it’s differentiable and can be used as a loss function. During evaluation, hard Dice is computed on binarized predictions (each voxel is either 0 or 1). This distinction matters: a model optimized with soft Dice loss directly optimizes for the evaluation metric, which is why Dice-based losses outperform cross-entropy when the evaluation metric is Dice.
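The distinction is easy to see on a toy example (the probability values below are invented for illustration; in a real pipeline soft Dice operates on framework tensors so it stays differentiable):

```python
import numpy as np

def soft_dice(probs, truth, eps=1e-6):
    """Soft Dice on continuous probabilities (1 - soft_dice is the training loss)."""
    inter = (probs * truth).sum()
    return (2.0 * inter + eps) / (probs.sum() + truth.sum() + eps)

def hard_dice(probs, truth, thresh=0.5, eps=1e-6):
    """Hard Dice on the thresholded prediction, as used at evaluation time."""
    pred = (probs >= thresh).astype(float)
    inter = (pred * truth).sum()
    return (2.0 * inter + eps) / (pred.sum() + truth.sum() + eps)

probs = np.array([0.9, 0.6, 0.4, 0.1])  # illustrative voxel probabilities
truth = np.array([1.0, 1.0, 0.0, 0.0])
print(soft_dice(probs, truth))   # ~0.75: penalizes low-confidence correct voxels
print(hard_dice(probs, truth))   # ~1.00: binarization hides the uncertainty
```

Soft Dice is lower here precisely because it rewards confident correct predictions, which is what makes it a useful training signal.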

📚
Statistical note: Dice scores are not normally distributed. A 2004 study recommended applying a logit transformation before using parametric statistical tests like ANOVA. For comparing two models, use the Wilcoxon signed-rank test (non-parametric, paired). For confidence intervals, bootstrap methods are reliable. nnU-Net uses the Wilcoxon test internally for comparing configurations.
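A sketch of that workflow with SciPy; the per-patient scores below are synthetic and purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic paired per-patient Dice scores for two models (illustrative values)
dice_a = np.clip(rng.normal(0.88, 0.05, size=30), 0.01, 0.99)
dice_b = np.clip(dice_a - rng.normal(0.01, 0.02, size=30), 0.01, 0.99)

# Non-parametric paired comparison: no normality assumption needed
stat, p = stats.wilcoxon(dice_a, dice_b)
print(f"Wilcoxon signed-rank p = {p:.4f}")

# Logit transform if a parametric test (e.g. ANOVA) is unavoidable
logit_a = np.log(dice_a / (1.0 - dice_a))
```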

Hausdorff Distance (HD95)

While Dice measures overall volume overlap, the Hausdorff Distance measures the worst-case boundary error — how far the predicted boundary deviates from the true boundary at its worst point. It catches errors that Dice misses: a prediction that is mostly correct but has one wildly wrong region.

HD(A, B) = max( max_{a ∈ ∂A} min_{b ∈ ∂B} d(a, b),  max_{b ∈ ∂B} min_{a ∈ ∂A} d(a, b) )
The maximum of the two directed Hausdorff distances, where ∂A and ∂B are the predicted and reference surfaces. Measured in millimeters. Lower is better.

The standard Hausdorff Distance is extremely sensitive to single outlier predictions — one misclassified voxel 50mm away from the tumor boundary will produce an HD of 50mm, even if 99.99% of the prediction is perfect. This is why BraTS uses HD95 — the 95th percentile of surface distances, which is robust to isolated outliers while still capturing systematic boundary errors.
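The percentile idea can be sketched with SciPy distance transforms. This is a simplified version for intuition only; the official BraTS toolkit handles connectivity choices and edge cases (e.g. empty masks) that this sketch ignores:

```python
import numpy as np
from scipy import ndimage

def hd95_sketch(pred, truth, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile symmetric surface distance on binary 3D masks."""
    def surface(mask):
        # Surface voxels = mask voxels removed by one erosion step
        return mask & ~ndimage.binary_erosion(mask)

    sp, st = surface(pred.astype(bool)), surface(truth.astype(bool))
    # Distance from every voxel to the nearest surface voxel of the other mask
    dist_to_truth = ndimage.distance_transform_edt(~st, sampling=spacing)
    dist_to_pred = ndimage.distance_transform_edt(~sp, sampling=spacing)
    return max(np.percentile(dist_to_truth[sp], 95),   # pred surface -> truth
               np.percentile(dist_to_pred[st], 95))    # truth surface -> pred

cube = np.zeros((12, 12, 12), dtype=bool)
cube[2:8, 2:8, 2:8] = True
print(hd95_sketch(cube, cube))  # identical masks -> 0.0
```

Replacing the 95th percentile with `max()` recovers the standard Hausdorff Distance, including its outlier sensitivity.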

Clinical Interpretation

What HD95 Values Mean

HD95 < 5mm: Excellent. 95% of the boundary is within 5mm of truth. Generally clinically acceptable for treatment planning. Top BraTS models achieve this for WT and TC.

HD95 = 5–15mm: Moderate. Noticeable boundary errors that could affect surgical margins or radiation target volumes. Typical for ET (smaller, more irregular).

HD95 > 20mm: Poor. Substantial boundary errors that would likely impact clinical decisions. Indicates the model is failing in certain regions.

A study proposed three methods to directly optimize Hausdorff Distance as a loss function, achieving 18–45% HD reduction without degrading Dice. This matters because Dice and HD are not strongly correlated — improving one doesn’t automatically improve the other.

Normalized Surface Distance (NSD)

NSD is a newer metric adopted by recent BraTS challenges that answers a more clinically meaningful question: what fraction of the predicted surface is within an acceptable tolerance of the true surface?

NSD@τ = |{a ∈ ∂A : d(a, ∂B) ≤ τ}| / |∂A|
Fraction of predicted surface points within tolerance τ of the reference surface. τ is typically 1mm or 2mm based on clinical relevance. Range: 0 to 1.

The tolerance τ is the key parameter. Setting τ = 2mm means “we don’t care about boundary errors smaller than 2mm.” This is clinically motivated — a 1mm boundary error is within the range of human inter-rater variability and has no practical impact on treatment planning. NSD essentially filters out noise in the boundary assessment.
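NSD can be sketched with the same distance-transform machinery as HD95. Note this implements the symmetric variant used in practice (both surfaces are checked against each other), which differs slightly from the one-sided formula above:

```python
import numpy as np
from scipy import ndimage

def nsd_sketch(pred, truth, tau=2.0, spacing=(1.0, 1.0, 1.0)):
    """Fraction of surface points within tolerance tau (mm) of the other surface."""
    def surface(mask):
        return mask & ~ndimage.binary_erosion(mask)

    sp, st = surface(pred.astype(bool)), surface(truth.astype(bool))
    dist_to_truth = ndimage.distance_transform_edt(~st, sampling=spacing)
    dist_to_pred = ndimage.distance_transform_edt(~sp, sampling=spacing)
    within_pred = (dist_to_truth[sp] <= tau).sum()   # pred surface near truth
    within_truth = (dist_to_pred[st] <= tau).sum()   # truth surface near pred
    return (within_pred + within_truth) / (sp.sum() + st.sum())
```

For challenge submissions, prefer the dedicated surface-distance package over a hand-rolled version like this one.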

When NSD and Dice Disagree

NSD can reveal problems that Dice hides and vice versa. A prediction with excellent overall overlap (high Dice) but poor boundary definition (blurry, imprecise edges) will have a high Dice but low NSD. Conversely, a prediction with razor-sharp boundaries but systematic under-segmentation (the predicted region is entirely contained within the ground truth but much smaller) will have decent NSD but poor Dice. Both perspectives are needed for a complete picture. Top brain metastases detection models achieve NSD of 0.99 alongside Dice of 0.89–0.90.

Sensitivity, Specificity & Precision

These familiar classification metrics are applied voxel-by-voxel in segmentation. Each voxel is treated as a binary classification: tumor or not-tumor.

Sensitivity (Recall) = TP / (TP + FN)

Of all the voxels that are truly tumor, what fraction did the model correctly identify? High sensitivity = few missed tumor voxels. Clinically critical — missing part of a tumor can lead to under-treatment. A model with low sensitivity is “leaving tumor behind.”

Precision (PPV) = TP / (TP + FP)

Of all the voxels the model predicted as tumor, what fraction are actually tumor? High precision = few false alarms. Clinically important because false positives can lead to unnecessary treatment of healthy tissue. The precision-recall trade-off is fundamental: you can increase sensitivity by predicting more voxels as tumor, but precision drops.

Specificity = TN / (TN + FP)

Of all the non-tumor voxels, what fraction did the model correctly identify as non-tumor? In brain tumor segmentation, specificity is almost always > 0.99 because the background (non-tumor brain) vastly outnumbers the tumor. This makes specificity a poor discriminator between algorithms — every model gets it right on the background. A 2009 study showed specificity incorporates image background properties that obscure true segmentation quality.

💡
For BraTS, focus on sensitivity and precision (or equivalently, Dice). Specificity is reported but is not informative for ranking algorithms because the class imbalance makes it artificially high for everyone. Dice is mathematically the harmonic mean of precision and recall, so it captures the precision-recall trade-off in a single number.
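All of these fall out of four confusion counts, and the harmonic-mean identity is easy to verify numerically. A minimal sketch:

```python
import numpy as np

def voxelwise_stats(pred, truth):
    """Sensitivity, precision, specificity, and Dice from raw confusion counts."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    tn = np.sum(~pred & ~truth)
    sens = tp / (tp + fn)
    prec = tp / (tp + fp)
    spec = tn / (tn + fp)
    dice = 2 * tp / (2 * tp + fp + fn)  # = harmonic mean of prec and sens
    return sens, prec, spec, dice

pred = np.array([1, 1, 1, 1, 0, 0])
truth = np.array([0, 1, 1, 1, 1, 0])
sens, prec, spec, dice = voxelwise_stats(pred, truth)
print(dice, 2 * prec * sens / (prec + sens))  # both 0.75
```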

Volumetric Metrics & RANO

In clinical practice, what matters most is often not the pixel-perfect boundary but the overall volume of the tumor. Volumetric metrics measure whether your segmentation gets the total tumor size right, which directly feeds into treatment response assessment.

Why Volume Matters: The RANO Connection

The RANO criteria (Week 1) classify tumors as responding, stable, or progressing based on size changes. Traditional RANO used 2D diameter measurements, but RANO 2.0 now includes volumetric assessment. The landmark Kickingereder et al. study (Lancet Oncology, 2019) showed that volumetric measurements are more reliable than 2D measurements because brain tumors grow in complex, non-spherical shapes. For brain metastases, a 30% unidimensional reduction corresponds to roughly 65% volumetric reduction, and volumetric changes of ≥20% are reproducible between readers.

A study of RANO-based assessment on lower-grade gliomas found poor-to-moderate inter-operator reproducibility (correlation r = 0.28–0.82, accuracy 21.0%) with traditional 2D measurements. Automated volumetric segmentation directly addresses this problem.

Key Volumetric Metrics

Absolute Volume Difference: |Vpred − Vtruth| in mm³ or mL. Simple and clinically interpretable.

Relative Volume Difference: (Vpred − Vtruth) / Vtruth. Positive = over-segmentation, negative = under-segmentation. A model with 0% relative volume difference and 0.85 Dice has the right total volume but imperfect spatial placement.

Volume Correlation: Correlation between predicted and true volumes across patients. High correlation (> 0.9) means the model tracks size changes reliably, even if individual predictions aren’t pixel-perfect.
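All three follow directly from voxel counts and voxel spacing. A minimal sketch (assumes spacing is given in mm; volume correlation would simply be `np.corrcoef` over per-patient volumes):

```python
import numpy as np

def volume_metrics(pred, truth, spacing=(1.0, 1.0, 1.0)):
    """Absolute (mL) and relative volume difference from binary masks."""
    voxel_ml = np.prod(spacing) / 1000.0  # mm^3 per voxel -> mL
    v_pred = pred.sum() * voxel_ml
    v_truth = truth.sum() * voxel_ml
    abs_diff = abs(v_pred - v_truth)
    rel_diff = (v_pred - v_truth) / v_truth  # + over-seg, - under-seg
    return v_pred, v_truth, abs_diff, rel_diff
```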

⚠️
Volume can be right while spatial accuracy is wrong. A model could predict the correct total tumor volume but in the wrong location — volumetric metrics alone don’t catch this. Conversely, Dice can be moderate while volume is accurate. This is why reporting both overlap metrics (Dice) and volumetric metrics gives the most complete picture.

How BraTS Evaluates Your Submission

Understanding exactly how BraTS computes your score is essential for competition strategy. Here’s the complete evaluation framework:

The Three Evaluation Regions

Your model predicts individual label classes (NCR, ED, ET), but BraTS evaluates on three nested regions that combine those classes. Every metric is computed separately for each region:

Region | Composed Of | Labels | Clinical Meaning | Typical Dice
Whole Tumor (WT) | NCR + ED + ET | {1, 2, 4} | Total disease extent | 0.90–0.93
Tumor Core (TC) | NCR + ET | {1, 4} | Solid tumor mass | 0.87–0.92
Enhancing Tumor (ET) | ET only | {4} | Active, growing tumor | 0.82–0.87

Metrics Per Region

For each of the three regions, BraTS computes: Dice Similarity Coefficient, HD95 (95th percentile Hausdorff Distance), and in recent iterations Normalized Surface Distance, plus sensitivity and specificity. That’s up to 15 numbers per patient.

How Rankings Work

Participants are ranked based on aggregate performance across all metrics and regions. However, a Nature Communications study demonstrated that rankings are not robust to changes in test data, ranking scheme, or annotators. Different weighting of metrics and regions can produce different winners from the same results. This means marginal improvements (<0.5% Dice) are often within noise.

💡
Competition strategy: Don’t optimize for a single metric. Focus on improving your worst region (usually ET) because aggregate rankings penalize inconsistency. A model with 0.90/0.88/0.85 across WT/TC/ET will typically rank higher than one with 0.93/0.90/0.78, even though their average Dice is nearly identical (0.877 vs 0.870).

Lesion-Wise Evaluation for Brain Metastases

For brain metastases (BraTS-METS), voxel-wise Dice isn’t enough. A patient might have 15 small metastases. A model could achieve a high patient-wise Dice by correctly segmenting the 5 largest lesions while completely missing 10 small ones. Clinically, those missed lesions could be catastrophic.

Lesion-wise metrics address this by evaluating each individual lesion separately:

Lesion-Wise Detection

Lesion-wise sensitivity: Of all true lesions, what fraction was detected? A systematic review of 42 studies reported pooled lesion-wise sensitivity of 87%. The best current models achieve 98% sensitivity internally and 97.4% externally, with sensitivity of 93.3% even for tiny lesions <3mm.

False positive rate per patient: How many “phantom” lesions does the model hallucinate? Top models achieve just 0.6 false positives per patient.

Lesion-wise Dice: Dice computed per individual lesion, then averaged. Pooled lesion-wise Dice across 42 studies was 79% — lower than patient-wise Dice because small lesions drag the average down.

Size-stratified evaluation is critical: models typically perform well on large metastases (>12mm: sensitivity 98%, FPR 0.3) but struggle with small ones. In one study, the model even detected 7 lesions that human readers had missed during manual delineation.
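Lesion matching is usually done with connected components. A simplified sketch where any voxel overlap counts as a detection — challenge toolkits use stricter matching criteria (e.g. minimum overlap thresholds), so treat this as the idea, not the official rule:

```python
import numpy as np
from scipy import ndimage

def lesionwise_detection(pred, truth):
    """Lesion-level sensitivity and false-positive count via connected components."""
    truth_lab, n_truth = ndimage.label(truth)
    pred_lab, n_pred = ndimage.label(pred)

    # A true lesion counts as detected if any predicted voxel overlaps it
    detected = sum(np.any(pred_lab[truth_lab == i])
                   for i in range(1, n_truth + 1))
    # A predicted component overlapping no true lesion is a phantom lesion
    false_pos = sum(not np.any(truth_lab[pred_lab == j])
                    for j in range(1, n_pred + 1))
    sensitivity = detected / n_truth if n_truth else 1.0
    return sensitivity, false_pos
```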

Metric Pitfalls That Trip Up Everyone

Pitfall 1: High Dice ≠ Clinical Utility

A critical review found that commonly used geometric indices like DSC are not well correlated with clinically meaningful endpoints. A study on stroke lesion segmentation showed that nnU-Net achieved excellent segmentation metrics but failed to detect therapy-induced volume reductions, leading to false-negative study outcomes. Two segmentations with similar Dice can produce significantly different medical treatment results depending on whether they over-contour or under-contour the tumor.

Pitfall 2: Averaging Across Patients

Reporting a single “mean Dice” across all patients hides critical variation. A patient with a 100mL tumor and a patient with a 2mL tumor contribute equally to the mean, but their clinical significance and segmentation difficulty are vastly different. Always report per-region metrics and consider median (more robust to outliers) alongside mean. Better yet, report confidence intervals — a 2025 study showed you need 100–200 test samples for a reliable 1% confidence interval.

Pitfall 3: Dice is Overconfident About Large Regions

Dice has an intrinsic region-size bias. For large structures (whole tumor), even a sloppy prediction achieves a respectable Dice because the large overlapping volume dominates. For small structures (enhancing tumor), a small boundary error causes a large Dice drop. This is why WT Dice is always highest and ET Dice is always lowest — it’s partly the metric, not just the model.

Pitfall 4: Calibration Matters for Deployment

If you’re deploying a model clinically, you need to trust its confidence scores, not just its binary predictions. A study showed that models trained with Dice loss produce overconfident, poorly calibrated predictions. A model says “90% confident this is tumor” but is only right 70% of the time. Model ensembling (what nnU-Net does) improves calibration. For serious clinical use, calibration metrics (Expected Calibration Error) should be tracked alongside Dice and HD95.
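For the binary voxel-wise case, ECE can be sketched as follows. This bins foreground probabilities and compares them to the observed tumor fraction; multi-class ECE instead uses the max-probability confidence, so this is one variant, not the only definition:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary voxel-wise ECE: occupancy-weighted gap between mean predicted
    probability and observed tumor fraction in each confidence bin."""
    probs, labels = probs.ravel(), labels.ravel().astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs > lo) & (probs <= hi)
        if not in_bin.any():
            continue
        gap = abs(probs[in_bin].mean() - labels[in_bin].mean())
        ece += in_bin.mean() * gap
    return ece

probs = np.array([0.8] * 10)           # model says 80% everywhere
labels = np.array([1] * 8 + [0] * 2)   # and is right 80% of the time
print(expected_calibration_error(probs, labels))  # ~0.0: well calibrated
```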

Pitfall 5: Different Implementations Give Different Results

The “same” metric can produce different numbers depending on the implementation. Hausdorff Distance is particularly sensitive: choices about connectivity (6-connected vs 26-connected in 3D), voxel spacing handling, and edge cases (what happens when a class is absent) all affect the result. A 2015 paper proposed an efficient algorithm that outperforms the standard ITK implementation. Use the official BraTS evaluation toolkit for challenge submissions to ensure your numbers match the leaderboard’s.

Statistical Comparison of Models

Saying “Model A got 0.88 Dice and Model B got 0.87 Dice, so Model A is better” is not science. You need to determine whether the difference is statistically significant and practically meaningful.

Recommended Statistical Approach

Paired comparisons: Use the Wilcoxon signed-rank test (non-parametric, doesn’t assume normality, appropriate for Dice scores). Both models must be evaluated on the same test cases.

Confidence intervals: Bootstrap confidence intervals are reliable without distributional assumptions. A 2025 study showed that parametric CIs are reasonable approximations of bootstrap estimates for segmentation metrics.

Multiple comparisons: If comparing more than two models, apply Bonferroni or Holm correction to control the family-wise error rate. Comparing 10 models at p=0.05 without correction means you’ll find ~1 false positive by chance.

Effect size: Report effect sizes alongside p-values. A statistically significant difference of 0.002 Dice points may not be clinically meaningful.

Sample size: A statistical power calculation for segmentation showed that the predicted sample size was accurate within 4 subjects of Monte Carlo estimates. For a 1% wide CI, plan for 100–200 test cases when variance is low.
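The bootstrap CI from the list above can be sketched in a few lines (percentile bootstrap on the mean; the per-patient scores below are invented for illustration, and BCa intervals are a common refinement):

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-patient scores."""
    rng = np.random.default_rng(seed)
    means = np.array([rng.choice(scores, size=len(scores), replace=True).mean()
                      for _ in range(n_boot)])
    return (np.percentile(means, 100 * alpha / 2),
            np.percentile(means, 100 * (1 - alpha / 2)))

# Hypothetical per-patient Dice scores for one model
scores = np.array([0.82, 0.91, 0.88, 0.79, 0.93, 0.85, 0.90, 0.87])
lo, hi = bootstrap_ci(scores, n_boot=2000)
print(f"mean = {scores.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```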

Computing Metrics in Python

import numpy as np
from medpy.metric.binary import dc, hd95, sensitivity, specificity
import nibabel as nib

# Load prediction and ground truth
pred = nib.load('prediction.nii.gz').get_fdata()
gt = nib.load('ground_truth.nii.gz').get_fdata()
spacing = nib.load('prediction.nii.gz').header.get_zooms()

# Compute BraTS regions
regions = {
  'WT': (lambda x: np.isin(x, [1, 2, 4])),
  'TC': (lambda x: np.isin(x, [1, 4])),
  'ET': (lambda x: x == 4),
}

for name, region_fn in regions.items():
  p = region_fn(pred).astype(np.uint8)
  g = region_fn(gt).astype(np.uint8)

  if g.sum() == 0 and p.sum() == 0:
    print(f"{name}: Both empty (Dice=1.0, HD95=0.0)")
    continue
  if g.sum() == 0 or p.sum() == 0:
    print(f"{name}: One side empty (Dice=0.0)")
    continue

  dice = dc(p, g)
  hausdorff = hd95(p, g, voxelspacing=spacing)
  sens = sensitivity(p, g)
  spec = specificity(p, g)

  print(f"{name}: Dice={dice:.4f}, HD95={hausdorff:.2f}mm, "
        f"Sens={sens:.4f}, Spec={spec:.4f}")
Python Packages for Metrics

medpy: pip install medpy — Provides Dice, Jaccard, HD, HD95, ASSD, sensitivity, specificity. The most commonly used package.

surface-distance (DeepMind): pip install surface-distance — Computes surface distances, NSD, and surface Dice. More accurate for boundary metrics.

miseval: pip install miseval — Standardized metric library following the Müller et al. guideline. Designed for reproducible evaluation.

SimpleITK: Built-in OverlapMeasuresImageFilter for Dice, Jaccard, volume similarity, false positive/negative rates.

Metric Summary Table

Metric | Measures | Range | Strengths | Limitations
Dice (DSC) | Volumetric overlap | 0–1 (↑) | Simple, F1-equivalent, primary BraTS metric | Size-biased, non-linear with acceptability
HD95 | Worst-case boundary error (95th %ile) | 0–∞ mm (↓) | Catches boundary errors Dice misses, robust to outliers | Ignores interior accuracy
NSD@τ | Surface accuracy within tolerance | 0–1 (↑) | Clinically interpretable tolerance | Requires τ selection
Sensitivity | True positive rate | 0–1 (↑) | Captures missed tumor | Ignores false positives
Precision | Positive predictive value | 0–1 (↑) | Captures false alarms | Ignores missed tumor
Specificity | True negative rate | 0–1 (↑) | Standard classification metric | Inflated by large background; poor discriminator
Volume Diff | Total volume error | 0–∞ (↓) | Directly clinically relevant (RANO) | Insensitive to spatial errors
Lesion Sensitivity | Per-lesion detection rate | 0–1 (↑) | Essential for metastases | Requires lesion matching

This Week’s Learning Resources

Essential Reading

The definitive reference for metric selection. Domain-agnostic taxonomy of pitfalls created by a multidisciplinary Delphi process. Includes a decision tree for choosing appropriate metrics based on task type. Read the “pitfalls” sections at minimum.
Nat Methods. 2024;21:195–212
Practical guideline: report DSC + at least one distance metric (HD95 or ASSD), report per-class metrics, use appropriate statistical tests, address class imbalance. Your checklist for reporting results.
BMC Res Notes. 2022;15:210
Demonstrates that algorithm rankings are not robust to test data, ranking scheme, or annotators. Essential context for understanding challenge leaderboards.
Nat Commun. 2018;9:5217

Metric Deep Dives

Proves metric-sensitive losses (soft Dice, soft Jaccard) outperform cross-entropy when evaluating with Dice. The theoretical foundation for understanding loss-metric alignment.
IEEE Trans Med Imaging. 2020;39(11):3679–3690
Theoretical analysis of Dice’s intrinsic bias toward specific region sizes. Explains why Dice works well for imbalanced segmentation but struggles with diverse class proportions.
Med Image Anal. 2024;91:103015
Three methods to optimize HD as a loss function, achieving 18–45% HD reduction. Read this when you want to improve boundary accuracy specifically.
IEEE Trans Med Imaging. 2020;39(2):499–513
How many test cases you need for reliable performance estimates. 100–200 for a 1% CI when variance is low; 1000+ for difficult tasks. Essential for planning validation studies.
Med Image Anal. 2025;103:103565

Tools

The most commonly used package for computing Dice, HD95, ASSD, sensitivity, specificity. pip install medpy. Start here for your evaluation pipeline.
Computes surface distances, NSD, and surface Dice. More precise than medpy for boundary-focused metrics. Use for NSD computation.
Open-source Python package implementing the Müller et al. evaluation guideline. Designed for reproducible, standardized evaluation. pip install miseval.

Advanced Topics

Shows Dice loss produces overconfident predictions. Model ensembling improves calibration. Critical for clinical deployment where confidence scores matter.
Meta-analysis of 42 studies: pooled lesion-wise Dice 79%, patient-wise sensitivity 86%, lesion-wise sensitivity 87%. The reference for understanding brain metastasis evaluation.
Shows that common geometric indices are not well correlated with clinical endpoints. Argues for multi-domain evaluation including dosimetric metrics and physician assessment.