Week 07 — Q3

Understanding Evaluation Metrics for Image Segmentation

You’ve trained your first models. Now you need to know if they’re actually good — and “good” is more complicated than a single number. This week you’ll learn every metric used to evaluate brain tumor segmentation, what each one rewards and penalizes, and the pitfalls that make experienced researchers misinterpret their own results.

Why One Number Is Never Enough

Imagine two brain tumor segmentation models. Model A achieves a Dice score of 0.88. Model B achieves 0.87. Model A is better, right? Not necessarily. Model A might have excellent overlap with the tumor but terrible boundary accuracy — its predictions are blobby and imprecise. Model B might have slightly less overlap but much sharper, more clinically useful boundaries. A radiation oncologist planning treatment would prefer Model B every time.

This is why the BraTS challenge uses multiple metrics, and why understanding what each one measures is essential. A landmark 2024 Nature Methods paper by the Metrics Reloaded consortium documented dozens of pitfalls in how researchers choose and interpret metrics, and a 2018 Nature Communications study showed that different metrics produce different challenge winners — meaning the “best” algorithm depends entirely on how you define “best.”

Key numbers this week:

- 8–25%: departure from linearity between Dice score and clinical acceptability
- 0.74–0.85: human inter-rater Dice on BraTS — the baseline for “good enough”
- 100–200: test samples needed for a reliable 1% confidence interval on Dice

Dice Similarity Coefficient (DSC)

Dice is the most widely used segmentation metric and the primary ranking metric in BraTS. It measures the overlap between your predicted segmentation and the ground truth.

DSC = 2|A ∩ B| / (|A| + |B|)
Where A = predicted segmentation, B = ground truth. Range: 0 (no overlap) to 1 (perfect overlap). Equivalent to the F1-score at the voxel level.
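The formula translates directly into a few lines of NumPy. A minimal sketch on binary masks (the both-empty convention of Dice = 1.0 is one common choice, not a universal rule):

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Hard Dice: 2|A ∩ B| / (|A| + |B|) on binary masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: score as perfect agreement
    return 2.0 * intersection / denom

# Toy 1D example: 3 of 4 predicted voxels overlap 4 true voxels
pred = np.array([1, 1, 1, 1, 0, 0])
truth = np.array([0, 1, 1, 1, 1, 0])
print(dice_coefficient(pred, truth))  # 2*3 / (4+4) = 0.75
```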

What Does a Dice Score Actually Mean?

A Dice of 0.85 means the overlap region is 85% of the average of the predicted and true volumes. For BraTS, state-of-the-art models achieve roughly: WT: 0.90–0.93 (whole tumor is large and well-defined), TC: 0.87–0.92 (tumor core is medium-sized), ET: 0.82–0.87 (enhancing tumor is smallest and hardest). Human inter-rater agreement is 0.74–0.85, so any model above ~0.85 is performing at or above expert-level agreement.

⚠️
The Dice trap: A study demonstrated that the same Dice value represents different levels of clinical acceptability for different structures depending on their size, shape, and complexity. The departure from linearity reached 8–25%. A Dice of 0.85 on a large whole tumor means something very different than 0.85 on a tiny enhancing tumor. Additionally, Dice has a theoretical region-size bias — it’s inherently easier to get a high Dice on large structures and harder on small ones.

Soft Dice vs Hard Dice

During training, soft Dice operates on continuous probability maps (0.0–1.0) so it’s differentiable and can be used as a loss function. During evaluation, hard Dice is computed on binarized predictions (each voxel is either 0 or 1). This distinction matters: a model optimized with soft Dice loss directly optimizes for the evaluation metric, which is why Dice-based losses outperform cross-entropy when the evaluation metric is Dice.
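The distinction is easy to see on a toy example (the probability values below are invented for illustration; in a real pipeline soft Dice operates on framework tensors so it stays differentiable):

```python
import numpy as np

def soft_dice(probs, truth, eps=1e-6):
    """Soft Dice on continuous probabilities (1 - soft_dice is the training loss)."""
    inter = (probs * truth).sum()
    return (2.0 * inter + eps) / (probs.sum() + truth.sum() + eps)

def hard_dice(probs, truth, thresh=0.5, eps=1e-6):
    """Hard Dice on the thresholded prediction, as used at evaluation time."""
    pred = (probs >= thresh).astype(float)
    inter = (pred * truth).sum()
    return (2.0 * inter + eps) / (pred.sum() + truth.sum() + eps)

probs = np.array([0.9, 0.6, 0.4, 0.1])  # illustrative voxel probabilities
truth = np.array([1.0, 1.0, 0.0, 0.0])
print(soft_dice(probs, truth))   # ~0.75: penalizes low-confidence correct voxels
print(hard_dice(probs, truth))   # ~1.00: binarization hides the uncertainty
```

Soft Dice is lower here precisely because it rewards confident correct predictions, which is what makes it a useful training signal.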

📚
Statistical note: Dice scores are not normally distributed. A 2004 study recommended applying a logit transformation before using parametric statistical tests like ANOVA. For comparing two models, use the Wilcoxon signed-rank test (non-parametric, paired). For confidence intervals, bootstrap methods are reliable. nnU-Net uses the Wilcoxon test internally for comparing configurations.
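A sketch of that workflow with SciPy; the per-patient scores below are synthetic and purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic paired per-patient Dice scores for two models (illustrative values)
dice_a = np.clip(rng.normal(0.88, 0.05, size=30), 0.01, 0.99)
dice_b = np.clip(dice_a - rng.normal(0.01, 0.02, size=30), 0.01, 0.99)

# Non-parametric paired comparison: no normality assumption needed
stat, p = stats.wilcoxon(dice_a, dice_b)
print(f"Wilcoxon signed-rank p = {p:.4f}")

# Logit transform if a parametric test (e.g. ANOVA) is unavoidable
logit_a = np.log(dice_a / (1.0 - dice_a))
```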

Hausdorff Distance (HD95)

While Dice measures overall volume overlap, the Hausdorff Distance measures the worst-case boundary error — how far the predicted boundary deviates from the true boundary at its worst point. It catches errors that Dice misses: a prediction that is mostly correct but has one wildly wrong region.

HD(A, B) = max( max_{a ∈ ∂A} min_{b ∈ ∂B} d(a, b),  max_{b ∈ ∂B} min_{a ∈ ∂A} d(a, b) )
The maximum of the two directed Hausdorff distances, where ∂A and ∂B are the predicted and reference surfaces. Measured in millimeters. Lower is better.

The standard Hausdorff Distance is extremely sensitive to single outlier predictions — one misclassified voxel 50mm away from the tumor boundary will produce an HD of 50mm, even if 99.99% of the prediction is perfect. This is why BraTS uses HD95 — the 95th percentile of surface distances, which is robust to isolated outliers while still capturing systematic boundary errors.
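The percentile idea can be sketched with SciPy distance transforms. This is a simplified version for intuition only; the official BraTS toolkit handles connectivity choices and edge cases (e.g. empty masks) that this sketch ignores:

```python
import numpy as np
from scipy import ndimage

def hd95_sketch(pred, truth, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile symmetric surface distance on binary 3D masks."""
    def surface(mask):
        # Surface voxels = mask voxels removed by one erosion step
        return mask & ~ndimage.binary_erosion(mask)

    sp, st = surface(pred.astype(bool)), surface(truth.astype(bool))
    # Distance from every voxel to the nearest surface voxel of the other mask
    dist_to_truth = ndimage.distance_transform_edt(~st, sampling=spacing)
    dist_to_pred = ndimage.distance_transform_edt(~sp, sampling=spacing)
    return max(np.percentile(dist_to_truth[sp], 95),   # pred surface -> truth
               np.percentile(dist_to_pred[st], 95))    # truth surface -> pred

cube = np.zeros((12, 12, 12), dtype=bool)
cube[2:8, 2:8, 2:8] = True
print(hd95_sketch(cube, cube))  # identical masks -> 0.0
```

Replacing the 95th percentile with `max()` recovers the standard Hausdorff Distance, including its outlier sensitivity.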

Clinical Interpretation

What HD95 Values Mean

HD95 < 5mm: Excellent. 95% of the boundary is within 5mm of truth. Generally clinically acceptable for treatment planning. Top BraTS models achieve this for WT and TC.

HD95 = 5–15mm: Moderate. Noticeable boundary errors that could affect surgical margins or radiation target volumes. Typical for ET (smaller, more irregular).

HD95 > 20mm: Poor. Substantial boundary errors that would likely impact clinical decisions. Indicates the model is failing in certain regions.

A study proposed three methods to directly optimize Hausdorff Distance as a loss function, achieving 18–45% HD reduction without degrading Dice. This matters because Dice and HD are not strongly correlated — improving one doesn’t automatically improve the other.

Normalized Surface Distance (NSD)

NSD is a newer metric adopted by recent BraTS challenges that answers a more clinically meaningful question: what fraction of the predicted surface is within an acceptable tolerance of the true surface?

NSD@τ = |{a ∈ ∂A : d(a, ∂B) ≤ τ}| / |∂A|
Fraction of predicted surface points within tolerance τ of the reference surface. τ is typically 1mm or 2mm based on clinical relevance. Range: 0 to 1.

The tolerance τ is the key parameter. Setting τ = 2mm means “we don’t care about boundary errors smaller than 2mm.” This is clinically motivated — a 1mm boundary error is within the range of human inter-rater variability and has no practical impact on treatment planning. NSD essentially filters out noise in the boundary assessment.
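NSD can be sketched with the same distance-transform machinery as HD95. Note this implements the symmetric variant used in practice (both surfaces are checked against each other), which differs slightly from the one-sided formula above:

```python
import numpy as np
from scipy import ndimage

def nsd_sketch(pred, truth, tau=2.0, spacing=(1.0, 1.0, 1.0)):
    """Fraction of surface points within tolerance tau (mm) of the other surface."""
    def surface(mask):
        return mask & ~ndimage.binary_erosion(mask)

    sp, st = surface(pred.astype(bool)), surface(truth.astype(bool))
    dist_to_truth = ndimage.distance_transform_edt(~st, sampling=spacing)
    dist_to_pred = ndimage.distance_transform_edt(~sp, sampling=spacing)
    within_pred = (dist_to_truth[sp] <= tau).sum()   # pred surface near truth
    within_truth = (dist_to_pred[st] <= tau).sum()   # truth surface near pred
    return (within_pred + within_truth) / (sp.sum() + st.sum())
```

For challenge submissions, prefer the dedicated surface-distance package over a hand-rolled version like this one.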

When NSD and Dice Disagree

NSD can reveal problems that Dice hides and vice versa. A prediction with excellent overall overlap (high Dice) but poor boundary definition (blurry, imprecise edges) will have a high Dice but low NSD. Conversely, a prediction with razor-sharp boundaries but systematic under-segmentation (the predicted region is entirely contained within the ground truth but much smaller) will have decent NSD but poor Dice. Both perspectives are needed for a complete picture. Top brain metastases detection models achieve NSD of 0.99 alongside Dice of 0.89–0.90.

Sensitivity, Specificity & Precision

These familiar classification metrics are applied voxel-by-voxel in segmentation. Each voxel is treated as a binary classification: tumor or not-tumor.

Sensitivity (Recall) = TP / (TP + FN)

Of all the voxels that are truly tumor, what fraction did the model correctly identify? High sensitivity = few missed tumor voxels. Clinically critical — missing part of a tumor can lead to under-treatment. A model with low sensitivity is “leaving tumor behind.”

Precision (PPV) = TP / (TP + FP)

Of all the voxels the model predicted as tumor, what fraction are actually tumor? High precision = few false alarms. Clinically important because false positives can lead to unnecessary treatment of healthy tissue. The precision-recall trade-off is fundamental: you can increase sensitivity by predicting more voxels as tumor, but precision drops.

Specificity = TN / (TN + FP)

Of all the non-tumor voxels, what fraction did the model correctly identify as non-tumor? In brain tumor segmentation, specificity is almost always > 0.99 because the background (non-tumor brain) vastly outnumbers the tumor. This makes specificity a poor discriminator between algorithms — every model gets it right on the background. A 2009 study showed specificity incorporates image background properties that obscure true segmentation quality.

💡
For BraTS, focus on sensitivity and precision (or equivalently, Dice). Specificity is reported but is not informative for ranking algorithms because the class imbalance makes it artificially high for everyone. Dice is mathematically the harmonic mean of precision and recall, so it captures the precision-recall trade-off in a single number.
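All of these fall out of four confusion counts, and the harmonic-mean identity is easy to verify numerically. A minimal sketch:

```python
import numpy as np

def voxelwise_stats(pred, truth):
    """Sensitivity, precision, specificity, and Dice from raw confusion counts."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    tn = np.sum(~pred & ~truth)
    sens = tp / (tp + fn)
    prec = tp / (tp + fp)
    spec = tn / (tn + fp)
    dice = 2 * tp / (2 * tp + fp + fn)  # = harmonic mean of prec and sens
    return sens, prec, spec, dice

pred = np.array([1, 1, 1, 1, 0, 0])
truth = np.array([0, 1, 1, 1, 1, 0])
sens, prec, spec, dice = voxelwise_stats(pred, truth)
print(dice, 2 * prec * sens / (prec + sens))  # both 0.75
```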

Volumetric Metrics & RANO

In clinical practice, what matters most is often not the pixel-perfect boundary but the overall volume of the tumor. Volumetric metrics measure whether your segmentation gets the total tumor size right, which directly feeds into treatment response assessment.

Why Volume Matters: The RANO Connection

The RANO criteria (Week 1) classify tumors as responding, stable, or progressing based on size changes. Traditional RANO used 2D diameter measurements, but RANO 2.0 now includes volumetric assessment. The landmark Kickingereder et al. study (Lancet Oncology, 2019) showed that volumetric measurements are more reliable than 2D measurements because brain tumors grow in complex, non-spherical shapes. For brain metastases, a 30% unidimensional reduction corresponds to roughly 65% volumetric reduction, and volumetric changes of ≥20% are reproducible between readers.

A study of RANO-based assessment on lower-grade gliomas found poor-to-moderate inter-operator reproducibility (correlation r = 0.28–0.82, accuracy 21.0%) with traditional 2D measurements. Automated volumetric segmentation directly addresses this problem.

Key Volumetric Metrics

Absolute Volume Difference: |Vpred − Vtruth| in mm³ or mL. Simple and clinically interpretable.

Relative Volume Difference: (Vpred − Vtruth) / Vtruth. Positive = over-segmentation, negative = under-segmentation. A model with 0% relative volume difference and 0.85 Dice has the right total volume but imperfect spatial placement.

Volume Correlation: Correlation between predicted and true volumes across patients. High correlation (> 0.9) means the model tracks size changes reliably, even if individual predictions aren’t pixel-perfect.
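All three follow directly from voxel counts and voxel spacing. A minimal sketch (assumes spacing is given in mm; volume correlation would simply be `np.corrcoef` over per-patient volumes):

```python
import numpy as np

def volume_metrics(pred, truth, spacing=(1.0, 1.0, 1.0)):
    """Absolute (mL) and relative volume difference from binary masks."""
    voxel_ml = np.prod(spacing) / 1000.0  # mm^3 per voxel -> mL
    v_pred = pred.sum() * voxel_ml
    v_truth = truth.sum() * voxel_ml
    abs_diff = abs(v_pred - v_truth)
    rel_diff = (v_pred - v_truth) / v_truth  # + over-seg, - under-seg
    return v_pred, v_truth, abs_diff, rel_diff
```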

⚠️
Volume can be right while spatial accuracy is wrong. A model could predict the correct total tumor volume but in the wrong location — volumetric metrics alone don’t catch this. Conversely, Dice can be moderate while volume is accurate. This is why reporting both overlap metrics (Dice) and volumetric metrics gives the most complete picture.

How BraTS Evaluates Your Submission

Understanding exactly how BraTS computes your score is essential for competition strategy. Here’s the complete evaluation framework:

The Three Evaluation Regions

Your model predicts individual label classes (NCR, ED, ET), but BraTS evaluates on three nested regions that combine those classes. Every metric is computed separately for each region:

Region | Composed Of | Labels | Clinical Meaning | Typical Dice
Whole Tumor (WT) | NCR + ED + ET | {1, 2, 4} | Total disease extent | 0.90–0.93
Tumor Core (TC) | NCR + ET | {1, 4} | Solid tumor mass | 0.87–0.92
Enhancing Tumor (ET) | ET only | {4} | Active, growing tumor | 0.82–0.87

Metrics Per Region

For each of the three regions, BraTS computes: Dice Similarity Coefficient, HD95 (95th percentile Hausdorff Distance), and in recent iterations Normalized Surface Distance, plus sensitivity and specificity. That’s up to 15 numbers per patient.

How Rankings Work

Participants are ranked based on aggregate performance across all metrics and regions. However, a Nature Communications study demonstrated that rankings are not robust to changes in test data, ranking scheme, or annotators. Different weighting of metrics and regions can produce different winners from the same results. This means marginal improvements (<0.5% Dice) are often within noise.

💡
Competition strategy: Don’t optimize for a single metric. Focus on improving your worst region (usually ET) because aggregate rankings penalize inconsistency. A model with 0.90/0.88/0.85 across WT/TC/ET will typically rank higher than one with 0.93/0.90/0.78, even though their average Dice is nearly identical (0.877 vs 0.870).

Lesion-Wise Evaluation for Brain Metastases

For brain metastases (BraTS-METS), voxel-wise Dice isn’t enough. A patient might have 15 small metastases. A model could achieve a high patient-wise Dice by correctly segmenting the 5 largest lesions while completely missing 10 small ones. Clinically, those missed lesions could be catastrophic.

Lesion-wise metrics address this by evaluating each individual lesion separately:

Lesion-Wise Detection

Lesion-wise sensitivity: Of all true lesions, what fraction was detected? A systematic review of 42 studies reported pooled lesion-wise sensitivity of 87%. The best current models achieve 98% sensitivity internally and 97.4% externally, with sensitivity of 93.3% even for tiny lesions <3mm.

False positive rate per patient: How many “phantom” lesions does the model hallucinate? Top models achieve just 0.6 false positives per patient.

Lesion-wise Dice: Dice computed per individual lesion, then averaged. Pooled lesion-wise Dice across 42 studies was 79% — lower than patient-wise Dice because small lesions drag the average down.

Size-stratified evaluation is critical: models typically perform well on large metastases (>12mm: sensitivity 98%, FPR 0.3) but struggle with small ones. In one study, the model even detected 7 lesions that human readers had missed during manual delineation.
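Lesion matching is usually done with connected components. A simplified sketch where any voxel overlap counts as a detection — challenge toolkits use stricter matching criteria (e.g. minimum overlap thresholds), so treat this as the idea, not the official rule:

```python
import numpy as np
from scipy import ndimage

def lesionwise_detection(pred, truth):
    """Lesion-level sensitivity and false-positive count via connected components."""
    truth_lab, n_truth = ndimage.label(truth)
    pred_lab, n_pred = ndimage.label(pred)

    # A true lesion counts as detected if any predicted voxel overlaps it
    detected = sum(np.any(pred_lab[truth_lab == i])
                   for i in range(1, n_truth + 1))
    # A predicted component overlapping no true lesion is a phantom lesion
    false_pos = sum(not np.any(truth_lab[pred_lab == j])
                    for j in range(1, n_pred + 1))
    sensitivity = detected / n_truth if n_truth else 1.0
    return sensitivity, false_pos
```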

Metric Pitfalls That Trip Up Everyone

Pitfall 1: High Dice ≠ Clinical Utility

A critical review found that commonly used geometric indices like DSC are not well correlated with clinically meaningful endpoints. A study on stroke lesion segmentation showed that nnU-Net achieved excellent segmentation metrics but failed to detect therapy-induced volume reductions, leading to false-negative study outcomes. Two segmentations with similar Dice can produce significantly different medical treatment results depending on whether they over-contour or under-contour the tumor.

Pitfall 2: Averaging Across Patients

Reporting a single “mean Dice” across all patients hides critical variation. A patient with a 100mL tumor and a patient with a 2mL tumor contribute equally to the mean, but their clinical significance and segmentation difficulty are vastly different. Always report per-region metrics and consider median (more robust to outliers) alongside mean. Better yet, report confidence intervals — a 2025 study showed you need 100–200 test samples for a reliable 1% confidence interval.

Pitfall 3: Dice is Overconfident About Large Regions

Dice has an intrinsic region-size bias. For large structures (whole tumor), even a sloppy prediction achieves a respectable Dice because the large overlapping volume dominates. For small structures (enhancing tumor), a small boundary error causes a large Dice drop. This is why WT Dice is always highest and ET Dice is always lowest — it’s partly the metric, not just the model.

Pitfall 4: Calibration Matters for Deployment

If you’re deploying a model clinically, you need to trust its confidence scores, not just its binary predictions. A study showed that models trained with Dice loss produce overconfident, poorly calibrated predictions. A model says “90% confident this is tumor” but is only right 70% of the time. Model ensembling (what nnU-Net does) improves calibration. For serious clinical use, calibration metrics (Expected Calibration Error) should be tracked alongside Dice and HD95.
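For the binary voxel-wise case, ECE can be sketched as follows. This bins foreground probabilities and compares them to the observed tumor fraction; multi-class ECE instead uses the max-probability confidence, so this is one variant, not the only definition:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary voxel-wise ECE: occupancy-weighted gap between mean predicted
    probability and observed tumor fraction in each confidence bin."""
    probs, labels = probs.ravel(), labels.ravel().astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs > lo) & (probs <= hi)
        if not in_bin.any():
            continue
        gap = abs(probs[in_bin].mean() - labels[in_bin].mean())
        ece += in_bin.mean() * gap
    return ece

probs = np.array([0.8] * 10)           # model says 80% everywhere
labels = np.array([1] * 8 + [0] * 2)   # and is right 80% of the time
print(expected_calibration_error(probs, labels))  # ~0.0: well calibrated
```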

Pitfall 5: Different Implementations Give Different Results

The “same” metric can produce different numbers depending on the implementation. Hausdorff Distance is particularly sensitive: choices about connectivity (6-connected vs 26-connected in 3D), voxel spacing handling, and edge cases (what happens when a class is absent) all affect the result. A 2015 paper proposed an efficient algorithm that outperforms the standard ITK implementation. Use the official BraTS evaluation toolkit for challenge submissions to ensure your numbers match the leaderboard’s.

Statistical Comparison of Models

Saying “Model A got 0.88 Dice and Model B got 0.87 Dice, so Model A is better” is not science. You need to determine whether the difference is statistically significant and practically meaningful.

Recommended Statistical Approach

Paired comparisons: Use the Wilcoxon signed-rank test (non-parametric, doesn’t assume normality, appropriate for Dice scores). Both models must be evaluated on the same test cases.

Confidence intervals: Bootstrap confidence intervals are reliable without distributional assumptions. A 2025 study showed that parametric CIs are reasonable approximations of bootstrap estimates for segmentation metrics.

Multiple comparisons: If comparing more than two models, apply Bonferroni or Holm correction to control the family-wise error rate. Comparing 10 models at p=0.05 without correction means you’ll find ~1 false positive by chance.

Effect size: Report effect sizes alongside p-values. A statistically significant difference of 0.002 Dice points may not be clinically meaningful.

Sample size: A statistical power calculation for segmentation showed that the predicted sample size was accurate within 4 subjects of Monte Carlo estimates. For a 1% wide CI, plan for 100–200 test cases when variance is low.
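The bootstrap CI from the list above can be sketched in a few lines (percentile bootstrap on the mean; the per-patient scores below are invented for illustration, and BCa intervals are a common refinement):

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-patient scores."""
    rng = np.random.default_rng(seed)
    means = np.array([rng.choice(scores, size=len(scores), replace=True).mean()
                      for _ in range(n_boot)])
    return (np.percentile(means, 100 * alpha / 2),
            np.percentile(means, 100 * (1 - alpha / 2)))

# Hypothetical per-patient Dice scores for one model
scores = np.array([0.82, 0.91, 0.88, 0.79, 0.93, 0.85, 0.90, 0.87])
lo, hi = bootstrap_ci(scores, n_boot=2000)
print(f"mean = {scores.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```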

Computing Metrics in Python

import numpy as np
from medpy.metric.binary import dc, hd95, sensitivity, specificity
import nibabel as nib

# Load prediction and ground truth
pred = nib.load('prediction.nii.gz').get_fdata()
gt = nib.load('ground_truth.nii.gz').get_fdata()
spacing = nib.load('prediction.nii.gz').header.get_zooms()

# Compute BraTS regions
regions = {
  'WT': (lambda x: np.isin(x, [1, 2, 4])),
  'TC': (lambda x: np.isin(x, [1, 4])),
  'ET': (lambda x: x == 4),
}

for name, region_fn in regions.items():
  p = region_fn(pred).astype(np.uint8)
  g = region_fn(gt).astype(np.uint8)

  if g.sum() == 0 and p.sum() == 0:
    print(f"{name}: Both empty (Dice=1.0, HD95=0.0)")
    continue
  if g.sum() == 0 or p.sum() == 0:
    print(f"{name}: One side empty (Dice=0.0)")
    continue

  dice = dc(p, g)
  hausdorff = hd95(p, g, voxelspacing=spacing)
  sens = sensitivity(p, g)
  spec = specificity(p, g)

  print(f"{name}: Dice={dice:.4f}, HD95={hausdorff:.2f}mm, "
        f"Sens={sens:.4f}, Spec={spec:.4f}")
Python Packages for Metrics

medpy: pip install medpy — Provides Dice, Jaccard, HD, HD95, ASSD, sensitivity, specificity. The most commonly used package.

surface-distance (DeepMind): pip install surface-distance — Computes surface distances, NSD, and surface Dice. More accurate for boundary metrics.

miseval: pip install miseval — Standardized metric library following the Müller et al. guideline. Designed for reproducible evaluation.

SimpleITK: Built-in OverlapMeasuresImageFilter for Dice, Jaccard, volume similarity, false positive/negative rates.

Metric Summary Table

Metric | Measures | Range | Strengths | Limitations
Dice (DSC) | Volumetric overlap | 0–1 (↑) | Simple, F1-equivalent, primary BraTS metric | Size-biased, non-linear with acceptability
HD95 | Worst-case boundary error (95th %ile) | 0–∞ mm (↓) | Catches boundary errors Dice misses, robust to outliers | Ignores interior accuracy
NSD@τ | Surface accuracy within tolerance | 0–1 (↑) | Clinically interpretable tolerance | Requires τ selection
Sensitivity | True positive rate | 0–1 (↑) | Captures missed tumor | Ignores false positives
Precision | Positive predictive value | 0–1 (↑) | Captures false alarms | Ignores missed tumor
Specificity | True negative rate | 0–1 (↑) | Standard classification metric | Inflated by large background; poor discriminator
Volume Diff | Total volume error | 0–∞ (↓) | Directly clinically relevant (RANO) | Insensitive to spatial errors
Lesion Sensitivity | Per-lesion detection rate | 0–1 (↑) | Essential for metastases | Requires lesion matching

This Week’s Learning Resources

Essential Reading

The definitive reference for metric selection. Domain-agnostic taxonomy of pitfalls created by a multidisciplinary Delphi process. Includes a decision tree for choosing appropriate metrics based on task type. Read the “pitfalls” sections at minimum.
Nat Methods. 2024;21:195–212
Practical guideline: report DSC + at least one distance metric (HD95 or ASSD), report per-class metrics, use appropriate statistical tests, address class imbalance. Your checklist for reporting results.
BMC Res Notes. 2022;15:210
Demonstrates that algorithm rankings are not robust to test data, ranking scheme, or annotators. Essential context for understanding challenge leaderboards.
Nat Commun. 2018;9:5217

Metric Deep Dives

Proves metric-sensitive losses (soft Dice, soft Jaccard) outperform cross-entropy when evaluating with Dice. The theoretical foundation for understanding loss-metric alignment.
IEEE Trans Med Imaging. 2020;39(11):3679–3690
Theoretical analysis of Dice’s intrinsic bias toward specific region sizes. Explains why Dice works well for imbalanced segmentation but struggles with diverse class proportions.
Med Image Anal. 2024;91:103015
Three methods to optimize HD as a loss function, achieving 18–45% HD reduction. Read this when you want to improve boundary accuracy specifically.
IEEE Trans Med Imaging. 2020;39(2):499–513
How many test cases you need for reliable performance estimates. 100–200 for a 1% CI when variance is low; 1000+ for difficult tasks. Essential for planning validation studies.
Med Image Anal. 2025;103:103565

Tools

The most commonly used package for computing Dice, HD95, ASSD, sensitivity, specificity. pip install medpy. Start here for your evaluation pipeline.
Computes surface distances, NSD, and surface Dice. More precise than medpy for boundary-focused metrics. Use for NSD computation.
Open-source Python package implementing the Müller et al. evaluation guideline. Designed for reproducible, standardized evaluation. pip install miseval.

Advanced Topics

Shows Dice loss produces overconfident predictions. Model ensembling improves calibration. Critical for clinical deployment where confidence scores matter.
Meta-analysis of 42 studies: pooled lesion-wise Dice 79%, patient-wise sensitivity 86%, lesion-wise sensitivity 87%. The reference for understanding brain metastasis evaluation.
Shows that common geometric indices are not well correlated with clinical endpoints. Argues for multi-domain evaluation including dosimetric metrics and physician assessment.