You’ve trained your first models. Now you need to know if they’re actually good — and “good” is more complicated than a single number. This week you’ll learn every metric used to evaluate brain tumor segmentation, what each one rewards and penalizes, and the pitfalls that make experienced researchers misinterpret their own results.
Imagine two brain tumor segmentation models. Model A achieves a Dice score of 0.88. Model B achieves 0.87. Model A is better, right? Not necessarily. Model A might have excellent overlap with the tumor but terrible boundary accuracy — its predictions are blobby and imprecise. Model B might have slightly less overlap but much sharper, more clinically useful boundaries. A radiation oncologist planning treatment would prefer Model B every time.
This is why the BraTS challenge uses multiple metrics, and why understanding what each one measures is essential. A landmark 2024 Nature Methods paper by the Metrics Reloaded consortium documented dozens of pitfalls in how researchers choose and interpret metrics, and a 2018 Nature Communications study showed that different metrics produce different challenge winners — meaning the “best” algorithm depends entirely on how you define “best.”
Dice is the most widely used segmentation metric and the primary ranking metric in BraTS. It measures the overlap between your predicted segmentation and the ground truth.
Dice is defined as 2|A∩B| / (|A| + |B|): twice the overlap between prediction A and ground truth B, divided by their combined volume. For BraTS, state-of-the-art models achieve roughly: WT: 0.90–0.93 (whole tumor is large and well-defined), TC: 0.87–0.92 (tumor core is medium-sized), ET: 0.82–0.87 (enhancing tumor is smallest and hardest). Human inter-rater agreement is 0.74–0.85, so any model above ~0.85 is performing at or above expert-level agreement.
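As a concrete sketch (not the official BraTS implementation), hard Dice on binary masks is a few lines of NumPy:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Hard Dice: 2|A∩B| / (|A| + |B|) on binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:                      # both empty: conventionally perfect
        return 1.0
    return 2.0 * np.logical_and(pred, gt).sum() / denom

# Toy 1D example: 4 predicted voxels, 5 true voxels, 3 overlapping
pred = np.array([1, 1, 1, 1, 0, 0, 0, 0])
gt   = np.array([0, 1, 1, 1, 1, 1, 0, 0])
print(dice(pred, gt))  # 2*3 / (4+5) ≈ 0.667
```

The same function works unchanged on 3D volumes, since NumPy reductions ignore dimensionality.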
During training, soft Dice operates on continuous probability maps (0.0–1.0) so it’s differentiable and can be used as a loss function. During evaluation, hard Dice is computed on binarized predictions (each voxel is either 0 or 1). This distinction matters: a model optimized with soft Dice loss directly optimizes for the evaluation metric, which is why Dice-based losses outperform cross-entropy when the evaluation metric is Dice.
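The soft/hard distinction fits in a toy NumPy sketch — soft Dice on the raw probability map, hard Dice on the thresholded mask (the epsilon term is a common smoothing convention, not mandated by BraTS):

```python
import numpy as np

def soft_dice(prob: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """Soft Dice on a continuous probability map — differentiable,
    so (1 - soft_dice) can serve as a training loss."""
    inter = (prob * gt).sum()
    return (2.0 * inter + eps) / (prob.sum() + gt.sum() + eps)

gt = np.array([0.0, 1.0, 1.0, 0.0])
prob = np.array([0.1, 0.9, 0.6, 0.2])   # model probabilities per voxel

hard = (prob > 0.5).astype(float)       # binarize for evaluation
hard_dice = 2 * (hard * gt).sum() / (hard.sum() + gt.sum())

print(f"soft Dice = {soft_dice(prob, gt):.3f}, hard Dice = {hard_dice:.3f}")
```

Note that the hard Dice here is 1.0 while the soft Dice is lower: thresholding hides how uncertain the underlying probabilities were.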
While Dice measures overall volume overlap, the Hausdorff Distance measures the worst-case boundary error — how far the predicted boundary deviates from the true boundary at its worst point. It catches errors that Dice misses: a prediction that is mostly correct but has one wildly wrong region.
The standard Hausdorff Distance is extremely sensitive to single outlier predictions — one misclassified voxel 50mm away from the tumor boundary will produce an HD of 50mm, even if 99.99% of the prediction is perfect. This is why BraTS uses HD95 — the 95th percentile of surface distances, which is robust to isolated outliers while still capturing systematic boundary errors.
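To make the definition concrete, here is a minimal HD95 sketch using SciPy distance transforms (use the official BraTS toolkit for actual submissions — implementations differ, as discussed later):

```python
import numpy as np
from scipy import ndimage

def hd95(pred, gt, spacing=(1.0, 1.0)):
    """95th-percentile symmetric surface distance — a minimal sketch.

    Surface voxels are the mask minus its erosion; distances run from
    every surface voxel of one mask to the nearest surface voxel of the
    other, pooled from both directions, summarized at the 95th percentile."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    ps = pred & ~ndimage.binary_erosion(pred)
    gs = gt & ~ndimage.binary_erosion(gt)
    dist_to_gs = ndimage.distance_transform_edt(~gs, sampling=spacing)
    dist_to_ps = ndimage.distance_transform_edt(~ps, sampling=spacing)
    all_dists = np.concatenate([dist_to_gs[ps], dist_to_ps[gs]])
    return float(np.percentile(all_dists, 95))

gt = np.zeros((30, 30), dtype=bool); gt[5:15, 5:15] = True
pred = np.roll(gt, 2, axis=0)          # same square, shifted 2 voxels
print(hd95(pred, gt))                  # bounded by the 2-voxel shift
```

Passing the real voxel spacing via `sampling` is what turns voxel counts into millimeters.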
HD95 < 5mm: Excellent. 95% of the boundary is within 5mm of truth. Generally clinically acceptable for treatment planning. Top BraTS models achieve this for WT and TC.
HD95 = 5–15mm: Moderate. Noticeable boundary errors that could affect surgical margins or radiation target volumes. Typical for ET (smaller, more irregular).
HD95 > 20mm: Poor. Substantial boundary errors that would likely impact clinical decisions. Indicates the model is failing in certain regions.
A study proposed three methods to directly optimize Hausdorff Distance as a loss function, achieving 18–45% HD reduction without degrading Dice. This matters because Dice and HD are not strongly correlated — improving one doesn’t automatically improve the other.
Normalized Surface Distance (NSD) is a newer metric adopted by recent BraTS challenges that answers a more clinically meaningful question: what fraction of the predicted surface is within an acceptable tolerance of the true surface?
The tolerance τ is the key parameter. Setting τ = 2mm means “we don’t care about boundary errors smaller than 2mm.” This is clinically motivated — a 1mm boundary error is within the range of human inter-rater variability and has no practical impact on treatment planning. NSD essentially filters out noise in the boundary assessment.
NSD can reveal problems that Dice hides and vice versa. A prediction with excellent overall overlap (high Dice) but poor boundary definition (blurry, imprecise edges) will have a high Dice but low NSD. Conversely, a prediction with razor-sharp boundaries but systematic under-segmentation (the predicted region is entirely contained within the ground truth but much smaller) will have decent NSD but poor Dice. Both perspectives are needed for a complete picture. Top brain metastases detection models achieve NSD of 0.99 alongside Dice of 0.89–0.90.
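A minimal NSD sketch using SciPy distance transforms (a simplification of the official surface-distance implementations) shows the tolerance in action — a 1-voxel shift vanishes entirely under τ = 2mm:

```python
import numpy as np
from scipy import ndimage

def nsd(pred, gt, tau=2.0, spacing=(1.0, 1.0)):
    """Normalized Surface Distance — a minimal sketch.

    Fraction of surface voxels (pooled from both masks) lying within
    tolerance tau of the other mask's surface."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    ps = pred & ~ndimage.binary_erosion(pred)
    gs = gt & ~ndimage.binary_erosion(gt)
    d_to_gs = ndimage.distance_transform_edt(~gs, sampling=spacing)
    d_to_ps = ndimage.distance_transform_edt(~ps, sampling=spacing)
    within = (d_to_gs[ps] <= tau).sum() + (d_to_ps[gs] <= tau).sum()
    return float(within) / (ps.sum() + gs.sum())

gt = np.zeros((30, 30), dtype=bool); gt[5:15, 5:15] = True
pred = np.roll(gt, 1, axis=0)          # 1-voxel shift: below tolerance
print(nsd(pred, gt, tau=2.0))          # 1.0 — the shift is "filtered out"
```

Shrink τ toward zero and the same prediction is penalized again, which is exactly the knob the tolerance provides.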
These familiar classification metrics are applied voxel-by-voxel in segmentation. Each voxel is treated as a binary classification: tumor or not-tumor.
Of all the voxels that are truly tumor, what fraction did the model correctly identify? High sensitivity = few missed tumor voxels. Clinically critical — missing part of a tumor can lead to under-treatment. A model with low sensitivity is “leaving tumor behind.”
Of all the voxels the model predicted as tumor, what fraction are actually tumor? High precision = few false alarms. Clinically important because false positives can lead to unnecessary treatment of healthy tissue. The precision-recall trade-off is fundamental: you can increase sensitivity by predicting more voxels as tumor, but precision drops.
Of all the non-tumor voxels, what fraction did the model correctly identify as non-tumor? In brain tumor segmentation, specificity is almost always > 0.99 because the background (non-tumor brain) vastly outnumbers the tumor. This makes specificity a poor discriminator between algorithms — every model gets it right on the background. A 2009 study showed specificity incorporates image background properties that obscure true segmentation quality.
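A toy NumPy example makes the background-inflation problem concrete — the prediction is visibly shifted off the tumor, yet specificity still exceeds 0.99 because the background dominates:

```python
import numpy as np

def confusion_metrics(pred, gt):
    """Voxel-wise sensitivity, precision, specificity from binary masks."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    return tp / (tp + fn), tp / (tp + fp), tn / (tn + fp)

# Tiny "image" where background vastly outnumbers tumor, as in brain MRI
gt = np.zeros((50, 50), dtype=bool); gt[20:25, 20:25] = True   # 25 tumor voxels
pred = np.zeros_like(gt); pred[20:25, 21:27] = True            # shifted, oversized

sens, prec, spec = confusion_metrics(pred, gt)
print(f"sensitivity={sens:.3f}, precision={prec:.3f}, specificity={spec:.4f}")
```

Despite a clearly imperfect prediction (sensitivity 0.8, precision ~0.67), specificity lands above 0.99 — which is why it barely discriminates between models.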
In clinical practice, what matters most is often not the pixel-perfect boundary but the overall volume of the tumor. Volumetric metrics measure whether your segmentation gets the total tumor size right, which directly feeds into treatment response assessment.
The RANO criteria (Week 1) classify tumors as responding, stable, or progressing based on size changes. Traditional RANO used 2D diameter measurements, but RANO 2.0 now includes volumetric assessment. The landmark Kickingereder et al. study (Lancet Oncology, 2019) showed that volumetric measurements are more reliable than 2D measurements because brain tumors grow in complex, non-spherical shapes. For brain metastases, a 30% unidimensional reduction corresponds to roughly 65% volumetric reduction, and volumetric changes of ≥20% are reproducible between readers.
A study of RANO-based assessment on lower-grade gliomas found poor-to-moderate inter-operator reproducibility (correlation r = 0.28–0.82, accuracy 21.0%) with traditional 2D measurements. Automated volumetric segmentation directly addresses this problem.
Absolute Volume Difference: |Vpred − Vtruth| in mm³ or mL. Simple and clinically interpretable.
Relative Volume Difference: (Vpred − Vtruth) / Vtruth. Positive = over-segmentation, negative = under-segmentation. A model with 0% relative volume difference and 0.85 Dice has the right total volume but imperfect spatial placement.
Volume Correlation: Correlation between predicted and true volumes across patients. High correlation (> 0.9) means the model tracks size changes reliably, even if individual predictions aren’t pixel-perfect.
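The three volumetric metrics above are one-liners once volumes are extracted from the masks; the per-patient volumes below are illustrative numbers, not from any real study:

```python
import numpy as np

# Volume from a binary mask: voxel count x voxel volume (mm -> mm^3 -> mL)
spacing = (1.0, 1.0, 1.0)                           # assumed isotropic 1mm voxels
mask = np.ones((10, 10, 10), dtype=bool)            # toy mask: 1000 voxels
volume_ml = mask.sum() * np.prod(spacing) / 1000.0  # 1000 mm^3 = 1 mL

# Hypothetical per-patient volumes in mL (illustrative only)
v_true = np.array([100.0, 42.0, 7.5, 2.0])
v_pred = np.array([95.0, 45.0, 6.0, 2.6])

abs_diff = np.abs(v_pred - v_true)              # absolute volume difference, mL
rel_diff = (v_pred - v_true) / v_true           # positive = over-segmentation
corr = np.corrcoef(v_pred, v_true)[0, 1]        # volume correlation across patients
print(abs_diff, rel_diff, corr)
```

Note how the 2mL tumor with a 0.6mL error has a far larger relative difference (+30%) than the 100mL tumor with a 5mL error (−5%) — relative difference surfaces exactly the small-lesion problem that absolute difference hides.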
Understanding exactly how BraTS computes your score is essential for competition strategy. Here’s the complete evaluation framework:
Your model predicts individual label classes (NCR, ED, ET), but BraTS evaluates on three nested regions that combine those classes. Every metric is computed separately for each region:
| Region | Composed Of | Labels | Clinical Meaning | Typical Dice |
|---|---|---|---|---|
| Whole Tumor (WT) | NCR + ED + ET | {1, 2, 4} | Total disease extent | 0.90–0.93 |
| Tumor Core (TC) | NCR + ET | {1, 4} | Solid tumor mass | 0.87–0.92 |
| Enhancing Tumor (ET) | ET only | {4} | Active, growing tumor | 0.82–0.87 |
For each of the three regions, BraTS computes: Dice Similarity Coefficient, HD95 (95th percentile Hausdorff Distance), and in recent iterations Normalized Surface Distance, plus sensitivity and specificity. That’s up to 15 numbers per patient.
Participants are ranked based on aggregate performance across all metrics and regions. However, a Nature Communications study demonstrated that rankings are not robust to changes in test data, ranking scheme, or annotators. Different weighting of metrics and regions can produce different winners from the same results. This means marginal improvements (<0.5% Dice) are often within noise.
For brain metastases (BraTS-METS), voxel-wise Dice isn’t enough. A patient might have 15 small metastases. A model could achieve a high patient-wise Dice by correctly segmenting the 5 largest lesions while completely missing 10 small ones. Clinically, those missed lesions could be catastrophic.
Lesion-wise metrics address this by evaluating each individual lesion separately:
Lesion-wise sensitivity: Of all true lesions, what fraction was detected? A systematic review of 42 studies reported pooled lesion-wise sensitivity of 87%. The best current models achieve 98% sensitivity internally and 97.4% externally, with sensitivity of 93.3% even for tiny lesions <3mm.
False positive rate per patient: How many “phantom” lesions does the model hallucinate? Top models achieve just 0.6 false positives per patient.
Lesion-wise Dice: Dice computed per individual lesion, then averaged. Pooled lesion-wise Dice across 42 studies was 79% — lower than patient-wise Dice because small lesions drag the average down.
Size-stratified evaluation is critical: models typically perform well on large metastases (>12mm: sensitivity 98%, FPR 0.3) but struggle with small ones. In one study, the model even detected 7 lesions that human readers had missed during manual delineation.
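Lesion-wise detection can be sketched with connected-component labeling — the any-overlap matching rule used here is a simplification of the official lesion-matching schemes:

```python
import numpy as np
from scipy import ndimage

def lesion_wise_stats(pred, gt):
    """Lesion-wise sensitivity and false-positive count — a sketch.

    A true lesion counts as detected if any predicted voxel overlaps it;
    a predicted component with no ground-truth overlap is a false positive."""
    gt_lab, n_gt = ndimage.label(gt)
    pr_lab, n_pr = ndimage.label(pred)
    detected = sum(1 for i in range(1, n_gt + 1) if pred[gt_lab == i].any())
    false_pos = sum(1 for j in range(1, n_pr + 1) if not gt[pr_lab == j].any())
    return (detected / n_gt if n_gt else 1.0), false_pos

gt = np.zeros((20, 20), dtype=bool)
gt[1:3, 1:3] = gt[8:10, 8:10] = gt[15:17, 15:17] = True   # three true lesions
pred = np.zeros_like(gt)
pred[1:3, 1:3] = pred[8:10, 8:10] = True                  # detects two of them
pred[18:20, 1:3] = True                                   # one phantom lesion

print(lesion_wise_stats(pred, gt))  # (0.667, 1)
```

This patient would score a respectable voxel-wise Dice if the detected lesions dominate the volume, while lesion-wise sensitivity correctly reports that a third of the lesions were missed.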
A critical review found that commonly used geometric indices like DSC are not well correlated with clinically meaningful endpoints. A study on stroke lesion segmentation showed that nnU-Net achieved excellent segmentation metrics but failed to detect therapy-induced volume reductions, leading to false-negative study outcomes. Two segmentations with similar Dice can produce significantly different medical treatment results depending on whether they over-contour or under-contour the tumor.
Reporting a single “mean Dice” across all patients hides critical variation. A patient with a 100mL tumor and a patient with a 2mL tumor contribute equally to the mean, but their clinical significance and segmentation difficulty are vastly different. Always report per-region metrics and consider median (more robust to outliers) alongside mean. Better yet, report confidence intervals — a 2025 study showed you need 100–200 test samples for a confidence interval just 1% wide.
Dice has an intrinsic region-size bias. For large structures (whole tumor), even a sloppy prediction achieves a respectable Dice because the large overlapping volume dominates. For small structures (enhancing tumor), a small boundary error causes a large Dice drop. This is why WT Dice is always highest and ET Dice is always lowest — it’s partly the metric, not just the model.
If you’re deploying a model clinically, you need to trust its confidence scores, not just its binary predictions. A study showed that models trained with Dice loss produce overconfident, poorly calibrated predictions. A model says “90% confident this is tumor” but is only right 70% of the time. Model ensembling (what nnU-Net does) improves calibration. For serious clinical use, calibration metrics (Expected Calibration Error) should be tracked alongside Dice and HD95.
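A minimal sketch of Expected Calibration Error on the positive (“tumor”) class — a simplification; production implementations typically bin on the predicted class's confidence rather than the raw positive probability:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE sketch: bin voxels by predicted tumor probability, compare mean
    confidence with observed tumor frequency per bin, weight by bin size.
    (Probabilities of exactly 0.0 fall outside the first half-open bin.)"""
    probs, labels = np.asarray(probs), np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs > lo) & (probs <= hi)
        if in_bin.sum() == 0:
            continue
        conf = probs[in_bin].mean()     # average stated confidence
        acc = labels[in_bin].mean()     # observed tumor frequency
        ece += in_bin.mean() * abs(conf - acc)
    return ece

# Perfectly calibrated toy case: confidence 0.7, observed frequency 0.7
probs = np.full(1000, 0.7)
labels = np.concatenate([np.ones(700), np.zeros(300)])
print(expected_calibration_error(probs, labels))   # ≈ 0
```

The overconfident model described above (“90% confident, right 70% of the time”) would instead contribute |0.9 − 0.7| = 0.2 from that bin.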
The “same” metric can produce different numbers depending on the implementation. Hausdorff Distance is particularly sensitive: choices about connectivity (6-connected vs 26-connected in 3D), voxel spacing handling, and edge cases (what happens when a class is absent) all affect the result. A 2015 paper proposed an efficient algorithm that outperforms the standard ITK implementation. Use the official BraTS evaluation toolkit for challenge submissions to ensure your numbers match the leaderboard’s.
Saying “Model A got 0.88 Dice and Model B got 0.87 Dice, so Model A is better” is not science. You need to determine whether the difference is statistically significant and practically meaningful.
Paired comparisons: Use the Wilcoxon signed-rank test (non-parametric, doesn’t assume normality, appropriate for Dice scores). Both models must be evaluated on the same test cases.
Confidence intervals: Bootstrap confidence intervals are reliable without distributional assumptions. A 2025 study showed that parametric CIs are reasonable approximations of bootstrap estimates for segmentation metrics.
Multiple comparisons: If comparing more than two models, apply Bonferroni or Holm correction to control the family-wise error rate. Ten models yield 45 pairwise comparisons; at p=0.05 without correction, you’d expect roughly 2 “significant” results by chance alone.
Effect size: Report effect sizes alongside p-values. A statistically significant difference of 0.002 Dice points may not be clinically meaningful.
Sample size: A statistical power calculation for segmentation showed that the predicted sample size was accurate within 4 subjects of Monte Carlo estimates. For a 1% wide CI, plan for 100–200 test cases when variance is low.
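The recipe above can be sketched with SciPy — the paired Dice scores here are synthetic, for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical paired Dice scores for two models on the same 50 test cases
dice_a = rng.normal(0.88, 0.04, 50).clip(0, 1)
dice_b = (dice_a - rng.normal(0.01, 0.02, 50)).clip(0, 1)   # B slightly worse

# Paired, non-parametric comparison (same cases for both models)
stat, p = stats.wilcoxon(dice_a, dice_b)
print(f"Wilcoxon p = {p:.4f}")

# Bootstrap 95% CI for the mean paired difference (no normality assumed)
diffs = dice_a - dice_b
boots = np.array([rng.choice(diffs, size=len(diffs), replace=True).mean()
                  for _ in range(5000)])
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"mean dDice = {diffs.mean():.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

Report the effect size (the mean difference and its CI) alongside the p-value: a tiny but consistent difference can be statistically significant yet clinically irrelevant.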
```python
import numpy as np
import nibabel as nib
from medpy.metric.binary import dc, hd95, sensitivity

# Load prediction and ground truth
pred = nib.load('prediction.nii.gz').get_fdata()
gt = nib.load('ground_truth.nii.gz').get_fdata()
spacing = nib.load('prediction.nii.gz').header.get_zooms()

# Compose the three nested BraTS regions from the raw labels
regions = {
    'WT': lambda x: np.isin(x, [1, 2, 4]),   # whole tumor
    'TC': lambda x: np.isin(x, [1, 4]),      # tumor core
    'ET': lambda x: x == 4,                  # enhancing tumor
}

for name, region_fn in regions.items():
    p = region_fn(pred).astype(np.uint8)
    g = region_fn(gt).astype(np.uint8)
    # Edge cases: distance metrics are undefined when a mask is empty
    if g.sum() == 0 and p.sum() == 0:
        print(f"{name}: Both empty (Dice=1.0, HD95=0.0)")
        continue
    if g.sum() == 0 or p.sum() == 0:
        print(f"{name}: One side empty (Dice=0.0)")
        continue
    dice = dc(p, g)
    hausdorff = hd95(p, g, voxelspacing=spacing)
    sens = sensitivity(p, g)
    print(f"{name}: Dice={dice:.4f}, HD95={hausdorff:.2f}mm, Sens={sens:.4f}")
```
medpy: pip install medpy — Provides Dice, Jaccard, HD, HD95, ASSD, sensitivity, specificity. The most commonly used package.
surface-distance (DeepMind): pip install surface-distance — Computes surface distances, NSD, and surface Dice. More accurate for boundary metrics.
miseval: pip install miseval — Standardized metric library following the Müller et al. guideline. Designed for reproducible evaluation.
SimpleITK: Built-in OverlapMeasuresImageFilter for Dice, Jaccard, volume similarity, false positive/negative rates.
| Metric | Measures | Range | Strengths | Limitations |
|---|---|---|---|---|
| Dice (DSC) | Volumetric overlap | 0–1 (↑) | Simple, F1-equivalent, primary BraTS metric | Size-biased; nonlinear relation to clinical acceptability |
| HD95 | Worst-case boundary error (95th %ile) | 0–∞ mm (↓) | Catches boundary errors Dice misses, robust to outliers | Ignores interior accuracy |
| NSD@τ | Surface accuracy within tolerance | 0–1 (↑) | Clinically interpretable tolerance | Requires τ selection |
| Sensitivity | True positive rate | 0–1 (↑) | Captures missed tumor | Ignores false positives |
| Precision | Positive predictive value | 0–1 (↑) | Captures false alarms | Ignores missed tumor |
| Specificity | True negative rate | 0–1 (↑) | Standard classification metric | Inflated by large background; poor discriminator |
| Volume Diff | Total volume error | 0–∞ (↓) | Directly clinically relevant (RANO) | Insensitive to spatial errors |
| Lesion Sensitivity | Per-lesion detection rate | 0–1 (↑) | Essential for metastases | Requires lesion matching |
Start here: `pip install medpy` for your evaluation pipeline, then add `pip install miseval` for standardized, reproducible reporting.