You’ve seen the tools (Week 3). Now it’s time to understand what’s actually happening inside them. This week builds your deep learning foundations — from how a single convolution works to how nnU-Net assembles an entire self-configuring segmentation pipeline.
A convolutional neural network (CNN) is a type of neural network specifically designed to process grid-structured data — images. Instead of connecting every input pixel to every neuron (which would require billions of connections for a brain MRI), CNNs use a clever trick: small, sliding filters that scan across the image to detect patterns. These filters share their weights everywhere they look, dramatically reducing the number of parameters while making the network naturally good at recognizing patterns regardless of where they appear.
Think of it like this: instead of memorizing the entire image, the CNN learns a set of “pattern detectors.” Early layers learn simple patterns (edges, bright spots, dark lines), and deeper layers combine those simple patterns into complex ones (a ring of enhancement around a necrotic core, the fuzzy boundary of edema). This hierarchical feature extraction is what makes CNNs so powerful for medical imaging.
The heart of a CNN. A small filter (typically 3×3 pixels in 2D, or 3×3×3 voxels in 3D) slides across the image, computing a dot product at each position. Each filter learns to detect a specific pattern. The output — called a feature map — highlights where that pattern appears in the image. A layer may have 16, 32, 64, or more filters running in parallel, each detecting different features.
In 3D medical imaging, convolutions operate on volumes, meaning a 3×3×3 filter examines a small cube of voxels at a time. This captures spatial relationships in all three dimensions — critical for brain MRI where tumors are 3D structures.
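To make this concrete, here is a minimal PyTorch sketch of a single 3D convolution layer. The 4 input channels mirror the four BraTS MRI sequences; the 32³ toy volume and filter count are illustrative, not defaults from any specific pipeline.

```python
import torch
import torch.nn as nn

# A single 3D convolution layer: 4 input channels (e.g. the four MRI
# sequences), 16 learned filters, each a 3x3x3 cube of weights.
conv = nn.Conv3d(in_channels=4, out_channels=16, kernel_size=3, padding=1)

# A toy "volume": batch of 1, 4 channels, 32x32x32 voxels.
x = torch.randn(1, 4, 32, 32, 32)
feature_maps = conv(x)  # one feature map per filter

print(feature_maps.shape)  # torch.Size([1, 16, 32, 32, 32])
```

Note that the output has one feature map per filter, at the same spatial resolution as the input (thanks to `padding=1`).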
After each convolution, an activation function introduces non-linearity — without it, stacking multiple convolution layers would be mathematically equivalent to a single layer, and the network couldn’t learn complex patterns. ReLU (Rectified Linear Unit) is the most common: it simply sets all negative values to zero. LeakyReLU (used by nnU-Net) lets a small fraction of negative values through, which prevents “dead neurons” — neurons that stop learning because they always output zero.
These downsample the feature maps, reducing spatial resolution by a factor of 2 (typically). Max pooling keeps the highest value in each small region; average pooling takes the mean. Downsampling serves two purposes: it reduces computation, and it increases the receptive field — how much of the original image each neuron “sees.” This is how deeper layers capture larger-scale context (the overall shape of a tumor) rather than just local details (a single edge).
In nnU-Net, strided convolutions (convolutions with a step size of 2) are used instead of pooling layers, which achieves downsampling while learning the downsampling operation rather than using a fixed rule.
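A quick PyTorch sketch of the difference — both halve the spatial resolution, but the strided convolution learns its downsampling weights (channel counts here are arbitrary):

```python
import torch
import torch.nn as nn

# Pooling applies a fixed rule; the stride-2 convolution has
# learnable weights, so the downsampling itself is trained.
pool = nn.MaxPool3d(kernel_size=2)
strided = nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 16, 32, 32, 32)
pooled = pool(x)
downsampled = strided(x)
print(pooled.shape)       # torch.Size([1, 16, 16, 16, 16])
print(downsampled.shape)  # torch.Size([1, 32, 16, 16, 16])
```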
These stabilize training by normalizing the internal activations of the network. Batch normalization normalizes across the mini-batch dimension — it works well with large batch sizes but breaks down when batches are small (common in 3D medical imaging where GPU memory limits batch size to 1–2). Instance normalization normalizes each sample independently and is often better for medical imaging — studies show it consistently outperforms batch normalization on medical image segmentation tasks. nnU-Net uses instance normalization by default.
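Putting the last three pieces together, here is a hedged sketch of an nnU-Net-style convolution block — convolution, then instance normalization, then LeakyReLU. The channel counts and negative slope are illustrative; note that batch size 1 works fine because instance norm never looks across the batch.

```python
import torch
import torch.nn as nn

# Conv -> instance norm -> LeakyReLU, the block pattern nnU-Net uses.
# Instance norm normalizes each sample (and channel) independently,
# so it behaves well even with a batch size of 1.
block = nn.Sequential(
    nn.Conv3d(4, 32, kernel_size=3, padding=1),
    nn.InstanceNorm3d(32),
    nn.LeakyReLU(negative_slope=0.01),  # small slope avoids dead neurons
)

x = torch.randn(1, 4, 16, 16, 16)  # batch size 1: fine for instance norm
out = block(x)
print(out.shape)  # torch.Size([1, 32, 16, 16, 16])
```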
If there is one architecture you must understand, it’s U-Net. Published in 2015 by Ronneberger, Fischer, and Brox, it revolutionized medical image segmentation and remains the most widely used architecture in the field nearly a decade later. A 2024 IEEE TPAMI review called it “the most widespread image segmentation architecture” across all medical imaging modalities. Every tool you’ve explored — MONAI, nnU-Net, Auto3DSeg — is built on the U-Net foundation.
U-Net has a distinctive U-shaped architecture with two halves:
The left side of the U. Repeated blocks of convolution + ReLU + downsampling progressively shrink the spatial resolution while increasing the number of feature channels. A 240×240 input might become 120×120, then 60×60, then 30×30. Each level captures increasingly abstract, high-level features — from “edges” to “tumor-like patterns.” This is the context arm: it understands what is in the image.
The right side of the U. Upsampling operations progressively restore the spatial resolution: 30×30 back to 60×60, then 120×120, then 240×240. At each level, the decoder combines upsampled features with corresponding encoder features through skip connections. The final output is a per-voxel probability map the same size as the input. This is the localization arm: it determines where things are.
The single most important feature of U-Net is the skip connections that directly connect each encoder level to the corresponding decoder level. Without them, the decoder would have to reconstruct fine spatial details from only the highly compressed bottleneck representation — like trying to draw a detailed map of a city from only the knowledge that “it’s a city with a river.”
Skip connections pass the high-resolution, low-level features (exact edges, textures, boundaries) directly to the decoder, which combines them with the high-level semantic understanding from the upsampling path. This is why U-Net produces such sharp segmentation boundaries — it has access to both the “what” (encoder features) and the “where” (skip connection features) simultaneously.
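The whole pattern fits in a few lines. Below is a deliberately tiny one-level 2D U-Net sketch in PyTorch — real U-Nets have four or five levels, but the encoder/decoder/skip structure is identical. `TinyUNet` and its channel counts are invented for illustration.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    """One-level 2D U-Net: encoder, bottleneck, decoder, one skip."""
    def __init__(self, c_in=1, c_out=2):
        super().__init__()
        self.enc = conv_block(c_in, 16)         # high-res, low-level features
        self.down = nn.MaxPool2d(2)             # halve resolution
        self.bottleneck = conv_block(16, 32)    # abstract, low-res features
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # restore resolution
        self.dec = conv_block(32, 16)           # 32 = 16 upsampled + 16 skip
        self.head = nn.Conv2d(16, c_out, 1)     # per-pixel class scores

    def forward(self, x):
        skip = self.enc(x)
        x = self.bottleneck(self.down(skip))
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)         # the skip connection
        return self.head(self.dec(x))

net = TinyUNet()
logits = net(torch.randn(1, 1, 64, 64))
print(logits.shape)  # torch.Size([1, 2, 64, 64])
```

The `torch.cat` line is the skip connection: the decoder literally sees the encoder's high-resolution features side by side with its own upsampled ones.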
Since 2015, many variants have been proposed. UNet++ (Zhou et al., 2020) redesigned skip connections to aggregate features at multiple semantic scales and to allow efficient ensembling of U-Nets at different depths. Attention U-Net adds gating mechanisms to skip connections so the decoder can focus on relevant regions. However, a critical benchmarking study (Gut et al., 2022) compared U-Net against five extensions under identical conditions across nine datasets and found that architectural variants don’t consistently improve over basic U-Net while resource demands increase. The takeaway: U-Net itself is not the bottleneck — how you configure everything around it matters more.
Brain MRIs are 3D volumes, but you have three fundamentally different ways to process them with a CNN. This is one of the most important practical decisions in medical image segmentation.
Treat each 2D slice of the 3D volume independently. The model sees one slice at a time — like looking at a single page of a book. Pros: Low GPU memory, fast training, can use pre-trained 2D models. Cons: Ignores spatial continuity between slices — the model doesn’t know that the tumor in slice 42 is connected to the tumor in slice 43. Can produce “flickering” predictions that jump between slices. Still surprisingly effective — studies report Dice scores as high as 0.990 for some tasks.
Process the entire 3D volume (or large 3D patches) with 3D convolutions. Each filter is a 3×3×3 cube that captures spatial relationships in all three dimensions. Pros: Captures full 3D context, smoother predictions across slices, generally better performance (Dice 0.925 vs 0.902 in one cardiac study). Cons: Much higher GPU memory (a 3D volume has orders of magnitude more voxels than a 2D slice), requires patch-based training for large images, slower training.
A compromise: feed several adjacent slices (typically 3–5) as multi-channel input to a 2D network. The model gets some inter-slice context without the full memory cost of 3D. A large-scale empirical study found that 2.5D methods consistently improve over 2D baselines, but 3D CNNs are not always the best choice — performance depends on the specific dataset and task. One study on brain metastases found 2.5D had better detection rates (79% vs 71%) while 3D had fewer false positives.
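A minimal sketch of how 2.5D input is assembled, assuming the volume is a PyTorch tensor of shape (slices, H, W); `slices_as_channels` is a hypothetical helper name:

```python
import torch

# 2.5D input: take slice z plus its neighbors and stack them as
# channels for a 2D network. Here 5 adjacent slices -> 5 channels.
volume = torch.randn(155, 240, 240)  # (slices, H, W), BraTS-like dims

def slices_as_channels(volume, z, context=2):
    lo, hi = z - context, z + context + 1
    return volume[lo:hi].unsqueeze(0)  # (batch=1, 2*context+1, H, W)

x = slices_as_channels(volume, z=77)
print(x.shape)  # torch.Size([1, 5, 240, 240])
```

The 2D network then predicts the segmentation for the central slice only, using the neighbors purely as context.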
A loss function measures how bad the model’s predictions are. During training, the optimizer adjusts the model’s weights to minimize this loss. The choice of loss function profoundly affects what the model learns — different losses reward different behaviors.
Before diving into specific losses, understand the core challenge: in brain tumor segmentation, the tumor is tiny compared to the brain. An enhancing tumor might occupy 1–2% of the total brain volume. If the model predicts “no tumor anywhere,” it’s correct for 98% of voxels. A naive loss function would reward this lazy behavior. Every loss function choice in medical segmentation is, at its core, a strategy for handling this imbalance.
The standard classification loss, applied independently to each voxel. It penalizes confident wrong predictions heavily. Problem: It treats every voxel equally, so the overwhelming majority of “background” voxels dominate the loss. The model can minimize cross-entropy by getting background right and largely ignoring the tumor. Strength: Provides stable gradients everywhere, even when predictions are far from ground truth, which helps training converge.
Directly optimizes the Dice similarity coefficient — the same metric used to evaluate BraTS submissions. It measures overlap between prediction and ground truth, and is naturally robust to class imbalance because it normalizes by the size of the prediction and ground truth. A study across six medical segmentation tasks showed that Dice-based losses are superior to cross-entropy when evaluating with Dice Score. Caveat: Dice loss has a theoretical bias toward specific region sizes and can produce poorly calibrated confidence scores.
The best of both worlds. The Dice component handles class imbalance and directly optimizes the evaluation metric. The cross-entropy component provides stable, well-behaved gradients that help training converge, especially early on. The first large-scale analysis of 20 loss functions across six datasets (Ma et al., 2021) found that compound loss functions are the most robust, with no single loss consistently winning across all tasks. nnU-Net uses an equally-weighted sum of Dice loss and cross-entropy loss.
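A minimal PyTorch sketch of the combined loss for the binary case — `soft_dice_loss` and `dice_ce_loss` are illustrative names, not nnU-Net's actual implementation:

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target, eps=1e-5):
    """Soft Dice loss for binary segmentation (target values in {0, 1})."""
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum()
    denom = probs.sum() + target.sum()
    return 1 - (2 * intersection + eps) / (denom + eps)

def dice_ce_loss(logits, target):
    """Equally weighted Dice + cross-entropy, the nnU-Net combination."""
    ce = F.binary_cross_entropy_with_logits(logits, target)
    return soft_dice_loss(logits, target) + ce

# A tiny "tumor": ~3.5% of voxels are foreground, mimicking imbalance.
target = torch.zeros(1, 1, 16, 16)
target[0, 0, :3, :3] = 1.0
loss = dice_ce_loss(torch.randn(1, 1, 16, 16), target)
print(float(loss) > 0)  # True
```

The Dice term normalizes by region size, so the 96% background cannot drown out the tumor; the cross-entropy term keeps gradients well-behaved when predictions are far off.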
Knowing the architecture and loss function isn’t enough — how you train matters just as much. These are the key decisions and their evidence-backed defaults.
The optimizer is the algorithm that updates the model’s weights based on the loss gradients. Adam adapts the learning rate per-parameter and is generally easier to use — a systematic evaluation on 534 brain tumor patients found Adam with polynomial decay significantly outperformed other combinations (p<10⁻⁴). SGD with Nesterov momentum is what nnU-Net uses (momentum=0.99). It’s more sensitive to learning rate settings but can find better optima on some tasks. For beginners, Adam is more forgiving. For challenge submissions, follow nnU-Net’s SGD approach.
The learning rate controls how large each weight update is. Too high: training is unstable. Too low: training is too slow. The polynomial decay schedule (used by nnU-Net) starts at 0.01 and smoothly decreases toward zero over the course of training. This allows large updates early (rapid learning) and fine adjustments later (polishing). A cosine annealing schedule works similarly and is common in MONAI pipelines.
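The schedule itself is one line. The 0.9 exponent below is the commonly cited nnU-Net default, so treat the exact numbers as a sketch:

```python
# nnU-Net-style polynomial decay: lr = lr0 * (1 - epoch/max_epochs)^power
def poly_lr(epoch, lr0=0.01, max_epochs=1000, power=0.9):
    return lr0 * (1 - epoch / max_epochs) ** power

print(round(poly_lr(0), 4))    # 0.01   -- large steps early (rapid learning)
print(round(poly_lr(500), 4))  # 0.0054 -- roughly half-strength at midpoint
print(round(poly_lr(999), 4))  # ~0.0   -- fine polishing at the end
```

In practice you would wrap such a function in a scheduler (e.g. PyTorch's `LambdaLR`) rather than setting the learning rate by hand.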
With only ~1,200 BraTS cases, augmentation is critical. nnU-Net applies augmentations on-the-fly during training: random rotations (±30° per axis), scaling (0.7–1.4×), elastic deformations, gamma correction for intensity, Gaussian noise, and mirroring. A systematic review of 300+ articles confirmed augmentation is effective across organs, modalities, and dataset sizes. The key: augmentations must produce plausible medical images. Random color jittering is nonsensical for MRI.
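A rough sketch of a few of these augmentations in plain PyTorch — the probabilities and ranges are illustrative, and real pipelines use dedicated transform libraries; the key point is that spatial transforms must hit the image and label together, while intensity transforms touch only the image:

```python
import torch

def augment(img, label):
    """On-the-fly augmentation sketch: mirroring, gamma, noise."""
    if torch.rand(1) < 0.5:  # random mirroring: image AND label flip
        img, label = torch.flip(img, (-1,)), torch.flip(label, (-1,))
    gamma = torch.empty(1).uniform_(0.7, 1.4)   # gamma correction
    img = img.clamp(min=0) ** gamma             # intensity only
    img = img + 0.01 * torch.randn_like(img)    # Gaussian noise, image only
    return img, label

img = torch.rand(1, 64, 64)
lab = (torch.rand(1, 64, 64) > 0.95).float()
aug_img, aug_lab = augment(img, lab)
print(aug_img.shape, aug_lab.shape)
```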
A 240×240×155 brain MRI with 4 channels doesn’t fit in GPU memory alongside a 3D U-Net. The solution: train on random patches (sub-volumes) of the full image, typically 128×128×128 voxels. During inference, use a sliding window across the full volume with overlapping patches, averaging predictions in overlap regions. nnU-Net automatically determines the optimal patch size based on your GPU memory and dataset properties.
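The overlap-averaging logic is simple enough to sketch in NumPy. This 1D toy (with a hypothetical `predict` callable) shows the accumulate-and-divide pattern that 3D sliding-window inference applies along all three axes:

```python
import numpy as np

def sliding_window_predict(volume, predict, patch=64, stride=32):
    """Run `predict` on overlapping patches; average where they overlap.
    1D for clarity -- real pipelines do the same over 3D patches."""
    acc = np.zeros_like(volume, dtype=float)
    counts = np.zeros_like(volume, dtype=float)
    for start in range(0, len(volume) - patch + 1, stride):
        sl = slice(start, start + patch)
        acc[sl] += predict(volume[sl])
        counts[sl] += 1  # track how many patches covered each position
    return acc / counts

signal = np.ones(128)
out = sliding_window_predict(signal, predict=lambda p: p * 2)
print(out.min(), out.max())  # 2.0 2.0 -- overlaps averaged correctly
```

Libraries such as MONAI ship a ready-made version of this (with Gaussian weighting of overlaps), so you rarely write it yourself.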
Instead of only computing loss at the final output, deep supervision computes loss at multiple resolution levels of the decoder. This provides gradient signal directly to earlier layers, helping them learn meaningful features faster. An optimized Residual U-Net with deep supervision achieved a Dice score of 0.9498 on BraTS 2018. nnU-Net uses deep supervision during training but only uses the full-resolution output during inference.
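A sketch of the idea in PyTorch, assuming the decoder returns a list of outputs from full resolution downward; the halving weights are illustrative, not a specific framework's values:

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(outputs, target, loss_fn):
    """Sum weighted losses at each decoder resolution; the target is
    downsampled to match each auxiliary output."""
    total = 0.0
    for i, out in enumerate(outputs):  # full-resolution output first
        tgt = F.interpolate(target, size=out.shape[2:], mode="nearest")
        total = total + (0.5 ** i) * loss_fn(out, tgt)
    return total

target = (torch.rand(1, 1, 32, 32) > 0.9).float()
outputs = [torch.randn(1, 1, 32, 32),   # final decoder output
           torch.randn(1, 1, 16, 16)]   # auxiliary half-resolution output
loss = deep_supervision_loss(outputs, target,
                             F.binary_cross_entropy_with_logits)
print(float(loss) > 0)  # True
```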
Now that you understand the building blocks, you can appreciate what nnU-Net actually does: it takes all of the decisions described above — architecture depth, patch size, batch size, normalization, loss function, augmentation, optimizer, learning rate, post-processing — and automatically configures every single one based on the properties of your dataset. The name “no-new-Net” is the point: there’s no architectural innovation. The innovation is in the systematic optimization of every other component.
nnU-Net automatically analyzes your dataset: image sizes, voxel spacings, intensity distributions, class frequencies, and number of training cases. This “fingerprint” drives all subsequent decisions. You trigger this with nnUNetv2_plan_and_preprocess.
Hard-coded heuristic rules translate the fingerprint into a pipeline configuration: resampling target resolution, whether to use 2D or 3D architecture, network depth (how many downsampling stages), number of feature channels per stage, patch size, and batch size — all optimized to fit in GPU memory while maximizing performance.
Some decisions can’t be made by rules alone. nnU-Net trains all applicable configurations with 5-fold cross-validation, then selects the best one (or ensembles multiple). It also empirically determines post-processing (whether to remove small connected components). This is the “trial-and-error” step, done automatically.
Loss function (Dice + CE), optimizer (SGD, momentum 0.99), activation function (LeakyReLU), normalization (instance norm), training schedule (1000 epochs), augmentation strategy, deep supervision during training.
Resampling resolution, network topology (depth, feature channels per stage), patch size, batch size, whether to include 3D cascade configuration.
Best configuration (2D vs 3D full-res vs cascade), whether to ensemble configurations, post-processing decisions.
The raw output of a segmentation model is a probability map — each voxel gets a probability for each class. Converting this to a final segmentation and cleaning it up is post-processing.
nnU-Net trains 5 models using 5-fold cross-validation. At inference, predictions from all 5 models are averaged before taking the argmax. This ensemble is more robust than any single model and typically adds 1–2 Dice points. It’s one of the easiest performance boosts available.
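The key detail is to average probabilities before the argmax, not the discrete labels. A toy PyTorch sketch, with random logits standing in for the five fold models:

```python
import torch

# 5-fold ensembling: average the softmax probabilities from all folds,
# *then* take the argmax over classes.
fold_logits = [torch.randn(1, 3, 8, 8) for _ in range(5)]  # 3 classes
probs = torch.stack([l.softmax(dim=1) for l in fold_logits])
segmentation = probs.mean(dim=0).argmax(dim=1)  # per-voxel class labels
print(segmentation.shape)  # torch.Size([1, 8, 8])
```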
If both 2D and 3D configurations perform well, nnU-Net can average their predictions too. This captures complementary information: the 2D model may be better at sharp in-plane boundaries while the 3D model better understands through-plane continuity.
The model may predict small, isolated “islands” of tumor in regions where no tumor exists (false positives). Connected component analysis identifies these disconnected predictions and removes those below a size threshold. nnU-Net automatically decides whether this post-processing helps based on validation performance.
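A minimal sketch of this cleanup with SciPy's connected-component labeling (`remove_small_components` and the threshold are illustrative):

```python
import numpy as np
from scipy import ndimage

def remove_small_components(mask, min_voxels):
    """Drop predicted 'islands' smaller than min_voxels voxels."""
    labeled, n = ndimage.label(mask)  # assign an id to each island
    if n == 0:
        return mask
    sizes = np.array(ndimage.sum(mask, labeled, range(1, n + 1)))
    keep_labels = np.where(sizes >= min_voxels)[0] + 1
    return mask & np.isin(labeled, keep_labels)

mask = np.zeros((20, 20), dtype=bool)
mask[2:10, 2:10] = True   # plausible tumor: 64 voxels
mask[15, 15] = True       # isolated 1-voxel false positive
cleaned = remove_small_components(mask, min_voxels=10)
print(cleaned.sum())  # 64 -- the island is gone
```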
Apply augmentations (flipping, rotation) to the input image at inference time, run the model on each augmented version, then average the (un-augmented) predictions. This is like asking the model to look at the same scan from multiple angles and averaging its opinions. Not part of default nnU-Net but commonly used by top BraTS teams for a small additional boost.
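A sketch of flip-based TTA in PyTorch — a 1×1 convolution stands in for the trained model:

```python
import torch

def tta_predict(model, x):
    """Flip-based TTA: predict on the original and mirrored inputs,
    un-flip each prediction, then average."""
    preds = [model(x)]
    for dim in (2, 3):  # flip along height, then width
        flipped = torch.flip(x, dims=(dim,))
        preds.append(torch.flip(model(flipped), dims=(dim,)))
    return torch.stack(preds).mean(dim=0)

model = torch.nn.Conv2d(1, 2, 1)  # stand-in for a trained network
x = torch.randn(1, 1, 8, 8)
out = tta_predict(model, x)
print(out.shape)  # torch.Size([1, 2, 8, 8])
```

Un-flipping before averaging is the easy-to-miss step: each prediction must be mapped back to the original orientation before the opinions are combined.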
Check: Is your learning rate too high? (Try reducing by 10x.) Are your images and labels correctly paired? (Load a sample and visualize the overlay in 3D Slicer.) Is the data normalized? (Raw MRI intensities can be enormous numbers that destabilize training.) Is the model receiving gradient? (Check for NaN values in the loss.)
Check: Are you using Dice loss or Dice+CE? (Pure CE often leads to this.) Does your patch sampling include foreground? (If all patches are 99% background, the model learns to predict background — MONAI’s RandCropByPosNegLabeld can force a 1:1 foreground/background ratio.) Are your labels correct? (A label file of all zeros means there’s nothing to learn.) When training with limited data and strong class imbalance, the distribution of model activations can shift, causing systematic under-segmentation of small structures.
Check: Reduce patch size (the most effective fix), reduce batch size to 1, use mixed-precision training (torch.cuda.amp), try 2D instead of 3D, or use gradient checkpointing. nnU-Net handles this automatically by selecting patch and batch sizes that fit your GPU.
Overfitting. Solutions: increase data augmentation (the cheapest fix), use early stopping or select the best validation checkpoint, reduce model complexity, add dropout or weight decay. Ensemble methods also reduce overfitting. With limited data, even 20 fine-tuning cases from the target domain can substantially recover performance.
Read nnU-Net’s documentation/how_to_use_nnunet.md in full. Try running nnUNetv2_plan_and_preprocess on a small dataset and examining the generated nnUNetPlans.json to see what nnU-Net decided about your data.