You’ve seen the tools (Week 3). Now it’s time to understand what’s actually happening inside them. This week builds your deep learning foundations — from how a single convolution works to how nnU-Net assembles an entire self-configuring segmentation pipeline.
A convolutional neural network (CNN) is a type of neural network specifically designed to process grid-structured data — images. Instead of connecting every input pixel to every neuron (which would require billions of connections for a brain MRI), CNNs use a clever trick: small, sliding filters that scan across the image to detect patterns. These filters share their weights everywhere they look, dramatically reducing the number of parameters while making the network naturally good at recognizing patterns regardless of where they appear.
Think of it like this: instead of memorizing the entire image, the CNN learns a set of “pattern detectors.” Early layers learn simple patterns (edges, bright spots, dark lines), and deeper layers combine those simple patterns into complex ones (a ring of enhancement around a necrotic core, the fuzzy boundary of edema). This hierarchical feature extraction is what makes CNNs so powerful for medical imaging.
The heart of a CNN. A small filter (typically 3×3 pixels in 2D, or 3×3×3 voxels in 3D) slides across the image, computing a dot product at each position. Each filter learns to detect a specific pattern. The output — called a feature map — highlights where that pattern appears in the image. A layer may have 16, 32, 64, or more filters running in parallel, each detecting different features.
In 3D medical imaging, convolutions operate on volumes, meaning a 3×3×3 filter examines a small cube of voxels at a time. This captures spatial relationships in all three dimensions — critical for brain MRI where tumors are 3D structures.
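To make this concrete, here is a minimal PyTorch sketch of a single 3D convolution layer. The 4 input channels mirror the four BraTS MRI sequences; the 32³ toy volume and filter count are illustrative, not defaults from any specific pipeline.

```python
import torch
import torch.nn as nn

# A single 3D convolution layer: 4 input channels (e.g. the four MRI
# sequences), 16 learned filters, each a 3x3x3 cube of weights.
conv = nn.Conv3d(in_channels=4, out_channels=16, kernel_size=3, padding=1)

# A toy "volume": batch of 1, 4 channels, 32x32x32 voxels.
x = torch.randn(1, 4, 32, 32, 32)
feature_maps = conv(x)  # one feature map per filter

print(feature_maps.shape)  # torch.Size([1, 16, 32, 32, 32])
```

Note that the output has one feature map per filter, at the same spatial resolution as the input (thanks to `padding=1`).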
After each convolution, an activation function introduces non-linearity — without it, stacking multiple convolution layers would be mathematically equivalent to a single layer, and the network couldn’t learn complex patterns. ReLU (Rectified Linear Unit) is the most common: it simply sets all negative values to zero. LeakyReLU (used by nnU-Net) lets a small fraction of negative values through, which prevents “dead neurons” — neurons that stop learning because they always output zero.
These downsample the feature maps, reducing spatial resolution by a factor of 2 (typically). Max pooling keeps the highest value in each small region; average pooling takes the mean. Downsampling serves two purposes: it reduces computation, and it increases the receptive field — how much of the original image each neuron “sees.” This is how deeper layers capture larger-scale context (the overall shape of a tumor) rather than just local details (a single edge).
In nnU-Net, strided convolutions (convolutions with a step size of 2) are used instead of pooling layers, which achieves downsampling while learning the downsampling operation rather than using a fixed rule.
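A quick PyTorch sketch of the difference — both halve the spatial resolution, but the strided convolution learns its downsampling weights (channel counts here are arbitrary):

```python
import torch
import torch.nn as nn

# Pooling applies a fixed rule; the stride-2 convolution has
# learnable weights, so the downsampling itself is trained.
pool = nn.MaxPool3d(kernel_size=2)
strided = nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 16, 32, 32, 32)
pooled = pool(x)
downsampled = strided(x)
print(pooled.shape)       # torch.Size([1, 16, 16, 16, 16])
print(downsampled.shape)  # torch.Size([1, 32, 16, 16, 16])
```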
These stabilize training by normalizing the internal activations of the network. Batch normalization normalizes across the mini-batch dimension — it works well with large batch sizes but breaks down when batches are small (common in 3D medical imaging where GPU memory limits batch size to 1–2). Instance normalization normalizes each sample independently and is often better for medical imaging — studies show it consistently outperforms batch normalization on medical image segmentation tasks. nnU-Net uses instance normalization by default.
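Putting the last three pieces together, here is a hedged sketch of an nnU-Net-style convolution block — convolution, then instance normalization, then LeakyReLU. The channel counts and negative slope are illustrative; note that batch size 1 works fine because instance norm never looks across the batch.

```python
import torch
import torch.nn as nn

# Conv -> instance norm -> LeakyReLU, the block pattern nnU-Net uses.
# Instance norm normalizes each sample (and channel) independently,
# so it behaves well even with a batch size of 1.
block = nn.Sequential(
    nn.Conv3d(4, 32, kernel_size=3, padding=1),
    nn.InstanceNorm3d(32),
    nn.LeakyReLU(negative_slope=0.01),  # small slope avoids dead neurons
)

x = torch.randn(1, 4, 16, 16, 16)  # batch size 1: fine for instance norm
out = block(x)
print(out.shape)  # torch.Size([1, 32, 16, 16, 16])
```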
If there is one architecture you must understand, it’s U-Net. Published in 2015 by Ronneberger, Fischer, and Brox, it revolutionized medical image segmentation and remains the most widely used architecture in the field nearly a decade later. A 2024 IEEE TPAMI review called it “the most widespread image segmentation architecture” across all medical imaging modalities. Every tool you’ve explored — MONAI, nnU-Net, Auto3DSeg — is built on the U-Net foundation.
U-Net has a distinctive U-shaped architecture with two halves:
The left side of the U. Repeated blocks of convolution + ReLU + downsampling progressively shrink the spatial resolution while increasing the number of feature channels. A 240×240 input might become 120×120, then 60×60, then 30×30. Each level captures increasingly abstract, high-level features — from “edges” to “tumor-like patterns.” This is the context arm: it understands what is in the image.
The right side of the U. Upsampling operations progressively restore the spatial resolution: 30×30 back to 60×60, then 120×120, then 240×240. At each level, the decoder combines upsampled features with corresponding encoder features through skip connections. The final output is a per-voxel probability map the same size as the input. This is the localization arm: it determines where things are.
The single most important feature of U-Net is the skip connections that directly connect each encoder level to the corresponding decoder level. Without them, the decoder would have to reconstruct fine spatial details from only the highly compressed bottleneck representation — like trying to draw a detailed map of a city from only the knowledge that “it’s a city with a river.”
Skip connections pass the high-resolution, low-level features (exact edges, textures, boundaries) directly to the decoder, which combines them with the high-level semantic understanding from the upsampling path. This is why U-Net produces such sharp segmentation boundaries — it has access to both the “what” (encoder features) and the “where” (skip connection features) simultaneously.
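The whole pattern fits in a few lines. Below is a deliberately tiny one-level 2D U-Net sketch in PyTorch — real U-Nets have four or five levels, but the encoder/decoder/skip structure is identical. `TinyUNet` and its channel counts are invented for illustration.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    """One-level 2D U-Net: encoder, bottleneck, decoder, one skip."""
    def __init__(self, c_in=1, c_out=2):
        super().__init__()
        self.enc = conv_block(c_in, 16)         # high-res, low-level features
        self.down = nn.MaxPool2d(2)             # halve resolution
        self.bottleneck = conv_block(16, 32)    # abstract, low-res features
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # restore resolution
        self.dec = conv_block(32, 16)           # 32 = 16 upsampled + 16 skip
        self.head = nn.Conv2d(16, c_out, 1)     # per-pixel class scores

    def forward(self, x):
        skip = self.enc(x)
        x = self.bottleneck(self.down(skip))
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)         # the skip connection
        return self.head(self.dec(x))

net = TinyUNet()
logits = net(torch.randn(1, 1, 64, 64))
print(logits.shape)  # torch.Size([1, 2, 64, 64])
```

The `torch.cat` line is the skip connection: the decoder literally sees the encoder's high-resolution features side by side with its own upsampled ones.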
Since 2015, many variants have been proposed. UNet++ (Zhou et al., 2020) redesigned skip connections to aggregate features at multiple semantic scales and to allow efficient ensembling of U-Nets at different depths. Attention U-Net adds gating mechanisms to skip connections so the decoder can focus on relevant regions. However, a critical benchmarking study (Gut et al., 2022) compared U-Net against five extensions under identical conditions across nine datasets and found that architectural variants don’t consistently improve over basic U-Net while resource demands increase. The takeaway: U-Net itself is not the bottleneck — how you configure everything around it matters more.
Brain MRIs are 3D volumes, but you have three fundamentally different ways to process them with a CNN. This is one of the most important practical decisions in medical image segmentation.
Treat each 2D slice of the 3D volume independently. The model sees one slice at a time — like looking at a single page of a book. Pros: Low GPU memory, fast training, can use pre-trained 2D models. Cons: Ignores spatial continuity between slices — the model doesn’t know that the tumor in slice 42 is connected to the tumor in slice 43. Can produce “flickering” predictions that jump between slices. Still surprisingly effective — studies report Dice scores as high as 0.990 for some tasks.
Process the entire 3D volume (or large 3D patches) with 3D convolutions. Each filter is a 3×3×3 cube that captures spatial relationships in all three dimensions. Pros: Captures full 3D context, smoother predictions across slices, generally better performance (Dice 0.925 vs 0.902 in one cardiac study). Cons: Much higher GPU memory (a 3D volume has orders of magnitude more voxels than a 2D slice), requires patch-based training for large images, slower training.
A compromise: feed several adjacent slices (typically 3–5) as multi-channel input to a 2D network. The model gets some inter-slice context without the full memory cost of 3D. A large-scale empirical study found that 2.5D methods consistently improve over 2D baselines, but 3D CNNs are not always the best choice — performance depends on the specific dataset and task. One study on brain metastases found 2.5D had better detection rates (79% vs 71%) while 3D had fewer false positives.
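A minimal sketch of how 2.5D input is assembled, assuming the volume is a PyTorch tensor of shape (slices, H, W); `slices_as_channels` is a hypothetical helper name:

```python
import torch

# 2.5D input: take slice z plus its neighbors and stack them as
# channels for a 2D network. Here 5 adjacent slices -> 5 channels.
volume = torch.randn(155, 240, 240)  # (slices, H, W), BraTS-like dims

def slices_as_channels(volume, z, context=2):
    lo, hi = z - context, z + context + 1
    return volume[lo:hi].unsqueeze(0)  # (batch=1, 2*context+1, H, W)

x = slices_as_channels(volume, z=77)
print(x.shape)  # torch.Size([1, 5, 240, 240])
```

The 2D network then predicts the segmentation for the central slice only, using the neighbors purely as context.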
A loss function measures how bad the model’s predictions are. During training, the optimizer adjusts the model’s weights to minimize this loss. The choice of loss function profoundly affects what the model learns — different losses reward different behaviors.
Before diving into specific losses, understand the core challenge: in brain tumor segmentation, the tumor is tiny compared to the brain. An enhancing tumor might occupy 1–2% of the total brain volume. If the model predicts “no tumor anywhere,” it’s correct for 98% of voxels. A naive loss function would reward this lazy behavior. Every loss function choice in medical segmentation is, at its core, a strategy for handling this imbalance.
The standard classification loss, applied independently to each voxel. It penalizes confident wrong predictions heavily. Problem: It treats every voxel equally, so the overwhelming majority of “background” voxels dominate the loss. The model can minimize cross-entropy by getting background right and largely ignoring the tumor. Strength: Provides stable gradients everywhere, even when predictions are far from ground truth, which helps training converge.
Directly optimizes the Dice similarity coefficient — the same metric used to evaluate BraTS submissions. It measures overlap between prediction and ground truth, and is naturally robust to class imbalance because it normalizes by the size of the prediction and ground truth. A study across six medical segmentation tasks showed that Dice-based losses are superior to cross-entropy when evaluating with Dice Score. Caveat: Dice loss has a theoretical bias toward specific region sizes and can produce poorly calibrated confidence scores.
The best of both worlds. The Dice component handles class imbalance and directly optimizes the evaluation metric. The cross-entropy component provides stable, well-behaved gradients that help training converge, especially early on. The first large-scale analysis of 20 loss functions across six datasets (Ma et al., 2021) found that compound loss functions are the most robust, with no single loss consistently winning across all tasks. nnU-Net uses an equally-weighted sum of Dice loss and cross-entropy loss.
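A minimal PyTorch sketch of the combined loss for the binary case — `soft_dice_loss` and `dice_ce_loss` are illustrative names, not nnU-Net's actual implementation:

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target, eps=1e-5):
    """Soft Dice loss for binary segmentation (target values in {0, 1})."""
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum()
    denom = probs.sum() + target.sum()
    return 1 - (2 * intersection + eps) / (denom + eps)

def dice_ce_loss(logits, target):
    """Equally weighted Dice + cross-entropy, the nnU-Net combination."""
    ce = F.binary_cross_entropy_with_logits(logits, target)
    return soft_dice_loss(logits, target) + ce

# A tiny "tumor": ~3.5% of voxels are foreground, mimicking imbalance.
target = torch.zeros(1, 1, 16, 16)
target[0, 0, :3, :3] = 1.0
loss = dice_ce_loss(torch.randn(1, 1, 16, 16), target)
print(float(loss) > 0)  # True
```

The Dice term normalizes by region size, so the 96% background cannot drown out the tumor; the cross-entropy term keeps gradients well-behaved when predictions are far off.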
Knowing the architecture and loss function isn’t enough — how you train matters just as much. These are the key decisions and their evidence-backed defaults.
The optimizer is the algorithm that updates the model’s weights based on the loss gradients. Adam adapts the learning rate per-parameter and is generally easier to use — a systematic evaluation on 534 brain tumor patients found Adam with polynomial decay significantly outperformed other combinations (p<10⁻⁴). SGD with Nesterov momentum is what nnU-Net uses (momentum=0.99). It’s more sensitive to learning rate settings but can find better optima on some tasks. For beginners, Adam is more forgiving. For challenge submissions, follow nnU-Net’s SGD approach.
The learning rate controls how large each weight update is. Too high: training is unstable. Too low: training is too slow. The polynomial decay schedule (used by nnU-Net) starts at 0.01 and smoothly decreases toward zero over the course of training. This allows large updates early (rapid learning) and fine adjustments later (polishing). A cosine annealing schedule works similarly and is common in MONAI pipelines.
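The schedule itself is one line. The 0.9 exponent below is the commonly cited nnU-Net default, so treat the exact numbers as a sketch:

```python
# nnU-Net-style polynomial decay: lr = lr0 * (1 - epoch/max_epochs)^power
def poly_lr(epoch, lr0=0.01, max_epochs=1000, power=0.9):
    return lr0 * (1 - epoch / max_epochs) ** power

print(round(poly_lr(0), 4))    # 0.01   -- large steps early (rapid learning)
print(round(poly_lr(500), 4))  # 0.0054 -- roughly half-strength at midpoint
print(round(poly_lr(999), 4))  # ~0.0   -- fine polishing at the end
```

In practice you would wrap such a function in a scheduler (e.g. PyTorch's `LambdaLR`) rather than setting the learning rate by hand.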
With only ~1,200 BraTS cases, augmentation is critical. nnU-Net applies augmentations on-the-fly during training: random rotations (±30° per axis), scaling (0.7–1.4×), elastic deformations, gamma correction for intensity, Gaussian noise, and mirroring. A systematic review of 300+ articles confirmed augmentation is effective across organs, modalities, and dataset sizes. The key: augmentations must produce plausible medical images. Random color jittering is nonsensical for MRI.
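A rough sketch of a few of these augmentations in plain PyTorch — the probabilities and ranges are illustrative, and real pipelines use dedicated transform libraries; the key point is that spatial transforms must hit the image and label together, while intensity transforms touch only the image:

```python
import torch

def augment(img, label):
    """On-the-fly augmentation sketch: mirroring, gamma, noise."""
    if torch.rand(1) < 0.5:  # random mirroring: image AND label flip
        img, label = torch.flip(img, (-1,)), torch.flip(label, (-1,))
    gamma = torch.empty(1).uniform_(0.7, 1.4)   # gamma correction
    img = img.clamp(min=0) ** gamma             # intensity only
    img = img + 0.01 * torch.randn_like(img)    # Gaussian noise, image only
    return img, label

img = torch.rand(1, 64, 64)
lab = (torch.rand(1, 64, 64) > 0.95).float()
aug_img, aug_lab = augment(img, lab)
print(aug_img.shape, aug_lab.shape)
```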
A 240×240×155 brain MRI with 4 channels doesn’t fit in GPU memory alongside a 3D U-Net. The solution: train on random patches (sub-volumes) of the full image, typically 128×128×128 voxels. During inference, use a sliding window across the full volume with overlapping patches, averaging predictions in overlap regions. nnU-Net automatically determines the optimal patch size based on your GPU memory and dataset properties.
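The overlap-averaging logic is simple enough to sketch in NumPy. This 1D toy (with a hypothetical `predict` callable) shows the accumulate-and-divide pattern that 3D sliding-window inference applies along all three axes:

```python
import numpy as np

def sliding_window_predict(volume, predict, patch=64, stride=32):
    """Run `predict` on overlapping patches; average where they overlap.
    1D for clarity -- real pipelines do the same over 3D patches."""
    acc = np.zeros_like(volume, dtype=float)
    counts = np.zeros_like(volume, dtype=float)
    for start in range(0, len(volume) - patch + 1, stride):
        sl = slice(start, start + patch)
        acc[sl] += predict(volume[sl])
        counts[sl] += 1  # track how many patches covered each position
    return acc / counts

signal = np.ones(128)
out = sliding_window_predict(signal, predict=lambda p: p * 2)
print(out.min(), out.max())  # 2.0 2.0 -- overlaps averaged correctly
```

Libraries such as MONAI ship a ready-made version of this (with Gaussian weighting of overlaps), so you rarely write it yourself.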
Instead of only computing loss at the final output, deep supervision computes loss at multiple resolution levels of the decoder. This provides gradient signal directly to earlier layers, helping them learn meaningful features faster. An optimized Residual U-Net with deep supervision achieved a Dice score of 0.9498 on BraTS 2018. nnU-Net uses deep supervision during training but only uses the full-resolution output during inference.
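A sketch of the idea in PyTorch, assuming the decoder returns a list of outputs from full resolution downward; the halving weights are illustrative, not a specific framework's values:

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(outputs, target, loss_fn):
    """Sum weighted losses at each decoder resolution; the target is
    downsampled to match each auxiliary output."""
    total = 0.0
    for i, out in enumerate(outputs):  # full-resolution output first
        tgt = F.interpolate(target, size=out.shape[2:], mode="nearest")
        total = total + (0.5 ** i) * loss_fn(out, tgt)
    return total

target = (torch.rand(1, 1, 32, 32) > 0.9).float()
outputs = [torch.randn(1, 1, 32, 32),   # final decoder output
           torch.randn(1, 1, 16, 16)]   # auxiliary half-resolution output
loss = deep_supervision_loss(outputs, target,
                             F.binary_cross_entropy_with_logits)
print(float(loss) > 0)  # True
```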
Now that you understand the building blocks, you can appreciate what nnU-Net actually does: it takes all of the decisions described above — architecture depth, patch size, batch size, normalization, loss function, augmentation, optimizer, learning rate, post-processing — and automatically configures every single one based on the properties of your dataset. The name “no-new-Net” is the point: there’s no architectural innovation. The innovation is in the systematic optimization of every other component.
nnU-Net automatically analyzes your dataset: image sizes, voxel spacings, intensity distributions, class frequencies, and number of training cases. This “fingerprint” drives all subsequent decisions. You trigger this with nnUNetv2_plan_and_preprocess.
Hard-coded heuristic rules translate the fingerprint into a pipeline configuration: resampling target resolution, whether to use 2D or 3D architecture, network depth (how many downsampling stages), number of feature channels per stage, patch size, and batch size — all optimized to fit in GPU memory while maximizing performance.
Some decisions can’t be made by rules alone. nnU-Net trains all applicable configurations with 5-fold cross-validation, then selects the best one (or ensembles multiple). It also empirically determines post-processing (whether to remove small connected components). This is the “trial-and-error” step, done automatically.
Loss function (Dice + CE), optimizer (SGD, momentum 0.99), activation function (LeakyReLU), normalization (instance norm), training schedule (1000 epochs), augmentation strategy, deep supervision during training.
Resampling resolution, network topology (depth, feature channels per stage), patch size, batch size, whether to include 3D cascade configuration.
Best configuration (2D vs 3D full-res vs cascade), whether to ensemble configurations, post-processing decisions.
The raw output of a segmentation model is a probability map — each voxel gets a probability for each class. Converting this to a final segmentation and cleaning it up is post-processing.
nnU-Net trains 5 models using 5-fold cross-validation. At inference, predictions from all 5 models are averaged before taking the argmax. This ensemble is more robust than any single model and typically adds 1–2 Dice points. It’s one of the easiest performance boosts available.
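The key detail is to average probabilities before the argmax, not the discrete labels. A toy PyTorch sketch, with random logits standing in for the five fold models:

```python
import torch

# 5-fold ensembling: average the softmax probabilities from all folds,
# *then* take the argmax over classes.
fold_logits = [torch.randn(1, 3, 8, 8) for _ in range(5)]  # 3 classes
probs = torch.stack([l.softmax(dim=1) for l in fold_logits])
segmentation = probs.mean(dim=0).argmax(dim=1)  # per-voxel class labels
print(segmentation.shape)  # torch.Size([1, 8, 8])
```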
If both 2D and 3D configurations perform well, nnU-Net can average their predictions too. This captures complementary information: the 2D model may be better at sharp in-plane boundaries while the 3D model better understands through-plane continuity.
The model may predict small, isolated “islands” of tumor in regions where no tumor exists (false positives). Connected component analysis identifies these disconnected predictions and removes those below a size threshold. nnU-Net automatically decides whether this post-processing helps based on validation performance.
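A minimal sketch of this cleanup with SciPy's connected-component labeling (`remove_small_components` and the threshold are illustrative):

```python
import numpy as np
from scipy import ndimage

def remove_small_components(mask, min_voxels):
    """Drop predicted 'islands' smaller than min_voxels voxels."""
    labeled, n = ndimage.label(mask)  # assign an id to each island
    if n == 0:
        return mask
    sizes = np.array(ndimage.sum(mask, labeled, range(1, n + 1)))
    keep_labels = np.where(sizes >= min_voxels)[0] + 1
    return mask & np.isin(labeled, keep_labels)

mask = np.zeros((20, 20), dtype=bool)
mask[2:10, 2:10] = True   # plausible tumor: 64 voxels
mask[15, 15] = True       # isolated 1-voxel false positive
cleaned = remove_small_components(mask, min_voxels=10)
print(cleaned.sum())  # 64 -- the island is gone
```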
Apply augmentations (flipping, rotation) to the input image at inference time, run the model on each augmented version, then average the (un-augmented) predictions. This is like asking the model to look at the same scan from multiple angles and averaging its opinions. Not part of default nnU-Net but commonly used by top BraTS teams for a small additional boost.
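A sketch of flip-based TTA in PyTorch — a 1×1 convolution stands in for the trained model:

```python
import torch

def tta_predict(model, x):
    """Flip-based TTA: predict on the original and mirrored inputs,
    un-flip each prediction, then average."""
    preds = [model(x)]
    for dim in (2, 3):  # flip along height, then width
        flipped = torch.flip(x, dims=(dim,))
        preds.append(torch.flip(model(flipped), dims=(dim,)))
    return torch.stack(preds).mean(dim=0)

model = torch.nn.Conv2d(1, 2, 1)  # stand-in for a trained network
x = torch.randn(1, 1, 8, 8)
out = tta_predict(model, x)
print(out.shape)  # torch.Size([1, 2, 8, 8])
```

Un-flipping before averaging is the easy-to-miss step: each prediction must be mapped back to the original orientation before the opinions are combined.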
Check: Is your learning rate too high? (Try reducing by 10x.) Are your images and labels correctly paired? (Load a sample and visualize the overlay in 3D Slicer.) Is the data normalized? (Raw MRI intensities can be enormous numbers that destabilize training.) Is the model receiving gradient? (Check for NaN values in the loss.)
Check: Are you using Dice loss or Dice+CE? (Pure CE often leads to this.) Does your patch sampling include foreground? (If all patches are 99% background, the model learns to predict background — MONAI’s RandCropByPosNegLabeld can force a 1:1 foreground/background ratio.) Are your labels correct? (A label file of all zeros means there’s nothing to learn.) When training with limited data and strong class imbalance, the distribution of model activations can shift, causing systematic under-segmentation of small structures.
Check: Reduce patch size (the most effective fix), reduce batch size to 1, use mixed-precision training (torch.cuda.amp), try 2D instead of 3D, or use gradient checkpointing. nnU-Net handles this automatically by selecting patch and batch sizes that fit your GPU.
Overfitting. Solutions: increase data augmentation (the cheapest fix), use early stopping or select the best validation checkpoint, reduce model complexity, add dropout or weight decay. Ensemble methods also reduce overfitting. With limited data, even 20 fine-tuning cases from the target domain can substantially recover performance.
Read nnU-Net’s documentation/how_to_use_nnunet.md in full. Try running nnUNetv2_plan_and_preprocess on a small dataset and examining the generated nnUNetPlans.json to see what nnU-Net decided about your data.