Last week was the “why” of clinical integration. This week is the “how” — the actual tools, containers, APIs, and pipelines that take a trained model out of a Jupyter notebook and make it usable by researchers, collaborators, and eventually clinicians.
There isn’t one way to deploy a segmentation model. The right approach depends on your audience, your infrastructure, and your goals. This week covers four levels, from simplest to most clinical.
Level 1: Interactive demo. An interactive web app where colleagues can upload a NIfTI file and see the segmentation. Takes hours to build, runs on a laptop. Great for papers, grant demos, and collaborator feedback. Not for clinical use.
Level 2: Containerized CLI. Your model, environment, and dependencies in a Docker container with a command-line interface. Anyone can run inference with one command. TotalSegmentator popularized this pattern: TotalSegmentator -i input.nii.gz -o output. Reproducible, shareable, publishable.
Level 3: Orchestrated multi-model pipeline. Run multiple segmentation algorithms, fuse their predictions, and produce a standardized result. The BraTS Toolkit wraps the full pipeline: preprocessing, multi-algorithm segmentation, and fusion. Used for research benchmarking and multi-model ensembling.
Level 4: PACS-integrated service. Your model runs as a DICOM-in, DICOM-out service integrated into the hospital PACS. Radiologists see the segmentation in their normal viewer. Requires DICOM handling, GPU infrastructure, monitoring, and regulatory compliance. The ultimate goal.
The fastest way to make your model accessible. Streamlit turns a Python script into an interactive web application. In under 100 lines of code, you can build an app where users upload a brain MRI, your model runs inference, and the segmentation is displayed as an interactive overlay with axial/sagittal/coronal sliders.
```python
# Minimal Streamlit app for brain tumor segmentation
import tempfile

import nibabel as nib
import numpy as np
import streamlit as st

st.title("Brain Tumor Segmentation Demo")
uploaded = st.file_uploader("Upload NIfTI (T1ce)", type=["nii", "gz"])

if uploaded:
    # nibabel loads from a path, so buffer the upload to a temp file first
    with tempfile.NamedTemporaryFile(suffix=".nii.gz", delete=False) as tmp:
        tmp.write(uploaded.getbuffer())
    nii = nib.load(tmp.name)
    img = nii.get_fdata()

    # Run model (your trained nnU-Net/MONAI model)
    prediction = run_inference(img)  # Your function here

    # Display with interactive slice selector along the chosen axis
    axis = st.selectbox("View", ["Axial", "Sagittal", "Coronal"])
    ax = {"Sagittal": 0, "Coronal": 1, "Axial": 2}[axis]
    slice_idx = st.slider("Slice", 0, img.shape[ax] - 1, img.shape[ax] // 2)

    # Overlay segmentation on image
    fig = create_overlay(img, prediction, slice_idx, axis)  # Your function here
    st.pyplot(fig)

    # Display metrics: voxel volume in mm^3 comes from the NIfTI header
    voxel_volume = np.prod(nii.header.get_zooms()[:3])
    vol_ml = np.sum(prediction > 0) * voxel_volume / 1000
    st.metric("Tumor Volume", f"{vol_ml:.1f} mL")
```
Hosting options:
- Streamlit Cloud: free hosting for public apps. Limited to CPU (no GPU), so inference will be slow for 3D models. Good for demos with lightweight 2D models or pre-computed results.
- Hugging Face Spaces: free hosting with optional GPU access (A10G, T4). Better for real-time inference.
- Institutional server: run on a GPU workstation behind your university’s firewall. Best for processing real (non-public) data.
FastAPI is the modern alternative when you need API-style deployment instead of a visual interface.
Docker packages your model, Python environment, CUDA drivers, and all dependencies into a container that runs identically on any machine. This solves the “works on my laptop” problem. It’s the foundation for MONAI Deploy (MAPs are Docker containers), challenge submissions (BraTS requires Docker), and clinical deployment.
```dockerfile
# Dockerfile for nnU-Net inference
FROM nvcr.io/nvidia/pytorch:23.10-py3

# Install nnU-Net
RUN pip install nnunetv2

# Copy trained model weights
COPY ./nnUNet_results /opt/nnUNet_results

# Set environment variables
ENV nnUNet_results=/opt/nnUNet_results
ENV nnUNet_raw=/opt/nnUNet_raw
ENV nnUNet_preprocessed=/opt/nnUNet_preprocessed

# Entrypoint: run inference
ENTRYPOINT ["nnUNetv2_predict"]
```

```shell
# Build and run
docker build -t brats-segmentation .
docker run --gpus all \
    -v /path/to/input:/input \
    -v /path/to/output:/output \
    brats-segmentation \
    -i /input -o /output -d 001 -c 3d_fullres
```
TotalSegmentator is the gold standard for how to package an nnU-Net model for practical use. It segments more than 100 anatomical structures from CT with a single command: TotalSegmentator -i input.nii.gz -o output. Under the hood, it’s nnU-Net in a pip-installable Python package. The MRI version achieved Dice 0.839 across its 80 structures, and on CT data it nearly matched the original CT model (0.966 vs 0.970). It’s available as both a command-line tool and a web app at totalsegmentator.com. This is the deployment pattern to emulate for your own models.
Dockerfile best practices for clinical AI: pin exact dependency versions (not just pip install nnunetv2 — use pip install nnunetv2==2.5.1). Include health checks. Log all predictions for audit trails. Tag images with the model version and training date. The published “Ten Simple Rules for Writing Dockerfiles” is required reading.

The BraTS Toolkit is a three-component system that standardizes the entire brain tumor segmentation workflow. It was developed specifically to bridge the gap between challenge-winning algorithms and practical use.
Preprocessor. Standardizes raw brain MRI data through the full pipeline: DICOM-to-NIfTI conversion, co-registration of all four modalities to a common space, atlas registration to SRI24 space, skull stripping, and intensity normalization. This produces BraTS-compatible data from any institutional format. It handles the preprocessing that Week 2 covered — but in an automated, reproducible pipeline.
Segmentor. Orchestrates multiple segmentation algorithms on the preprocessed data. Instead of running one model, you can run several BraTS challenge algorithms in parallel, each producing a candidate segmentation. The key insight from the original BraTS benchmark: no single algorithm performs best across all sub-regions. Running multiple models and combining them consistently outperforms any individual approach.
Fusionator. Combines the candidate segmentations into a single consensus result using fusion strategies: majority voting (simplest — each voxel gets the label predicted by the majority of models), SIMPLE fusion (iteratively selects the subset of models whose combination maximizes agreement), and weighted fusion (models weighted by their validation performance). A real-world evaluation found the toolkit performed well for round, well-demarcated tumors (97–100% accuracy for including necrosis and enhancing tumor), though complex infiltrative tumors still benefited from manual correction.
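Majority voting, the simplest of these fusion strategies, fits in a few lines of NumPy. This is a toy illustration of the idea, not the BraTS Toolkit implementation:

```python
# Toy majority-vote fusion of candidate segmentations: each voxel gets
# the label predicted by the most models (ties go to the lowest label)
import numpy as np

def majority_vote(masks: list) -> np.ndarray:
    """Fuse label maps of identical shape by voxel-wise plurality vote."""
    labels = np.unique(np.concatenate([m.ravel() for m in masks]))
    stacked = np.stack(masks)                               # (n_models, *volume)
    votes = np.stack([(stacked == l).sum(axis=0) for l in labels])
    return labels[np.argmax(votes, axis=0)]

# Three 1D "volumes" from three models; the voxel-wise majority wins
a = np.array([0, 1, 1, 2])
b = np.array([0, 1, 2, 2])
c = np.array([1, 1, 2, 0])
print(majority_vote([a, b, c]))  # → [0 1 2 2]
```

SIMPLE and weighted fusion build on the same counting step, but re-weight or drop models based on how well each agrees with the evolving consensus.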
```shell
# BraTS Toolkit workflow (conceptual)
pip install BraTS-Toolkit

# Step 1: Preprocess raw data to BraTS format
brats-preprocess -i /raw/patient001 -o /preprocessed

# Step 2: Run multiple segmentation algorithms
brats-segment -i /preprocessed -o /segmentations

# Step 3: Fuse candidate segmentations
brats-fuse -i /segmentations -o /final \
    --method simple_fusion
```
MONAI Deploy is the final stage of the MONAI ecosystem: MONAI Label (annotation) → MONAI Core (training) → MONAI Deploy (deployment). It packages your trained model as a MONAI Application Package (MAP) — a standardized, clinical-grade Docker container with DICOM I/O built in.
App SDK. A Python SDK for building inference pipelines as directed acyclic graphs (DAGs) of operators. Each operator handles one step: DICOM loading, preprocessing, model inference, post-processing, DICOM output. You chain them together, and the SDK manages data flow and execution. The first published clinical implementation achieved median inference of 33 seconds with 57/58 successful cases, and the output fed directly into biopsy planning software.
Informatics Gateway. Handles the DICOM networking. Receives DICOM images from PACS via standard DICOM C-STORE, routes them to the right MAP for processing, and sends results back to PACS. Also supports HL7 FHIR for integrating with electronic health records. This is the component that makes your model speak the language of the hospital.
Workflow Manager. Orchestrates multi-step clinical AI workflows. If your pipeline requires multiple models (e.g., skull stripping → tumor segmentation → radiomic feature extraction), the Workflow Manager sequences them, manages intermediate data, and handles failures gracefully.
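The operator-and-DAG idea behind the App SDK is easy to picture in plain Python. The toy below is NOT the MONAI Deploy API, just the data-flow pattern the SDK manages for you, with list arithmetic standing in for image processing:

```python
# Plain-Python illustration of an operator pipeline (a linear DAG); this is
# NOT the MONAI Deploy App SDK API, only the pattern it implements
from typing import Callable, Dict, List

Operator = Callable[[Dict], Dict]

def load_dicom(ctx: Dict) -> Dict:
    ctx["volume"] = [0.0, 0.5, 1.0]            # stand-in for pixel data
    return ctx

def preprocess(ctx: Dict) -> Dict:
    ctx["volume"] = [v * 2 for v in ctx["volume"]]   # toy normalization
    return ctx

def infer(ctx: Dict) -> Dict:
    ctx["mask"] = [1 if v > 1.0 else 0 for v in ctx["volume"]]
    return ctx

def write_dicom_seg(ctx: Dict) -> Dict:
    ctx["output"] = f"SEG with {sum(ctx['mask'])} foreground voxels"
    return ctx

def run_pipeline(operators: List[Operator]) -> Dict:
    ctx: Dict = {}
    for op in operators:     # each operator consumes the previous one's output
        ctx = op(ctx)
    return ctx

result = run_pipeline([load_dicom, preprocess, infer, write_dicom_seg])
print(result["output"])  # → SEG with 1 foreground voxels
```

The real SDK adds what this toy omits: typed operator ports, DICOM-aware I/O operators, and an executor that can run the graph inside a MAP container.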
Your model outputs a NIfTI segmentation mask. The hospital needs DICOM. Bridging this gap is one of the most underappreciated engineering challenges in clinical AI deployment. Three specialized libraries handle this:
highdicom. A high-level Python library that abstracts DICOM encoding complexity. Create DICOM Segmentation objects (SEG) from NumPy arrays in a few lines of code. It handles the metadata correctly — referencing the source images, encoding segment properties, and maintaining DICOM compliance. Published in the Journal of Digital Imaging, it’s the most Pythonic option.
dcmqi. Converts between research formats (NIfTI, NRRD) and DICOM standards (SEG, SR, PM). Available as a command-line tool, Docker image, and 3D Slicer extension. Developed at Harvard’s Surgical Planning Lab and published in Cancer Research. The go-to for converting nnU-Net outputs to DICOM for PACS integration.
PyRaDiSe. Specifically designed for radiation therapy workflows. Converts between NIfTI and DICOM RT Structure Sets (the format radiation oncologists use for tumor contours). Unlike 2D slice-based reconstruction, PyRaDiSe uses 3D-based conversion to avoid pixelated contours. Integrates with any deep learning framework.
Embedded a UNETR model directly into the Visage 7 diagnostic workstation. Brain tumor segmentation completed in ~4 seconds, with 106 radiomic features extracted in 5.8 seconds. Segmentation was available before radiologists opened the study, and they could verify and modify contours within their familiar PACS tools. Achieved median Dice of 86% on internal validation. This is the gold standard for workflow-efficient clinical integration.
Fully automated glioblastoma response assessment using the open-source XNAT platform. Processing triggered automatically after MRI acquisition with no manual intervention. Segmentation masks and longitudinal volume charts pushed back to PACS. Demonstrated robust performance across 34 institutions in the EORTC-26101 trial. Proved that automated volumetric monitoring for RANO assessment is feasible at scale.
End-to-end pipeline for pre- and postoperative glioma segmentation. Total processing: ~10 minutes (routing ~1 min, preprocessing ~6 min, segmentation ~1–2 min, post-processing ~1 min). Achieved median Dice of 0.88/0.89/0.81 for whole tumor/tumor core/enhancing tumor (WT/TC/ET). Demonstrated that clinical-grade latency is achievable with coordinated preprocessing optimization.
Open-source Python system for PACS interaction. Mean AI latency of 9–20 minutes vs clinician latency of 51–66 minutes (P<0.001). Supports both retrospective data retrieval and prospective AI pipeline deployment. Demonstrates that even without deep PACS integration, middleware orchestration can achieve clinically meaningful speed improvements.
Raidionics: open-source software with both GUI and processing backend for CNS tumor segmentation. Models for glioblastomas, lower-grade gliomas, meningiomas, and metastases (pre- and postoperative). Preoperative Dice ~85% with patient-wise recall/precision ~95%. Runs on regular laptops in ~10 minutes without specialized hardware. Includes standardized clinical report generation. The most accessible deployment for individual researchers.
A model that takes 30 minutes to segment one brain isn’t clinically useful. Optimization techniques can dramatically reduce inference time without sacrificing accuracy.
TensorRT: optimizes PyTorch/ONNX models for NVIDIA GPU inference through operator fusion, kernel auto-tuning, and precision calibration. A retinal segmentation model optimized with TensorRT achieved 3.5ms inference — 21× faster than the unoptimized version with no accuracy loss. The total pipeline (acquisition to result) had just 41ms latency.
ONNX Runtime: a cross-platform inference engine. Convert your PyTorch model to ONNX format, then run it on any hardware (NVIDIA, AMD, CPU, mobile). A systematic evaluation showed ONNX Runtime substantially improved runtime across radiology, histopathology, and RGB imaging without compromising model utility. Particularly valuable for deploying on hardware you don’t control.
Quantization: reduce model precision from 32-bit floats to 16-bit (FP16), 8-bit (INT8), or even 2-bit (ternary). EfficientQ achieves post-training quantization in less than 5 minutes on one GPU with one data sample, with superior performance on BraTS 2020. MedQ demonstrated lossless 2-bit quantization on BraTS 2020 — performance equivalent to full precision while enabling boolean arithmetic. This makes deployment on embedded devices or consumer hardware feasible.
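What quantization does mechanically can be shown with a NumPy toy: map float weights to int8 via a scale and zero point, then map them back and measure the error. This is a generic affine quantize/dequantize round trip, not the EfficientQ or MedQ method:

```python
# Toy affine INT8 post-training quantization (quantize -> dequantize) to show
# the precision/size trade-off; NOT the EfficientQ or MedQ algorithms
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 with a per-tensor scale and zero point."""
    scale = (w.max() - w.min()) / 255.0
    zero_point = np.round(-w.min() / scale) - 128
    q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)       # stand-in for a weight tensor
q, scale, zp = quantize_int8(w)
err = np.abs(dequantize(q, scale, zp) - w).max()
print(f"int8 storage is 4x smaller; max abs reconstruction error {err:.4f}")
```

Real schemes add per-channel scales, calibration data to pick clipping ranges, and sometimes quantization-aware fine-tuning to recover the last fraction of Dice.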
In-browser inference: the most radical option is to run inference entirely in the user’s web browser using WebGL/WebGPU, with no server at all. A published implementation deployed a 3D CNN on 256×256×256 CT volumes with 80-second runtime and 1.5GB memory on consumer hardware. PHI never leaves the user’s machine. This approach is emerging but promising for privacy-sensitive deployment.
Deployment isn’t the finish line — it’s the start of a new lifecycle. Models degrade over time as scanners are updated, protocols change, and patient populations shift. You need systems to detect problems before they impact patients.
Medical Machine Learning Operations adapts software engineering’s CI/CD (Continuous Integration/Continuous Deployment) to clinical AI. It provides structured approaches for continuous performance monitoring, systematic validation, simplified model maintenance, and regulatory compliance. The FDA now emphasizes a total product lifecycle approach: plan/design → data collection → model building → verification → deployment → monitoring → real-world evaluation — as a continuous cycle, not a one-time process.
The most advanced published approach uses a 14-day rolling window analysis combining black-box shift estimation (BBSE) and maximum mean discrepancy (MMD) to detect distributional changes without needing ground-truth labels. When drift is detected, the system triggers model updating using data from the previous 60 days. During COVID-19, this continual learning approach improved AUROC by 0.44 compared to a static model (P=0.007). Critically, detection is label-agnostic — you don’t need new expert annotations to know the model is degrading.
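The statistic at the heart of this kind of label-free drift detection, MMD, is compact enough to sketch directly. Below is a biased RBF-kernel estimate of MMD² in NumPy, a toy version of the idea rather than the published monitoring system:

```python
# Squared maximum mean discrepancy (MMD^2) with an RBF kernel: the label-free
# statistic behind distribution-shift detection (a toy sketch)
import numpy as np

def rbf_mmd2(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """Biased estimate of MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        return np.exp(-d2 / (2 * sigma ** 2))                # Gaussian kernel
    return float(k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean())

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(200, 4))  # e.g. image features at launch
same = rng.normal(0.0, 1.0, size=(200, 4))       # new window, no drift
shifted = rng.normal(1.5, 1.0, size=(200, 4))    # scanner/protocol change

print(rbf_mmd2(reference, same))     # near zero: distributions match
print(rbf_mmd2(reference, shifted))  # clearly larger: flag for retraining
```

In practice the inputs would be model embeddings from a rolling window, and the alarm threshold is set by a permutation test rather than eyeballing the value.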
Before full clinical deployment, the DEPLOYR framework recommends silent trials: the model runs on real clinical data but its outputs aren’t shown to clinicians. This allows prospective performance measurement without clinical impact. Studies consistently find that prospectively measured performance differs from retrospective estimates, making silent deployment an essential pre-launch step.