Week 10 — Q4

Deploying Your Model: MONAI Deploy, Streamlit, BraTS Toolkit & Beyond

Last week was the “why” of clinical integration. This week is the “how” — the actual tools, containers, APIs, and pipelines that take a trained model out of a Jupyter notebook and make it usable by researchers, collaborators, and eventually clinicians.

The Deployment Spectrum

There isn’t one way to deploy a segmentation model. The right approach depends on your audience, your infrastructure, and your goals. This week covers four levels, from simplest to most clinical.

LEVEL 1
Research Demo (Streamlit / Gradio)

An interactive web app where colleagues can upload a NIfTI file and see the segmentation. Takes hours to build, runs on a laptop. Great for papers, grant demos, and collaborator feedback. Not for clinical use.

LEVEL 2
Reproducible Package (Docker + CLI)

Your model, environment, and dependencies in a Docker container with a command-line interface. Anyone can run inference with one command. TotalSegmentator popularized this pattern: TotalSegmentator -i input.nii.gz -o output. Reproducible, shareable, publishable.

LEVEL 3
Algorithm Orchestration (BraTS Toolkit)

Run multiple segmentation algorithms, fuse their predictions, and produce a standardized result. The BraTS Toolkit wraps the full pipeline: preprocessing, multi-algorithm segmentation, and fusion. Used for research benchmarking and multi-model ensembling.

LEVEL 4
Clinical Integration (MONAI Deploy / PACS Pipeline)

Your model runs as a DICOM-in, DICOM-out service integrated into the hospital PACS. Radiologists see the segmentation in their normal viewer. Requires DICOM handling, GPU infrastructure, monitoring, and regulatory compliance. The ultimate goal.

4 sec: Segmentation time for PACS-integrated brain tumor AI at Yale
33 sec: Median inference for MONAI Deploy Express prostate segmentation
80 sec: Browser-based 3D CNN inference on consumer hardware (no server)

Streamlit & Gradio: Build It This Week

The fastest way to make your model accessible. Streamlit turns a Python script into an interactive web application. In under 100 lines of code, you can build an app where users upload a brain MRI, your model runs inference, and the segmentation is displayed as an interactive overlay with axial/sagittal/coronal sliders.

# Minimal Streamlit app for brain tumor segmentation
import tempfile

import nibabel as nib
import numpy as np
import streamlit as st

st.title("Brain Tumor Segmentation Demo")

uploaded = st.file_uploader("Upload NIfTI (T1ce)", type=["nii", "gz"])
if uploaded:
    # nibabel loads from a path, so buffer the upload to a temp file first
    with tempfile.NamedTemporaryFile(suffix=".nii.gz") as tmp:
        tmp.write(uploaded.getvalue())
        tmp.flush()
        nii = nib.load(tmp.name)
        img = nii.get_fdata()

    # Run model (your trained nnU-Net/MONAI model)
    prediction = run_inference(img)  # your function here

    # Interactive slice selector along the chosen axis (assumes RAS orientation)
    axis = st.selectbox("View", ["Axial", "Sagittal", "Coronal"])
    axis_idx = {"Sagittal": 0, "Coronal": 1, "Axial": 2}[axis]
    n_slices = img.shape[axis_idx]
    slice_idx = st.slider("Slice", 0, n_slices - 1, n_slices // 2)

    # Overlay segmentation on image
    fig = create_overlay(img, prediction, slice_idx, axis)  # your function here
    st.pyplot(fig)

    # Voxel volume in mm^3 from the affine; report tumor volume in mL
    voxel_volume = abs(np.linalg.det(nii.affine[:3, :3]))
    vol_ml = np.sum(prediction > 0) * voxel_volume / 1000
    st.metric("Tumor Volume", f"{vol_ml:.1f} mL")

Deployment Options

Streamlit Cloud: Free hosting for public apps. Limited to CPU (no GPU), so inference will be slow for 3D models. Good for demos with lightweight 2D models or pre-computed results.
Hugging Face Spaces: Free hosting with optional GPU access (A10G, T4). Better for real-time inference.
Institutional server: Run on a GPU workstation behind your university's firewall. Best for processing real (non-public) data.
FastAPI: The modern alternative when you need API-style deployment instead of a visual interface.

⚠️
Streamlit is NOT for clinical use. It doesn’t handle DICOM, doesn’t meet regulatory requirements, has no audit trail, and isn’t designed for healthcare security standards. Use it for: demos to collaborators, paper figures, grant applications, conference presentations, and proof-of-concept validation. One published brain tumor web app included a feedback facility that let clinicians refine results, which were then used to retrain the model — a powerful research loop.

Docker: The Foundation of Everything

Docker packages your model, Python environment, CUDA drivers, and all dependencies into a container that runs identically on any machine. This solves the “works on my laptop” problem. It’s the foundation for MONAI Deploy (MAPs are Docker containers), challenge submissions (BraTS requires Docker), and clinical deployment.

# Dockerfile for nnU-Net inference
FROM nvcr.io/nvidia/pytorch:23.10-py3

# Install nnU-Net
RUN pip install nnunetv2

# Copy trained model weights
COPY ./nnUNet_results /opt/nnUNet_results

# Set environment variables
ENV nnUNet_results=/opt/nnUNet_results
ENV nnUNet_raw=/opt/nnUNet_raw
ENV nnUNet_preprocessed=/opt/nnUNet_preprocessed

# Entrypoint: run inference
ENTRYPOINT ["nnUNetv2_predict"]

# Build and run
docker build -t brats-segmentation .
docker run --gpus all \
  -v /path/to/input:/input \
  -v /path/to/output:/output \
  brats-segmentation \
  -i /input -o /output -d 001 -c 3d_fullres

The TotalSegmentator Model

TotalSegmentator is the gold standard for how to package an nnU-Net model for practical use: it segments dozens of anatomical structures from a single command — TotalSegmentator -i input.nii.gz -o output. Under the hood, it's nnU-Net in a pip-installable Python package. The MRI version segments 80 structures at Dice 0.839, and on CT data it nearly matched the dedicated CT model (Dice 0.966 vs 0.970). It's available as both a command-line tool and a web app at totalsegmentator.com. This is the deployment pattern to emulate for your own models.
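The pattern itself is easy to reproduce. Here is a minimal sketch of the same one-command CLI shape using only argparse; the names mymodel-segment and run_model are hypothetical placeholders for your own package, not part of TotalSegmentator.

```python
# Sketch of a TotalSegmentator-style one-command CLI (all names hypothetical)
import argparse

def run_model(in_path, out_path):
    # Placeholder: call your nnU-Net/MONAI inference code here
    return f"segmented {in_path} -> {out_path}"

def main(argv=None):
    parser = argparse.ArgumentParser(prog="mymodel-segment")
    parser.add_argument("-i", "--input", required=True, help="input NIfTI file")
    parser.add_argument("-o", "--output", required=True, help="output directory")
    args = parser.parse_args(argv)
    return run_model(args.input, args.output)

result = main(["-i", "t1.nii.gz", "-o", "out"])
# → "segmented t1.nii.gz -> out"
```

Registering main as a console entry point in pyproject.toml ([project.scripts]) is what turns this into a single installable command, mirroring the TotalSegmentator -i ... -o ... interface.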

💡
Docker best practices for medical AI: Use multi-stage builds to keep image sizes small. Pin every dependency version (don’t use pip install nnunetv2 — use pip install nnunetv2==2.5.1). Include health checks. Log all predictions for audit trails. Tag images with the model version and training date. The published “Ten Simple Rules for Writing Dockerfiles” is required reading.

BraTS Toolkit: Preprocessing, Segmentation & Fusion

The BraTS Toolkit is a three-component system that standardizes the entire brain tumor segmentation workflow. It was developed specifically to bridge the gap between challenge-winning algorithms and practical use.

BraTS Preprocessor

Standardizes raw brain MRI data through the full pipeline: DICOM-to-NIfTI conversion, co-registration of all four modalities to a common space, atlas registration to SRI24 space, skull stripping, and intensity normalization. This produces BraTS-compatible data from any institutional format. It handles the preprocessing that Week 2 covered — but in an automated, reproducible pipeline.

BraTS Segmentor

Orchestrates multiple segmentation algorithms on the preprocessed data. Instead of running one model, you can run several BraTS challenge algorithms in parallel, each producing a candidate segmentation. The key insight from the original BraTS benchmark: no single algorithm performs best across all sub-regions. Running multiple models and combining them consistently outperforms any individual approach.

BraTS Fusionator

Combines the candidate segmentations into a single consensus result using fusion strategies: majority voting (simplest — each voxel gets the label predicted by the majority of models), SIMPLE fusion (iteratively selects the subset of models whose combination maximizes agreement), and weighted fusion (models weighted by their validation performance). A real-world evaluation found the toolkit performed well for round, well-demarcated tumors (97–100% accuracy for including necrosis and enhancing tumor), though complex infiltrative tumors still benefited from manual correction.
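To make the simplest strategy concrete, here is a minimal majority-voting sketch over candidate label maps, written in plain NumPy. It illustrates the idea only; it is not the Fusionator's actual implementation.

```python
# Toy majority-vote fusion: each voxel takes the most frequent label
# across the candidate segmentations.
import numpy as np

def majority_vote(segmentations):
    """Fuse a list of equally-shaped integer label maps voxel-wise."""
    stacked = np.stack(segmentations)              # (n_models, *volume_shape)
    labels = np.unique(stacked)
    # Count votes per label, then pick the winning label at each voxel
    votes = np.stack([(stacked == lab).sum(axis=0) for lab in labels])
    return labels[np.argmax(votes, axis=0)]

# Three 2x2 "segmentations" standing in for full 3D volumes
a = np.array([[0, 1], [2, 2]])
b = np.array([[0, 1], [1, 2]])
c = np.array([[0, 0], [1, 2]])
fused = majority_vote([a, b, c])
# fused == [[0, 1], [1, 2]]
```

Ties resolve toward the lowest label here (a design choice of np.argmax); SIMPLE and weighted fusion replace the raw vote count with iterative model selection or performance-based weights.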

# BraTS Toolkit workflow (conceptual)
pip install BraTS-Toolkit

# Step 1: Preprocess raw data to BraTS format
brats-preprocess -i /raw/patient001 -o /preprocessed

# Step 2: Run multiple segmentation algorithms
brats-segment -i /preprocessed -o /segmentations

# Step 3: Fuse candidate segmentations
brats-fuse -i /segmentations -o /final \
  --method simple_fusion

MONAI Deploy: From Research to Clinic

MONAI Deploy is the final stage of the MONAI ecosystem: MONAI Label (annotation) → MONAI Core (training) → MONAI Deploy (deployment). It packages your trained model as a MONAI Application Package (MAP) — a standardized, clinical-grade Docker container with DICOM I/O built in.

The Three Components

MONAI Deploy App SDK

A Python SDK for building inference pipelines as directed acyclic graphs (DAGs) of operators. Each operator handles one step: DICOM loading, preprocessing, model inference, post-processing, DICOM output. You chain them together, and the SDK manages data flow and execution. The first published clinical implementation achieved median inference of 33 seconds with 57/58 successful cases, and the output fed directly into biopsy planning software.
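The DAG-of-operators idea can be illustrated with a toy pipeline in plain Python. This is a conceptual sketch, not the actual App SDK API; the real SDK's Application and Operator classes additionally handle DICOM data types, operator I/O typing, and execution for you.

```python
# Conceptual sketch of the operator-pipeline pattern behind MONAI Deploy
# (illustrative only; not the real SDK API).
class Operator:
    """One pipeline step: takes a context dict, returns it updated."""
    def compute(self, ctx):
        raise NotImplementedError

class LoadSeries(Operator):
    def compute(self, ctx):
        ctx["volume"] = [0.2, 0.8, 0.5]   # stand-in for DICOM series loading
        return ctx

class Infer(Operator):
    def compute(self, ctx):
        # stand-in for model inference: threshold the "volume"
        ctx["mask"] = [v > 0.5 for v in ctx["volume"]]
        return ctx

class WriteSeg(Operator):
    def compute(self, ctx):
        ctx["written"] = True             # stand-in for DICOM SEG output
        return ctx

def run_app(operators):
    """The 'application' sequences operators and threads data through."""
    ctx = {}
    for op in operators:
        ctx = op.compute(ctx)
    return ctx

result = run_app([LoadSeries(), Infer(), WriteSeg()])
# result["mask"] == [False, True, False] on this stand-in data
```

The value of the real SDK is everything this sketch omits: typed operator ports, DICOM-aware built-in operators, and a runtime that packages the whole graph as a MAP.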

MONAI Deploy Informatics Gateway

Handles the DICOM networking. Receives DICOM images from PACS via standard DICOM C-STORE, routes them to the right MAP for processing, and sends results back to PACS. Also supports HL7 FHIR for integrating with electronic health records. This is the component that makes your model speak the language of the hospital.

MONAI Deploy Workflow Manager

Orchestrates multi-step clinical AI workflows. If your pipeline requires multiple models (e.g., skull stripping → tumor segmentation → radiomic feature extraction), the Workflow Manager sequences them, manages intermediate data, and handles failures gracefully.

📚
Why MONAI Deploy over a custom Docker container? You could build your own DICOM-in/DICOM-out Docker container from scratch. But MONAI Deploy provides standardized DICOM I/O operators, tested integration with PACS systems, a workflow management layer, and a growing community of validated MAPs. It’s the difference between building a house from lumber vs. building from pre-fabricated components. For a student project, either works. For clinical deployment, the standardization matters.

DICOM Conversion: The Glue Layer

Your model outputs a NIfTI segmentation mask. The hospital needs DICOM. Bridging this gap is one of the most underappreciated engineering challenges in clinical AI deployment. Three specialized libraries handle this:

highdicom

A high-level Python library that abstracts DICOM encoding complexity. Create DICOM Segmentation objects (SEG) from NumPy arrays in a few lines of code. It handles the metadata correctly — referencing the source images, encoding segment properties, and maintaining DICOM compliance. Published in the Journal of Digital Imaging, it’s the most Pythonic option.

dcmqi (DICOM for Quantitative Imaging)

Converts between research formats (NIfTI, NRRD) and DICOM standards (SEG, SR, PM). Available as a command-line tool, Docker image, and 3D Slicer extension. Developed at Harvard’s Surgical Planning Lab and published in Cancer Research. The go-to for converting nnU-Net outputs to DICOM for PACS integration.

PyRaDiSe (Python DICOM-RT)

Specifically designed for radiation therapy workflows. Converts between NIfTI and DICOM RT Structure Sets (the format radiation oncologists use for tumor contours). Unlike 2D slice-based reconstruction, PyRaDiSe uses 3D-based conversion to avoid pixelated contours. Integrates with any deep learning framework.

💡
Which format? DICOM SEG is the standard for encoding voxel-level segmentation masks. Used when feeding into PACS viewers. DICOM RT Structure Set is the standard in radiation therapy for encoding contours. Used when feeding into treatment planning systems. DICOM Secondary Capture is a screenshot-like format for visualization only (lowest utility but simplest). Choose based on your downstream consumer.

Published Deployment Case Studies

Yale PACS-Integrated System (2022)

Embedded a UNETR model directly into the Visage 7 diagnostic workstation. Brain tumor segmentation completed in ~4 seconds, with 106 radiomic features extracted in 5.8 seconds. Segmentation was available before radiologists opened the study, and they could verify and modify contours within their familiar PACS tools. Achieved median Dice of 86% on internal validation. This is the gold standard for workflow-efficient clinical integration.

Heidelberg XNAT Pipeline (2019)

Fully automated glioblastoma response assessment using XNAT open-source platform. Processing triggered automatically after MRI acquisition with no manual intervention. Segmentation masks and longitudinal volume charts pushed back to PACS. Demonstrated robust performance across 34 institutions in the EORTC-26101 trial. Proved that automated volumetric monitoring for RANO assessment is feasible at scale.

NYU Deep Learning Pipeline (2022)

End-to-end pipeline for pre- and postoperative glioma segmentation. Total processing: ~10 minutes (routing ~1 min, preprocessing ~6 min, segmentation ~1–2 min, post-processing ~1 min). Achieved median Dice of 0.88/0.89/0.81 for WT/TC/ET. Demonstrated that clinical-grade latency is achievable with coordinated preprocessing optimization.

DIANA Orchestration System (2021)

Open-source Python system for PACS interaction. Mean AI latency of 9–20 minutes vs clinician latency of 51–66 minutes (P<0.001). Supports both retrospective data retrieval and prospective AI pipeline deployment. Demonstrates that even without deep PACS integration, middleware orchestration can achieve clinically meaningful speed improvements.

Raidionics (2023)

Open-source software with both GUI and processing backend for CNS tumor segmentation. Models for glioblastomas, lower-grade gliomas, meningiomas, and metastases (pre- and postoperative). Preoperative Dice ~85% with patient-wise recall/precision ~95%. Runs on regular laptops in ~10 minutes without specialized hardware. Includes standardized clinical report generation. The most accessible deployment for individual researchers.

Inference Optimization

A model that takes 30 minutes to segment one brain isn’t clinically useful. Optimization techniques can dramatically reduce inference time without sacrificing accuracy.

TensorRT (NVIDIA)

Optimizes PyTorch/ONNX models for NVIDIA GPU inference through operator fusion, kernel auto-tuning, and precision calibration. A retinal segmentation model optimized with TensorRT achieved 3.5ms inference — 21× faster than the unoptimized version with no accuracy loss. The total pipeline (acquisition to result) had just 41ms latency.

ONNX Runtime

Cross-platform inference engine. Convert your PyTorch model to ONNX format, then run it on any hardware (NVIDIA, AMD, CPU, mobile). A systematic evaluation showed ONNX Runtime substantially improved runtime across radiology, histopathology, and RGB imaging without compromising model utility. Particularly valuable for deploying on hardware you don’t control.

Model Quantization

Reduce model precision from 32-bit floats to 16-bit (FP16), 8-bit (INT8), or even 2-bit (ternary). EfficientQ achieves post-training quantization in less than 5 minutes on one GPU with one data sample, with superior performance on BraTS 2020. MedQ demonstrated lossless 2-bit quantization on BraTS 2020 — performance equivalent to full precision while enabling boolean arithmetic. This makes deployment on embedded devices or consumer hardware feasible.
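The core mechanic of post-training weight quantization is simple to sketch: store low-precision integers plus a floating-point scale, and dequantize on the fly. Below is a toy symmetric per-tensor INT8 scheme, not EfficientQ or MedQ themselves.

```python
# Toy post-training INT8 quantization of a weight tensor
# (symmetric, per-tensor scale; illustrative only).
import numpy as np

def quantize_int8(w):
    # Map the largest weight magnitude onto the int8 range [-127, 127]
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.02, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Reconstruction error is bounded by scale/2 per weight
```

Real schemes refine this with per-channel scales, calibration data for activations, and quantization-aware fine-tuning, but the storage and compute savings come from exactly this substitution.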

Browser-Based Edge Computing

The most radical option: run inference entirely in the user’s web browser using WebGL/WebGPU, with no server at all. A published implementation deployed a 3D CNN on 256×256×256 CT volumes with 80-second runtime and 1.5GB memory on consumer hardware. PHI never leaves the user’s machine. This approach is emerging but promising for privacy-sensitive deployment.

Continuous Monitoring & MLOps

Deployment isn’t the finish line — it’s the start of a new lifecycle. Models degrade over time as scanners are updated, protocols change, and patient populations shift. You need systems to detect problems before they impact patients.

MedMLOps Framework

Medical Machine Learning Operations adapts software engineering’s CI/CD (Continuous Integration/Continuous Deployment) to clinical AI. It provides structured approaches for continuous performance monitoring, systematic validation, simplified model maintenance, and regulatory compliance. The FDA now emphasizes a total product lifecycle approach: plan/design → data collection → model building → verification → deployment → monitoring → real-world evaluation — as a continuous cycle, not a one-time process.

Drift Detection

The most advanced published approach uses a 14-day rolling window analysis combining black-box shift estimation (BBSE) and maximum mean discrepancy (MMD) to detect distributional changes without needing ground-truth labels. When drift is detected, the system triggers model updating using data from the previous 60 days. During COVID-19, this continual learning approach improved AUROC by 0.44 compared to a static model (P=0.007). Critically, detection is label-agnostic — you don’t need new expert annotations to know the model is degrading.
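To give a flavor of what label-free drift detection computes, here is a minimal RBF-kernel MMD sketch (the biased two-sample estimate). The published system combines a statistic like this with BBSE and rolling windows; the feature dimensions and window sizes below are illustrative.

```python
# Minimal MMD^2 drift statistic between a reference window and a current
# window of model-input features; no ground-truth labels required.
import numpy as np

def rbf(a, b, sigma=1.0):
    # Pairwise squared distances -> Gaussian kernel matrix
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased MMD^2 estimate between samples x and y (n x d arrays)."""
    return rbf(x, x, sigma).mean() + rbf(y, y, sigma).mean() \
        - 2 * rbf(x, y, sigma).mean()

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, (200, 5))       # reference window features
same = rng.normal(0.0, 1.0, (200, 5))      # current window, no drift
shifted = rng.normal(1.5, 1.0, (200, 5))   # current window, drifted
# mmd2(ref, same) stays near zero; mmd2(ref, shifted) is clearly larger
```

In practice you would threshold the statistic (e.g. via a permutation test) and trigger retraining when it exceeds the threshold, as the 60-day update rule above does.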

Silent Deployment

Before full clinical deployment, the DEPLOYR framework recommends silent trials: the model runs on real clinical data but its outputs aren’t shown to clinicians. This allows prospective performance measurement without clinical impact. Studies consistently find that prospectively measured performance differs from retrospective estimates, making silent deployment an essential pre-launch step.

This Week’s Learning Resources

Build These This Week

MONAI Deploy App SDK documentation: Official docs for building MONAI Application Packages. Work through the “Creating a Simple App” tutorial. This is the canonical path for packaging your model for clinical deployment.
Streamlit: Build a brain tumor segmentation demo this week. Upload NIfTI, display axial/sagittal/coronal slices with segmentation overlay, compute and display tumor volume. Under 100 lines of code.
BraTS Toolkit: Install and run the three-component pipeline on your BraTS data. Compare single-model prediction to fused multi-model predictions to see the value of algorithm orchestration.
highdicom: Convert your NIfTI segmentation output to a DICOM SEG object. Essential for any PACS integration. The most Pythonic way to create standards-compliant DICOM.
TotalSegmentator: Study this as a deployment template. See how they wrapped nnU-Net into a pip-installable CLI tool with simple one-command inference. Emulate this pattern for your own model.

Key Papers

The Yale PACS-integrated system: 4-second segmentation, 106 radiomic features in 5.8 seconds, results available before the radiologist opens the study. The benchmark for clinical integration.
Front Neurosci. 2022;16:860208
The three-component preprocessing/segmentation/fusion toolkit. Essential reading for understanding algorithm orchestration and standardized BraTS data handling.
Front Neurosci. 2020;14:125
The highdicom library paper. How to encode segmentation masks, structured reports, and parametric maps in standards-compliant DICOM from Python.
J Digit Imaging. 2022;35:1719–1737
80-structure MRI segmentation packaged as a one-command tool. Dice 0.839 across 80 structures. The deployment model to emulate.
Radiology. 2025;314(1):e241613
Open-source software that runs on laptops with Dice ~85%. The most accessible deployment path for individual researchers without GPU infrastructure.
Sci Rep. 2023;13:9631

Deep Dives

“Ten Simple Rules for Writing Dockerfiles”: Foundational guidelines for reproducible container-based deployment. Follow these rules for every Dockerfile you write.
First published MONAI Deploy clinical implementation: 33-second median inference, PACS integration, downstream biopsy planning. The evidence that MONAI Deploy works in practice.
The Heidelberg XNAT pipeline: fully automated, manufacturer-neutral, validated across 34 trial institutions. Proved that automated RANO monitoring is feasible at scale.
EfficientQ: Quantize a 3D U-Net in under 5 minutes with one GPU and one data sample. Enables deployment on resource-constrained hardware.