Last week was the “why” of clinical integration. This week is the “how” — the actual tools, containers, APIs, and pipelines that take a trained model out of a Jupyter notebook and make it usable by researchers, collaborators, and eventually clinicians.
There isn’t one way to deploy a segmentation model. The right approach depends on your audience, your infrastructure, and your goals. This week covers four levels, from simplest to most clinical.
Level 1: Interactive demo. An interactive web app where colleagues can upload a NIfTI file and see the segmentation. Takes hours to build, runs on a laptop. Great for papers, grant demos, and collaborator feedback. Not for clinical use.
Level 2: Containerized CLI. Your model, environment, and dependencies in a Docker container with a command-line interface. Anyone can run inference with one command. TotalSegmentator popularized this pattern: TotalSegmentator -i input.nii.gz -o output. Reproducible, shareable, publishable.
Level 3: Orchestrated multi-model pipeline. Run multiple segmentation algorithms, fuse their predictions, and produce a standardized result. The BraTS Toolkit wraps the full pipeline: preprocessing, multi-algorithm segmentation, and fusion. Used for research benchmarking and multi-model ensembling.
Level 4: PACS-integrated service. Your model runs as a DICOM-in, DICOM-out service integrated into the hospital PACS. Radiologists see the segmentation in their normal viewer. Requires DICOM handling, GPU infrastructure, monitoring, and regulatory compliance. The ultimate goal.
The fastest way to make your model accessible. Streamlit turns a Python script into an interactive web application. In under 100 lines of code, you can build an app where users upload a brain MRI, your model runs inference, and the segmentation is displayed as an interactive overlay with axial/sagittal/coronal sliders.
```python
# Minimal Streamlit app for brain tumor segmentation
import tempfile

import nibabel as nib
import numpy as np
import streamlit as st

st.title("Brain Tumor Segmentation Demo")
uploaded = st.file_uploader("Upload NIfTI (T1ce)", type=["nii", "gz"])

if uploaded:
    # nibabel loads from a path, so buffer the upload to a temp file first
    with tempfile.NamedTemporaryFile(suffix=".nii.gz", delete=False) as tmp:
        tmp.write(uploaded.getbuffer())
    nii = nib.load(tmp.name)
    img = nii.get_fdata()

    # Run model (your trained nnU-Net/MONAI model)
    prediction = run_inference(img)  # Your function here

    # Display with interactive slice selector along the chosen axis
    axis = st.selectbox("View", ["Axial", "Sagittal", "Coronal"])
    ax = {"Sagittal": 0, "Coronal": 1, "Axial": 2}[axis]
    slice_idx = st.slider("Slice", 0, img.shape[ax] - 1, img.shape[ax] // 2)

    # Overlay segmentation on image
    fig = create_overlay(img, prediction, slice_idx, axis)  # Your function here
    st.pyplot(fig)

    # Display metrics: voxel volume in mm^3 comes from the NIfTI header
    voxel_volume = np.prod(nii.header.get_zooms()[:3])
    vol_ml = np.sum(prediction > 0) * voxel_volume / 1000
    st.metric("Tumor Volume", f"{vol_ml:.1f} mL")
```
Hosting options:
- Streamlit Cloud: free hosting for public apps. Limited to CPU (no GPU), so inference will be slow for 3D models. Good for demos with lightweight 2D models or pre-computed results.
- Hugging Face Spaces: free hosting with optional GPU access (A10G, T4). Better for real-time inference.
- Institutional server: run on a GPU workstation behind your university’s firewall. Best for processing real (non-public) data.
FastAPI is the modern alternative when you need API-style deployment instead of a visual interface.
Docker packages your model, Python environment, CUDA drivers, and all dependencies into a container that runs identically on any machine. This solves the “works on my laptop” problem. It’s the foundation for MONAI Deploy (MAPs are Docker containers), challenge submissions (BraTS requires Docker), and clinical deployment.
```dockerfile
# Dockerfile for nnU-Net inference
FROM nvcr.io/nvidia/pytorch:23.10-py3

# Install nnU-Net
RUN pip install nnunetv2

# Copy trained model weights
COPY ./nnUNet_results /opt/nnUNet_results

# Set environment variables
ENV nnUNet_results=/opt/nnUNet_results
ENV nnUNet_raw=/opt/nnUNet_raw
ENV nnUNet_preprocessed=/opt/nnUNet_preprocessed

# Entrypoint: run inference
ENTRYPOINT ["nnUNetv2_predict"]
```

```shell
# Build and run
docker build -t brats-segmentation .
docker run --gpus all \
    -v /path/to/input:/input \
    -v /path/to/output:/output \
    brats-segmentation \
    -i /input -o /output -d 001 -c 3d_fullres
```
TotalSegmentator is the gold standard for how to package an nnU-Net model for practical use. It segments more than 100 anatomical structures from CT with a single command: TotalSegmentator -i input.nii.gz -o output. Under the hood, it’s nnU-Net in a pip-installable Python package. The MRI version achieved Dice 0.839 across its 80 structures, and on CT data it nearly matched the original CT model (0.966 vs 0.970). It’s available as both a command-line tool and a web app at totalsegmentator.com. This is the deployment pattern to emulate for your own models.
Dockerfile best practices for clinical AI: pin exact dependency versions (not just pip install nnunetv2 — use pip install nnunetv2==2.5.1). Include health checks. Log all predictions for audit trails. Tag images with the model version and training date. The published “Ten Simple Rules for Writing Dockerfiles” is required reading.

The BraTS Toolkit is a three-component system that standardizes the entire brain tumor segmentation workflow. It was developed specifically to bridge the gap between challenge-winning algorithms and practical use.
Preprocessor. Standardizes raw brain MRI data through the full pipeline: DICOM-to-NIfTI conversion, co-registration of all four modalities to a common space, atlas registration to SRI24 space, skull stripping, and intensity normalization. This produces BraTS-compatible data from any institutional format. It handles the preprocessing that Week 2 covered — but in an automated, reproducible pipeline.
Segmentor. Orchestrates multiple segmentation algorithms on the preprocessed data. Instead of running one model, you can run several BraTS challenge algorithms in parallel, each producing a candidate segmentation. The key insight from the original BraTS benchmark: no single algorithm performs best across all sub-regions. Running multiple models and combining them consistently outperforms any individual approach.
Fusionator. Combines the candidate segmentations into a single consensus result using fusion strategies: majority voting (simplest — each voxel gets the label predicted by the majority of models), SIMPLE fusion (iteratively selects the subset of models whose combination maximizes agreement), and weighted fusion (models weighted by their validation performance). A real-world evaluation found the toolkit performed well for round, well-demarcated tumors (97–100% accuracy for including necrosis and enhancing tumor), though complex infiltrative tumors still benefited from manual correction.
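Majority voting, the simplest of these fusion strategies, fits in a few lines of NumPy. This is a toy illustration of the idea, not the BraTS Toolkit implementation:

```python
# Toy majority-vote fusion of candidate segmentations: each voxel gets
# the label predicted by the most models (ties go to the lowest label)
import numpy as np

def majority_vote(masks: list) -> np.ndarray:
    """Fuse label maps of identical shape by voxel-wise plurality vote."""
    labels = np.unique(np.concatenate([m.ravel() for m in masks]))
    stacked = np.stack(masks)                               # (n_models, *volume)
    votes = np.stack([(stacked == l).sum(axis=0) for l in labels])
    return labels[np.argmax(votes, axis=0)]

# Three 1D "volumes" from three models; the voxel-wise majority wins
a = np.array([0, 1, 1, 2])
b = np.array([0, 1, 2, 2])
c = np.array([1, 1, 2, 0])
print(majority_vote([a, b, c]))  # → [0 1 2 2]
```

SIMPLE and weighted fusion build on the same counting step, but re-weight or drop models based on how well each agrees with the evolving consensus.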
```shell
# BraTS Toolkit workflow (conceptual)
pip install BraTS-Toolkit

# Step 1: Preprocess raw data to BraTS format
brats-preprocess -i /raw/patient001 -o /preprocessed

# Step 2: Run multiple segmentation algorithms
brats-segment -i /preprocessed -o /segmentations

# Step 3: Fuse candidate segmentations
brats-fuse -i /segmentations -o /final \
    --method simple_fusion
```
MONAI Deploy is the final stage of the MONAI ecosystem: MONAI Label (annotation) → MONAI Core (training) → MONAI Deploy (deployment). It packages your trained model as a MONAI Application Package (MAP) — a standardized, clinical-grade Docker container with DICOM I/O built in.
App SDK. A Python SDK for building inference pipelines as directed acyclic graphs (DAGs) of operators. Each operator handles one step: DICOM loading, preprocessing, model inference, post-processing, DICOM output. You chain them together, and the SDK manages data flow and execution. The first published clinical implementation achieved median inference of 33 seconds with 57/58 successful cases, and the output fed directly into biopsy planning software.
Informatics Gateway. Handles the DICOM networking. Receives DICOM images from PACS via standard DICOM C-STORE, routes them to the right MAP for processing, and sends results back to PACS. Also supports HL7 FHIR for integrating with electronic health records. This is the component that makes your model speak the language of the hospital.
Workflow Manager. Orchestrates multi-step clinical AI workflows. If your pipeline requires multiple models (e.g., skull stripping → tumor segmentation → radiomic feature extraction), the Workflow Manager sequences them, manages intermediate data, and handles failures gracefully.
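The operator-and-DAG idea behind the App SDK is easy to picture in plain Python. The toy below is NOT the MONAI Deploy API, just the data-flow pattern the SDK manages for you, with list arithmetic standing in for image processing:

```python
# Plain-Python illustration of an operator pipeline (a linear DAG); this is
# NOT the MONAI Deploy App SDK API, only the pattern it implements
from typing import Callable, Dict, List

Operator = Callable[[Dict], Dict]

def load_dicom(ctx: Dict) -> Dict:
    ctx["volume"] = [0.0, 0.5, 1.0]            # stand-in for pixel data
    return ctx

def preprocess(ctx: Dict) -> Dict:
    ctx["volume"] = [v * 2 for v in ctx["volume"]]   # toy normalization
    return ctx

def infer(ctx: Dict) -> Dict:
    ctx["mask"] = [1 if v > 1.0 else 0 for v in ctx["volume"]]
    return ctx

def write_dicom_seg(ctx: Dict) -> Dict:
    ctx["output"] = f"SEG with {sum(ctx['mask'])} foreground voxels"
    return ctx

def run_pipeline(operators: List[Operator]) -> Dict:
    ctx: Dict = {}
    for op in operators:     # each operator consumes the previous one's output
        ctx = op(ctx)
    return ctx

result = run_pipeline([load_dicom, preprocess, infer, write_dicom_seg])
print(result["output"])  # → SEG with 1 foreground voxels
```

The real SDK adds what this toy omits: typed operator ports, DICOM-aware I/O operators, and an executor that can run the graph inside a MAP container.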
Your model outputs a NIfTI segmentation mask. The hospital needs DICOM. Bridging this gap is one of the most underappreciated engineering challenges in clinical AI deployment. Three specialized libraries handle this:
highdicom. A high-level Python library that abstracts DICOM encoding complexity. Create DICOM Segmentation objects (SEG) from NumPy arrays in a few lines of code. It handles the metadata correctly — referencing the source images, encoding segment properties, and maintaining DICOM compliance. Published in the Journal of Digital Imaging, it’s the most Pythonic option.
dcmqi. Converts between research formats (NIfTI, NRRD) and DICOM standards (SEG, SR, PM). Available as a command-line tool, Docker image, and 3D Slicer extension. Developed at Harvard’s Surgical Planning Lab and published in Cancer Research. The go-to for converting nnU-Net outputs to DICOM for PACS integration.
PyRaDiSe. Specifically designed for radiation therapy workflows. Converts between NIfTI and DICOM RT Structure Sets (the format radiation oncologists use for tumor contours). Unlike 2D slice-based reconstruction, PyRaDiSe uses 3D-based conversion to avoid pixelated contours. Integrates with any deep learning framework.
Embedded a UNETR model directly into the Visage 7 diagnostic workstation. Brain tumor segmentation completed in ~4 seconds, with 106 radiomic features extracted in 5.8 seconds. Segmentation was available before radiologists opened the study, and they could verify and modify contours within their familiar PACS tools. Achieved median Dice of 86% on internal validation. This is the gold standard for workflow-efficient clinical integration.
Fully automated glioblastoma response assessment using the open-source XNAT platform. Processing triggered automatically after MRI acquisition with no manual intervention. Segmentation masks and longitudinal volume charts pushed back to PACS. Demonstrated robust performance across 34 institutions in the EORTC-26101 trial. Proved that automated volumetric monitoring for RANO assessment is feasible at scale.
End-to-end pipeline for pre- and postoperative glioma segmentation. Total processing: ~10 minutes (routing ~1 min, preprocessing ~6 min, segmentation ~1–2 min, post-processing ~1 min). Achieved median Dice of 0.88/0.89/0.81 for whole tumor/tumor core/enhancing tumor (WT/TC/ET). Demonstrated that clinical-grade latency is achievable with coordinated preprocessing optimization.
Open-source Python system for PACS interaction. Mean AI latency of 9–20 minutes vs clinician latency of 51–66 minutes (P<0.001). Supports both retrospective data retrieval and prospective AI pipeline deployment. Demonstrates that even without deep PACS integration, middleware orchestration can achieve clinically meaningful speed improvements.
Raidionics: open-source software with both GUI and processing backend for CNS tumor segmentation. Models for glioblastomas, lower-grade gliomas, meningiomas, and metastases (pre- and postoperative). Preoperative Dice ~85% with patient-wise recall/precision ~95%. Runs on regular laptops in ~10 minutes without specialized hardware. Includes standardized clinical report generation. The most accessible deployment for individual researchers.
A model that takes 30 minutes to segment one brain isn’t clinically useful. Optimization techniques can dramatically reduce inference time without sacrificing accuracy.
TensorRT: optimizes PyTorch/ONNX models for NVIDIA GPU inference through operator fusion, kernel auto-tuning, and precision calibration. A retinal segmentation model optimized with TensorRT achieved 3.5ms inference — 21× faster than the unoptimized version with no accuracy loss. The total pipeline (acquisition to result) had just 41ms latency.
ONNX Runtime: a cross-platform inference engine. Convert your PyTorch model to ONNX format, then run it on any hardware (NVIDIA, AMD, CPU, mobile). A systematic evaluation showed ONNX Runtime substantially improved runtime across radiology, histopathology, and RGB imaging without compromising model utility. Particularly valuable for deploying on hardware you don’t control.
Quantization: reduce model precision from 32-bit floats to 16-bit (FP16), 8-bit (INT8), or even 2-bit (ternary). EfficientQ achieves post-training quantization in less than 5 minutes on one GPU with one data sample, with superior performance on BraTS 2020. MedQ demonstrated lossless 2-bit quantization on BraTS 2020 — performance equivalent to full precision while enabling boolean arithmetic. This makes deployment on embedded devices or consumer hardware feasible.
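What quantization does mechanically can be shown with a NumPy toy: map float weights to int8 via a scale and zero point, then map them back and measure the error. This is a generic affine quantize/dequantize round trip, not the EfficientQ or MedQ method:

```python
# Toy affine INT8 post-training quantization (quantize -> dequantize) to show
# the precision/size trade-off; NOT the EfficientQ or MedQ algorithms
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 with a per-tensor scale and zero point."""
    scale = (w.max() - w.min()) / 255.0
    zero_point = np.round(-w.min() / scale) - 128
    q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)       # stand-in for a weight tensor
q, scale, zp = quantize_int8(w)
err = np.abs(dequantize(q, scale, zp) - w).max()
print(f"int8 storage is 4x smaller; max abs reconstruction error {err:.4f}")
```

Real schemes add per-channel scales, calibration data to pick clipping ranges, and sometimes quantization-aware fine-tuning to recover the last fraction of Dice.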
In-browser inference: the most radical option is to run inference entirely in the user’s web browser using WebGL/WebGPU, with no server at all. A published implementation deployed a 3D CNN on 256×256×256 CT volumes with 80-second runtime and 1.5GB memory on consumer hardware. PHI never leaves the user’s machine. This approach is emerging but promising for privacy-sensitive deployment.
Deployment isn’t the finish line — it’s the start of a new lifecycle. Models degrade over time as scanners are updated, protocols change, and patient populations shift. You need systems to detect problems before they impact patients.
Medical Machine Learning Operations adapts software engineering’s CI/CD (Continuous Integration/Continuous Deployment) to clinical AI. It provides structured approaches for continuous performance monitoring, systematic validation, simplified model maintenance, and regulatory compliance. The FDA now emphasizes a total product lifecycle approach: plan/design → data collection → model building → verification → deployment → monitoring → real-world evaluation — as a continuous cycle, not a one-time process.
The most advanced published approach uses a 14-day rolling window analysis combining black-box shift estimation (BBSE) and maximum mean discrepancy (MMD) to detect distributional changes without needing ground-truth labels. When drift is detected, the system triggers model updating using data from the previous 60 days. During COVID-19, this continual learning approach improved AUROC by 0.44 compared to a static model (P=0.007). Critically, detection is label-agnostic — you don’t need new expert annotations to know the model is degrading.
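The statistic at the heart of this kind of label-free drift detection, MMD, is compact enough to sketch directly. Below is a biased RBF-kernel estimate of MMD² in NumPy, a toy version of the idea rather than the published monitoring system:

```python
# Squared maximum mean discrepancy (MMD^2) with an RBF kernel: the label-free
# statistic behind distribution-shift detection (a toy sketch)
import numpy as np

def rbf_mmd2(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """Biased estimate of MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        return np.exp(-d2 / (2 * sigma ** 2))                # Gaussian kernel
    return float(k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean())

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(200, 4))  # e.g. image features at launch
same = rng.normal(0.0, 1.0, size=(200, 4))       # new window, no drift
shifted = rng.normal(1.5, 1.0, size=(200, 4))    # scanner/protocol change

print(rbf_mmd2(reference, same))     # near zero: distributions match
print(rbf_mmd2(reference, shifted))  # clearly larger: flag for retraining
```

In practice the inputs would be model embeddings from a rolling window, and the alarm threshold is set by a permutation test rather than eyeballing the value.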
Before full clinical deployment, the DEPLOYR framework recommends silent trials: the model runs on real clinical data but its outputs aren’t shown to clinicians. This allows prospective performance measurement without clinical impact. Studies consistently find that prospectively measured performance differs from retrospective estimates, making silent deployment an essential pre-launch step.