Week 09 — Q3

Integrating AI Into the Radiology Workflow

You can build a model that achieves 0.93 Dice. But can you make it actually help a patient? This week bridges the gap between technical performance and clinical impact — covering regulatory approval, PACS integration, automation bias, deployment frameworks, and the ethical responsibilities that come with putting AI in front of a radiologist.

The Gap Between Research and Clinical Impact

The brain tumor segmentation field has achieved impressive technical metrics. But there is a stark gap between what models achieve on challenge leaderboards and what actually reaches patients. Understanding this gap is the first step toward closing it.

950+: FDA-cleared AI/ML medical devices as of 2025 — but very few for brain tumor segmentation
5%: share of FDA-cleared AI devices that underwent prospective clinical testing
97%: share cleared via 510(k) — no independent clinical evidence required

The number of cleared AI devices has exploded — from 33 between 1995–2015 to 221 in 2023 alone. But quantity doesn’t equal clinical adoption. Of 717 radiology AI devices with documentation, only 33 included prospective testing, 56 included human-in-the-loop evaluation, and only 15 combined both. Most evidence remains retrospective. A critical assessment noted that “clinical utility remains to be demonstrated” for the majority of neuroimaging AI tools.

The Regulatory Landscape

United States: FDA Pathways

510(k) (Most Common — 97% of AI devices)

Demonstrates “substantial equivalence” to an existing approved device (a predicate). Does not require independent clinical evidence — performance can be shown by comparison to the predicate. Faster than other pathways, but a significant limitation: each generation of device references the previous, so incremental risks can accumulate without rigorous independent testing. The overwhelming majority of medical AI reaches market this way.

De Novo (Novel Devices)

For devices with no suitable predicate but low-to-moderate risk. Requires more evidence than 510(k) but less than PMA. Only 22 AI/ML devices have used this pathway. Creates a new predicate that future 510(k) submissions can reference.

Predetermined Change Control Plans

AI models can change over time (retraining, adapting to new data). Traditional regulation assumes static devices. The FDA now allows manufacturers to specify anticipated modifications in advance, defining “regions of potential changes” with monitoring plans. This is critical for adaptive AI that learns from deployment data.

European Union: MDR + AI Act

The EU requires dual compliance. The Medical Device Regulation (MDR 2017) classifies most diagnostic AI as Class IIa or IIb. The AI Act (2024) introduces a risk-based framework where most medical AI is classified as “high-risk,” requiring risk management systems, data governance, transparency, human oversight, and ongoing monitoring. Unlike the FDA, no centralized public database of CE-marked AI devices exists in the EU, making the European landscape harder to track. Despite structural differences, EU, US, and Chinese regulations converge on core principles: patient data protection, risk classification, transparency, bias mitigation, and human oversight.

Clinical Validation: Beyond Dice Scores

Getting a high Dice score on BraTS is technical validation. Proving your tool helps patients requires three levels of evidence:

LEVEL 1
Technical Validity

Does the algorithm produce accurate segmentations on representative external test sets? This is what Dice, HD95, and challenge leaderboards measure. Necessary but not sufficient. Most FDA submissions stop here.
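To keep these metrics concrete, the Dice coefficient can be computed on flat binary masks in a few lines of plain Python. This is a minimal sketch for illustration (the `dice` helper is not a library function; real pipelines use validated implementations over 3D volumes):

```python
def dice(pred, truth):
    """Dice coefficient for flat binary masks (sequences of 0/1).

    Dice = 2|P ∩ T| / (|P| + |T|); by convention two empty masks score 1.0.
    """
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    denom = sum(pred) + sum(truth)
    return 2.0 * inter / denom if denom else 1.0
```

For example, a prediction covering one of a lesion's two voxels plus nothing else yields Dice 2·1/(2+1) ≈ 0.67, which is roughly the whole-tumor range reported for cleared clinical tools below.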

LEVEL 2
Clinical Validity

Does it work in real-world clinical scenarios with real patient populations? This requires prospective, multi-center studies with data from diverse scanners, protocols, and demographics. Only 33.8% of FDA-cleared devices have multi-center clinical data. Performance often degrades on external data due to scanner differences, protocol variations, and population shifts.

LEVEL 3
Clinical Utility

Does it actually improve patient outcomes? This is the hardest to prove. Requires randomized controlled trials measuring outcomes like survival, quality of life, treatment accuracy, or diagnostic errors prevented. Almost no brain tumor AI tools have reached this level of evidence. It is the frontier of the field.

⚠️
The algorithmic bias problem: AI models exhibit performance disparities across demographics, institutions, and scanner types. Chest X-ray AI classifiers were found to consistently underdiagnose underserved populations (female, Black, Hispanic patients). Brain tumor segmentation models show degraded performance on external hospital data. Models can infer race, sex, and age from medical images even when not provided as inputs, and may use these as “shortcuts” rather than learning true pathology.

FDA-Cleared Brain Tumor AI Tools

VBrain (Vysioneer Inc.)

FDA-cleared deep learning tool for brain tumor detection and contouring in stereotactic radiosurgery. Validated on 100 patients with 435 brain metastases: Dice 0.723 for whole tumor, 89.3% lesion-wise sensitivity overall (99.1% for lesions ≥10mm, 96.2% for ≥5mm), with 1.52 false positives per patient. Cleared via 510(k).

NeuroSAFE (Incepto Medical — CE-Marked)

Validated prospectively across 6 French centers on 459 consecutive patients. Sensitivity 90.2% vs 81.9% for neuroradiologists alone (P<0.001). Reduced false negatives by 47% as an AI-assisted second reader. Particularly beneficial for small metastases (<5mm). Demonstrates the value of prospective, multi-center validation beyond retrospective challenge benchmarks.

Note how these Dice scores (0.72) are significantly lower than BraTS challenge numbers (0.90+). This reflects the gap between curated challenge data and real-world clinical data with diverse scanners, protocols, and tumor presentations. It’s a reminder that challenge performance is a starting point, not the destination.
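Lesion-wise metrics like those in the VBrain validation work differently from voxel-level Dice: each lesion counts as found or missed as a whole. A minimal sketch, assuming a lesion counts as detected when the predicted mask overlaps it by some fraction (the helper names and the 0.1 threshold are illustrative, not the trial's criteria):

```python
def lesion_sensitivity(per_lesion_overlap, min_overlap=0.1):
    """Fraction of ground-truth lesions whose overlap with the predicted
    mask meets the detection threshold (threshold is illustrative)."""
    detected = sum(1 for o in per_lesion_overlap if o >= min_overlap)
    return detected / len(per_lesion_overlap)

def fp_per_patient(n_false_positive_lesions, n_patients):
    """False-positive predicted lesions averaged over the cohort."""
    return n_false_positive_lesions / n_patients
```

This is why a tool can post a modest 0.72 Dice yet ~90% lesion-wise sensitivity: a partially covered lesion hurts Dice but still counts as detected.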

Integration Into the Radiology Workflow

A segmentation model is useless if it doesn’t fit into how radiologists actually work. Clinical radiology runs on DICOM (the universal medical image format) and PACS (Picture Archiving and Communication Systems). Your model must speak these languages.

How AI Fits Into the PACS Workflow

01
Image Acquisition

MRI scanner produces DICOM images and sends them to PACS. No AI involvement yet.

02
AI Processing

A DICOM listener detects the new study and routes it to a GPU server running the segmentation model. The model converts DICOM to NIfTI internally, runs inference, and converts the segmentation back to DICOM SEG or DICOM RT Structure Set format. Well-designed systems achieve 9–20 minute latency from scan completion to AI result, significantly faster than the 51–66 minute clinician latency.

03
Result Delivery

The AI segmentation is sent back to PACS and appears as an overlay on the radiologist’s viewer. It can also feed structured reports via HL7 FHIR, trigger notifications for abnormal findings, or populate worklist priorities.

04
Radiologist Review

The radiologist views the AI segmentation, verifies or modifies it, and incorporates it into their clinical report. The AI assists but does not replace the human decision-maker.
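The study-detection logic of step 02 can be sketched in miniature. This assumes new studies land as subdirectories of an incoming folder; the `find_new_studies` helper is hypothetical, and production systems use a real DICOM listener (e.g., Orthanc or a MONAI Deploy informatics gateway) rather than folder polling:

```python
from pathlib import Path

def find_new_studies(incoming_dir, seen):
    """Return study folders not yet processed, recording them in `seen`.

    A deployment loop would call this periodically, then hand each new
    study to the GPU inference service for DICOM->NIfTI conversion,
    segmentation, and DICOM SEG export.
    """
    new = []
    for study in sorted(Path(incoming_dir).iterdir()):
        if study.is_dir() and study.name not in seen:
            seen.add(study.name)
            new.append(study)
    return new
```

The second call with the same `seen` set returns nothing, which is the property that prevents double-processing a study.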

Three AI Interaction Models

Triage

AI prioritizes the worklist, flagging potentially abnormal cases for faster review. The radiologist still reads every study. Lowest risk, clearest workflow benefit.

Concurrent Reader

AI results shown alongside images while the radiologist reads. Most common for FDA-cleared devices. Risk of automation bias — radiologists may anchor to AI predictions.

Second Reader

Radiologist makes initial assessment, then sees AI output. Reduces automation bias but adds time. Studies show AI as second reader reduced false negatives by 47% for brain metastases.

Deployment Frameworks & Tools

MONAI Deploy

MONAI Deploy is the final stage of the MONAI ecosystem pipeline: MONAI Label (annotation) → MONAI Core (training) → MONAI Deploy (clinical deployment). Deploy packages your trained model as a MONAI Application Package (MAP) — a Docker container with the model, preprocessing, inference pipeline, and DICOM I/O. The App SDK provides tools for building clinical-grade inference, and informatics gateways handle DICOM routing and HL7 FHIR integration. This is the closest thing to a standardized path from research model to clinical tool.

Docker Containerization

Docker is the foundation of reproducible AI deployment. Your model, Python environment, CUDA drivers, and all dependencies are packaged into a container that runs identically on any server. NVIDIA GPU support via the NVIDIA Container Toolkit enables GPU-accelerated inference inside Docker. MONAI Application Packages are Docker containers underneath. For regulatory compliance, containers provide version control and reproducibility — you can prove exactly which model version produced any given prediction.
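A hedged sketch of what such a container might look like for a PyTorch-based segmentation model; the base image tag, file names, version label, and entrypoint are illustrative, not a prescribed configuration:

```dockerfile
# Hypothetical packaging sketch for a segmentation inference service.
FROM nvcr.io/nvidia/pytorch:24.01-py3
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model_v1.2.3.pt model.pt
COPY infer.py .
# Record the exact model version so any prediction can be traced to it
LABEL org.example.model.version="1.2.3"
ENTRYPOINT ["python", "infer.py", "--model", "/app/model.pt"]
```

At run time the NVIDIA Container Toolkit exposes the GPU with the `--gpus` flag, e.g. `docker run --rm --gpus all seg-model:1.2.3`, and the pinned version label supports the traceability requirement described above.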

Streamlit / Gradio (For Research Prototypes)

Before building a full PACS-integrated system, many researchers build lightweight web demos using Streamlit or Gradio. Upload a NIfTI file, see the segmentation overlaid in the browser. These are excellent for collaborator demos, grant applications, and proof-of-concept validation, but they are not suitable for clinical deployment — they lack DICOM integration, don’t meet regulatory requirements, and aren’t designed for clinical security standards.

Open-Source PACS Pipelines

For institutions without commercial AI platforms, open-source options exist. An Orthanc-based vendor-agnostic pipeline demonstrated feasible low-cost AI integration. DIANA (DICOM Image ANalysis and Archive) orchestrates AI workflows from study detection to clinician notification. One PACS-integrated implementation at Yale achieved real-time brain tumor segmentation in 4 seconds with radiomic feature extraction in 5.8 seconds.

Automation Bias: When AI Help Becomes Harmful

Here is the most uncomfortable finding in the radiology AI literature: in some studies, AI alone outperformed radiologists who were using AI. Human-AI interaction can paradoxically degrade performance compared to either alone. This is automation bias — the tendency to over-trust automated suggestions.

The Mammography Study (Radiology, 2023)

The most rigorous study of automation bias in radiology: when AI provided correct suggestions, radiologists scored 79.7–82.3% accuracy. When AI provided incorrect suggestions, accuracy dropped to 19.8–45.5% (P<0.001). All experience levels were susceptible, though inexperienced radiologists showed significantly greater bias toward incorrect high-severity suggestions. The effect sizes were massive (η²=0.93–0.97).

💡
Positive evidence for brain tumors: A randomized multi-reader evaluation of AI-assisted brain tumor segmentation for stereotactic radiosurgery found that AI assistance increased inter-reader agreement (DSC from 0.86 to 0.90, P<0.001), improved lesion detection sensitivity (82.6% → 91.3%), and reduced contouring time by 30.8%. Less-experienced physicians gained more accuracy improvement, while specialists gained more time savings. The key: appropriate integration design matters enormously.

Mitigation Strategies

Reducing automation bias requires designing the interaction, not just the algorithm: sequential reading (radiologist assesses first, then sees AI) reduces anchoring to AI predictions. Calibrated confidence displays show uncertainty alongside predictions so clinicians know when the model is guessing. Training programs that teach AI failure modes help clinicians maintain appropriate skepticism. Friction-based design requires active engagement rather than passive acceptance of AI output.
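The calibrated-confidence idea can be sketched: given repeated stochastic predictions for a voxel or region (MC dropout or test-time augmentation), summarize the mean and spread and flag cases for active review. The band and spread thresholds below are illustrative, not validated operating points:

```python
import statistics

def confidence_summary(prob_samples):
    """Mean and spread of foreground probability across repeated
    stochastic forward passes (e.g., MC dropout or TTA)."""
    return statistics.mean(prob_samples), statistics.pstdev(prob_samples)

def review_flag(mean_p, spread, unsure_band=(0.35, 0.65), max_spread=0.15):
    """Hypothetical display rule: request review where the model is
    effectively guessing (probability near 0.5) or its samples disagree."""
    low, high = unsure_band
    return "REVIEW" if (low <= mean_p <= high or spread > max_spread) else "OK"
```

Surfacing a "REVIEW" state rather than a clean overlay is one form of friction-based design: it forces engagement exactly where passive acceptance would be most dangerous.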

Clinical Impact Beyond Efficiency

Treatment Planning

Automated segmentation transforms radiation therapy planning. AI-generated tumor contours provide a consistent starting point that reduces inter-practitioner variability — a major source of treatment inconsistency. For stereotactic radiosurgery, AI contouring is fast enough (4 seconds) to enable real-time treatment plan iteration, improving dose targeting precision.

Volumetric Monitoring (RANO)

RANO 2.0 now supports volumetric tumor measurements alongside traditional 2D measurements. But manual volumetric assessment is impractical without automation. AI segmentation makes longitudinal volume tracking feasible, enabling earlier detection of progression and more objective treatment response assessment. One study showed that 2D RANO measurements had only 21% accuracy compared to volumetric ground truth for lower-grade gliomas.
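The volumetric bookkeeping itself is simple once a segmentation exists, which is exactly why automation makes longitudinal tracking feasible at scale. A minimal sketch (helper names are illustrative; progression and response cutoffs belong to the clinical protocol, not this code):

```python
def tumor_volume_ml(mask, voxel_volume_mm3):
    """Tumor volume in millilitres from a flat binary mask and the
    scanner's voxel volume (spacing_x * spacing_y * spacing_z in mm^3)."""
    return sum(mask) * voxel_volume_mm3 / 1000.0

def volume_change_pct(baseline_ml, followup_ml):
    """Percent volume change between timepoints, the quantity a
    volumetric response criterion would threshold."""
    return 100.0 * (followup_ml - baseline_ml) / baseline_ml
```

For example, 2,000 segmented voxels of 1 mm³ give 2.0 mL; a follow-up volume of 3.0 mL is a +50% change, the kind of shift 2D bidimensional measurements can easily misjudge.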

Radiogenomics

AI segmentation enables extraction of radiomic features that can predict molecular tumor characteristics non-invasively: tumor grading, molecular subtype prediction (IDH mutation, MGMT methylation), survival prediction, and distinguishing pseudoprogression from true recurrence. This could potentially reduce reliance on invasive biopsies and accelerate time to molecular diagnosis.

Pediatric Brain Tumors

Volumetric consistency is particularly important for pediatric neuro-oncology, where inter-rater variability in manual measurements can obscure treatment response. AI-based segmentation provides a standardized approach that reduces this variability, supporting more reliable longitudinal monitoring during a child’s treatment.

Ethics, Transparency & Liability

Explainability: The “Black Box” Problem

A U-Net that outputs a segmentation mask doesn’t tell you why it drew that boundary. Methods exist for making models more interpretable: Grad-CAM/Grad-CAM++ highlights which input regions most influenced the prediction, attention maps from transformer architectures show where the model focused, and uncertainty visualization (via TTA or Monte Carlo dropout) shows where the model is unsure. Evidence on whether explainability improves clinical decisions is mixed — it may be more valuable for detecting model failures and data drift than for improving individual predictions.

Informed Consent

When should patients be told AI was involved in their diagnosis? A proposed framework evaluates AI model autonomy, departure from standard practice, whether AI is patient-facing, and clinical risk. Low-risk, non-patient-facing uses (like AI helping a radiologist contour a tumor faster) may require no specific consent. Higher-risk or more autonomous applications may require notification or formal consent. The field is still evolving.

Liability

Who is responsible when AI contributes to an error? Currently, healthcare professionals retain full accountability for decisions involving AI — the AI is legally a tool, not a decision-maker. But liability is distributed across developers, clinicians, healthcare organizations, and patients in complex ways. The dynamic, learning nature of AI complicates traditional product liability definitions. Multiple legal frameworks have been proposed (risk-based liability, strict liability, regulatory sandboxes), but none have been widely adopted.

Equity

If your model was trained primarily on data from academic medical centers with 3T scanners, it may fail at community hospitals with 1.5T scanners. If the training data underrepresents certain demographics, the model may perform worse for those patients. Ensuring AI works equitably requires diverse training data, demographic-stratified evaluation, and post-deployment monitoring for performance disparities. This is not just an ethical obligation — it’s increasingly a regulatory requirement under both the EU AI Act and evolving FDA guidance.

Challenges & Barriers to Adoption

Data Drift: Models Decay Over Time

An AI model trained in 2023 may not work in 2025. Scanner software updates change image characteristics. New treatment protocols alter what tumors look like. Patient populations shift. A comprehensive analysis of 8 clinical AI models found that COVID-19 caused performance shifts of up to ΔAUROC 0.44. Solutions include label-agnostic drift monitoring (detecting shifts without needing new ground-truth labels), transfer learning (adapting to new sites), and continual learning (proactively updating models when drift is detected).
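A label-agnostic drift monitor can be as simple as tracking a model-input statistic against a reference window and alarming on standardized shift, with no new ground-truth labels required. A minimal sketch, with an illustrative 3-SD operating point:

```python
import statistics

def drift_score(reference, current):
    """Standardized shift of a model-input statistic (e.g., mean image
    intensity per study) between a reference window and the current batch."""
    mu, sigma = statistics.mean(reference), statistics.pstdev(reference)
    shift = abs(statistics.mean(current) - mu)
    if sigma == 0:
        return float("inf") if shift else 0.0
    return shift / sigma

def needs_review(score, threshold=3.0):
    """Flag for human review when drift exceeds `threshold` standard
    deviations (the threshold is an illustrative operating point)."""
    return score > threshold
```

A scanner software update that brightens images would push the batch mean several reference SDs away and trip the flag, prompting revalidation before the model silently decays.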

The “Last Mile” Problem

Despite 950+ FDA-cleared AI devices, clinical adoption remains limited. Published barriers include: unstructured implementation processes, lack of hospital-wide innovation strategies, uncertain clinical added value, large variance in trust among radiologists, poor interoperability with existing systems, information overload from AI outputs, medicolegal uncertainty, and insufficient AI literacy. The technology is often ready before the institution is.

Economics: Who Pays?

AI cost-effectiveness analyses look promising on paper: 85% of reviewed studies reported cost savings. But the analyses are often incomplete — only 28% included implementation costs, only 57% reported operational costs, and 63% evaluated AI at early development stages. Commonly overlooked costs include software licensing ($50K–$200K+/year), GPU hardware, integration engineering, clinician training, continuous monitoring, and regulatory compliance. Organizations often focus exclusively on financial returns while ignoring quality and safety benefits.

📚
What this means for you as a student: Building the model is the beginning, not the end. The skills that matter most for clinical translation are ones rarely taught in AI courses: understanding regulatory pathways, designing prospective validation studies, building DICOM-compatible infrastructure, managing data drift, and communicating with clinicians. If you’re serious about clinical impact, learn to think like a translational researcher, not just a machine learning engineer.

This Week’s Learning Resources

Essential Reading

The most important single paper on this topic. Covers the current state and future of AI in medical image interpretation, automation bias, clinical trial design, and the gap between technical and clinical validation. Required reading.
N Engl J Med. 2023;388:1981–1990
Landmark randomized trial: AI assistance improved detection sensitivity (82.6%→91.3%), inter-reader agreement (DSC 0.86→0.90), and reduced contouring time by 30.8%. The gold standard for brain tumor AI clinical evidence.
Neuro-Oncology. 2021;23(10):1713–1724
The study that proves automation bias is real and massive: accuracy dropped from 80% to 20–45% with incorrect AI suggestions. All experience levels were susceptible. Essential for understanding human-AI interaction risks.
Radiology. 2023;307(4):e222176
Practical primer on DICOM, IHE profiles, DICOMweb, HL7 FHIR, and how they connect to build clinical AI infrastructure. The technical roadmap for integration.
Radiology. 2024;310(3):e232174

Deployment Tools

MONAI Deploy: Package your trained model as a clinical-grade containerized application. Includes App SDK, informatics gateway, and example application packages. The canonical path from MONAI training to clinical deployment.
Orthanc: Free, vendor-agnostic DICOM server that can serve as the backbone for research AI deployment. Used in published open-source AI integration pipelines.
Streamlit / Gradio: Build quick interactive demos of your segmentation model. Not for clinical use, but invaluable for collaborator demonstrations, paper figures, and grant applications.

Deep Dives

AI chest X-ray classifiers systematically underdiagnose underserved populations. The paper that put algorithmic bias in medical AI on the map.
The most comprehensive study of data drift across 8 clinical AI models. Demonstrates label-agnostic monitoring, transfer learning, and continual learning as solutions.
Legal analysis of liability risk when AI contributes to clinical errors. Essential reading for understanding the medicolegal landscape.
The “last mile” problem from a health system perspective. Why technically excellent AI isn’t used clinically, and what it takes to change that.