
Are Large Vision Language Models Truly Grounded in Medical Images?
Evidence from Italian clinical Visual Question Answering: do VLMs actually look at the image, or are they just guessing?
The suspicion
Frontier Vision Language Models — Claude Sonnet 4.5, GPT-4o, GPT-5-mini, Gemini 2.0 — score impressively on Medical Visual Question Answering benchmarks. They appear to read radiographs, recognize skin lesions, interpret ultrasounds.
But there is an awkward question the community has mostly avoided: do those scores actually come from the image, or is the image just stage scenery? A model might be answering correctly by exploiting linguistic cues in the question, textual knowledge memorized during pre-training, or statistical patterns over plausible answers — without ever really looking at the medical pixels in front of it.
It is the clinical version of the Clever Hans effect: the horse that seemed able to count was in fact reading involuntary cues in its owner's posture.
What we did
We built a clinical Visual Question Answering dataset in Italian. The choice of language was deliberate: moving away from English makes it harder for a model to "get away with it" by leaning on patterns memorized during training. We then ran the four frontier VLMs through a protocol designed to separate what the model knows from what the model sees.
The idea is simple and brutal: if a model is truly grounded in the image, perturbing the image (masking it, or swapping it for another) must perturb the answer. If the answer does not change, the image was never really read.
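The perturbation idea can be sketched in a few lines of Python. This is only an illustration, not the protocol we actually ran: the `Model` interface, the `mask` function and the two toy models are hypothetical stand-ins, and a real test would call a VLM API and blank or swap actual pixels.

```python
from typing import Callable, Sequence, Tuple

Image = object  # placeholder type; a real pipeline would pass pixel arrays
Model = Callable[[Image, str], str]  # hypothetical interface: (image, question) -> answer


def answer_change_rate(
    model: Model,
    items: Sequence[Tuple[Image, str]],
    perturb: Callable[[Image], Image],
) -> float:
    """Fraction of (image, question) pairs whose answer changes after perturbation."""
    if not items:
        raise ValueError("items must not be empty")
    changed = 0
    for image, question in items:
        before = model(image, question).strip().lower()
        after = model(perturb(image), question).strip().lower()
        changed += before != after
    return changed / len(items)


# Toy stand-ins for illustration only (these are not real VLMs).
def text_only_model(image: Image, question: str) -> str:
    # Answers from keywords in the question and never touches the image.
    return "polmonite" if "torace" in question else "nevo"


def image_reading_model(image: Image, question: str) -> str:
    # Answer depends entirely on the image content.
    return f"finding-{image}"


def mask(image: Image) -> Image:
    return "masked"  # stands in for blanking the pixels


items = [
    ("xray_1", "Cosa mostra la radiografia del torace?"),
    ("derm_7", "Che tipo di lesione cutanea è questa?"),
]

print(answer_change_rate(text_only_model, items, mask))      # 0.0
print(answer_change_rate(image_reading_model, items, mask))  # 1.0
```

A change rate near zero is the warning sign: the answers survive the loss of the image, so the image was probably not what produced them.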
The open mysteries
Three things do not add up and deserve further work:
- Asymmetry between models. Not all VLMs behave the same way under visual perturbation. Some collapse predictably when the image is masked or swapped; others stay stable, and that stability is the most suspicious signal, not the most reassuring one.
- The language gap. Italian performance does not scale linearly from English performance. It remains unclear whether the gap is in prompt comprehension, in domain-specific clinical vocabulary, or in something deeper in the vision-language fusion pipeline.
- What "looking" actually means. There is a continuum between full grounding (the model uses the image as primary source) and zero grounding (the image is ignored). Today's models live in a grey zone in between: they use the image, but opportunistically, like a student who only sneaks a glance when the question is hard.
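One possible way to place a model on the continuum between zero and full grounding is to ask how much of its above-chance accuracy disappears when the image is removed. The sketch below is an assumption made for illustration, not a metric taken from our experiments; the function name `grounding_score` and the accuracy figures are invented.

```python
def grounding_score(acc_with_image: float, acc_without_image: float, chance: float) -> float:
    """Share of above-chance accuracy that is lost when the image is hidden.

    0.0 -> accuracy is unchanged without the image (zero grounding)
    1.0 -> accuracy falls to chance without the image (full grounding)
    Illustrative definition only; other normalizations are equally defensible.
    """
    headroom = acc_with_image - chance
    if headroom <= 0:
        raise ValueError("accuracy with the image must exceed chance")
    lost = acc_with_image - acc_without_image
    # Clip so that noise (e.g. slightly higher accuracy without the image) stays in [0, 1].
    return min(1.0, max(0.0, lost / headroom))


# Invented numbers for a four-option multiple-choice task (chance = 0.25).
print(grounding_score(0.80, 0.80, 0.25))            # 0.0
print(grounding_score(0.80, 0.25, 0.25))            # 1.0
print(round(grounding_score(0.80, 0.58, 0.25), 2))  # 0.4
```

The third call is the grey zone described above: the model loses some accuracy without the image, but most of its performance comes from somewhere else.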
Hypotheses under test
- VLMs may exploit textual shortcuts in the questions (demographic keywords, anatomical hints in the prompt) to derive the answer before consulting the image.
- They may carry very strong medical priors learned from text corpora (e.g. "lesion + age 65 + smoker → most likely diagnosis") that dominate the visual signal.
- Fine-tuning on English-centric benchmarks may have optimized them to recognize the format of the questions, not the content of the images.
None of these is conclusively proven. All are compatible with the numbers we observe.
Why this matters
The risk is not theoretical. If a VLM is deployed in a clinical context — even as triage support or a second opinion — and is not really using the image, then:
- Its answers are reliable only when the textual question is already informative enough to contain the answer. In the genuinely ambiguous cases (the ones the AI was needed for) it will fail.
- Errors will be systematic across subpopulations: patients whose textual data deviates from the "prototype" that lets the model answer without looking will get worse predictions.
- Aggregate benchmark metrics will mask this collapse, because on average the models look good.
This is the same pattern we see with gender bias in clinical predictive models: high average performance that hides distorted behavior on subgroups.
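A small worked example shows how an aggregate score can hide a subgroup collapse. The data are made up for illustration, and the subgroup labels "prototypical" and "atypical" are hypothetical; in practice the strata would be defined from patient metadata.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_subgroup(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy per subgroup from (subgroup, is_correct) records."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for subgroup, is_correct in records:
        totals[subgroup] += 1
        hits[subgroup] += bool(is_correct)
    return {group: hits[group] / totals[group] for group in totals}


# Made-up records: 20 prototypical cases (18 correct), 5 atypical cases (2 correct).
records = (
    [("prototypical", True)] * 18
    + [("prototypical", False)] * 2
    + [("atypical", True)] * 2
    + [("atypical", False)] * 3
)

overall = sum(is_correct for _, is_correct in records) / len(records)
print(overall)                        # 0.8
print(accuracy_by_subgroup(records))  # {'prototypical': 0.9, 'atypical': 0.4}
```

An overall accuracy of 0.8 looks respectable, yet the atypical patients are served at 0.4. Reporting the stratified numbers next to the aggregate is what makes the gap visible.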
The takeaway
Before trusting a VLM in healthcare — or pitching it to an ethics board, a payer, a patient — the question to ask is not "what is the AUROC?", but:
"If I hide the image, does the answer change?"
If it barely does, you are not using a Vision Language Model. You are using a Language Model with a decorative picture next to it. And in medicine that difference is not a detail: it is the difference between a diagnostic system and a system reciting its lines.