EuropeMedQA Study Protocol
A multilingual, multimodal benchmark to evaluate LLMs on official medical exams from Italy, France, Spain, and Portugal.
What EuropeMedQA is
Large Language Models score well on English-language medical benchmarks. Outside English, performance drops sharply, and almost no benchmark exists to measure by how much.
EuropeMedQA addresses this gap: a dataset built from the official regulatory medical exams of four European countries (Italy, France, Spain, Portugal) that combines text and diagnostic images to test multimodal models on material drawn from European clinical practice. The study follows FAIR principles and SPIRIT-AI guidelines, and uses strictly constrained zero-shot prompting so that linguistic and visual reasoning can be measured comparably across the four languages.
The strategic goal is twofold: build a resource that is resistant to training-data contamination, and advance the development of medical AI that genuinely adapts to non-English clinical contexts.
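To make "strictly constrained zero-shot prompting" concrete, here is a minimal Python sketch of what such a setup could look like. The instruction text, the function names (build_prompt, parse_answer), and the parsing rule are all assumptions for illustration; the study's actual prompt and scoring rules are not given in this document.

```python
from __future__ import annotations

import re
from typing import Optional

# Hypothetical instruction text (assumption, not the study's real prompt).
INSTRUCTIONS = (
    "Answer the following multiple-choice question. "
    "Reply with the letter of the correct option only."
)


def build_prompt(question: str, options: dict[str, str]) -> str:
    """Assemble a zero-shot prompt: instructions, question, lettered options.

    No worked examples are included, which is what makes it zero-shot.
    """
    lines = [INSTRUCTIONS, "", question]
    lines += [f"{letter}. {text}" for letter, text in sorted(options.items())]
    lines.append("Answer:")
    return "\n".join(lines)


def parse_answer(reply: str, valid_letters: set[str]) -> Optional[str]:
    """Accept only a bare option letter, optionally wrapped like "(B)" or "B.".

    Anything longer (explanations, several letters) is rejected as None, so
    free-text replies are never guessed at in any of the four languages.
    """
    match = re.fullmatch(r"\s*\(?([A-Za-z])\)?[.:]?\s*", reply)
    if match is None:
        return None
    letter = match.group(1).upper()
    return letter if letter in valid_letters else None
```

Rejecting anything other than a bare letter keeps scoring identical across languages, at the cost of counting verbose but correct replies as unparseable.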
My contribution
On the technical side I worked on:
- Data manipulation and normalization — building the unified schema that links questions, answer options, and diagnostic images across the four languages, ensuring that metrics were comparable without introducing format-induced bias.
- Inference with frontier models — the zero-shot evaluation pipeline across multiple LLMs/VLMs, including API key management and rate-limit handling to obtain reproducible results on the full dataset.
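The two items above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's actual code: the ExamItem field names, the RateLimitError class, and the call_with_backoff helper are hypothetical, and a real provider client would raise its own rate-limit exception.

```python
from __future__ import annotations

import random
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class ExamItem:
    """One exam question in a language-agnostic layout (illustrative fields)."""

    item_id: str
    language: str                       # e.g. "it", "fr", "es", "pt"
    question: str
    options: dict[str, str]             # option letter -> option text
    answer: str                         # letter of the correct option
    image_paths: tuple[str, ...] = ()   # empty for text-only items


class RateLimitError(Exception):
    """Stand-in for the error a provider client raises when rate limited."""


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponential backoff and jitter."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise AssertionError("unreachable")
```

Storing every language in the same record shape means one scoring function can serve all four exams, and wrapping each model call in the retry helper lets a full-dataset run survive transient rate limits without silently dropping items.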
Why it matters
A model that performs well in English but fails in Italian, French, Spanish, or Portuguese is not a usable medical system for Europe; it is an academic toy. EuropeMedQA is the first step in moving the conversation from average performance to linguistic equity in medical AI systems.