Dataset Description
In this competition, you will design a single prompt to guide small language models (<70B parameters) in generating accurate, one-sentence medical diagnoses from clinical case reports.
The dataset comprises 200 cases: 100 complex cases randomly selected from case reports published in The New England Journal of Medicine (NEJM) and The Journal of the American Medical Association (JAMA) between 2024 and 2025, and 100 common cases randomly selected from the MedQA database. The cases vary in complexity, presentation, and medical specialty, simulating the diagnostic challenges faced in clinical practice.
Each case is provided as a single, detailed narrative in natural language. Your task is to engineer a prompt that, when combined with each case report and fed into a language model, will elicit the most accurate one-sentence diagnosis.
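As a rough illustration of the task, your prompt is a fixed instruction that is combined with each case narrative before being sent to the model. The instruction text and function name below are placeholders, not a recommended prompt:

```python
# A minimal sketch of combining a shared prompt with one case report.
# The instruction text and names here are illustrative placeholders only.

PROMPT_TEMPLATE = (
    "You are an experienced clinician. Read the case report below and reply "
    "with a single sentence stating the most likely diagnosis.\n\n"
    "Case report:\n{case}\n\nDiagnosis:"
)

def build_prompt(case_text: str) -> str:
    """Insert one clinical case narrative into the shared prompt template."""
    return PROMPT_TEMPLATE.format(case=case_text)
```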
Files
Case_Report.csv - A CSV file containing the full set of clinical case reports.
Columns
id
- A unique identifier (number) for each case report.
Cases
- The full narrative clinical case report describing a unique patient scenario.
Diagnosis
- The one-sentence diagnosis (string) corresponding to the case report. This field should be filled in by participants based on the diagnosis generated by their prompt and model for each case.
Submission File
Your submission must be a CSV file with a header, containing one row for each clinical case report in the test set.
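A plausible layout is shown below; the id and Diagnosis column names follow the Columns section above, and the diagnoses shown are placeholders only:

```
id,Diagnosis
1,Acute bacterial meningitis
2,Systemic lupus erythematosus
3,Pulmonary embolism
```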
Tools and Resources
Designated LLM: A specific small language model (<70B parameters) has been designated for use in this competition. Participants are required to use this model to ensure a fair comparison of prompt effectiveness.
Jupyter Notebook: To streamline your workflow, we have provided a ready-to-use Jupyter notebook. This notebook demonstrates how to load the dataset, apply your prompt to the cases, run the LLM, and format your output for submission.
Getting Started Guide: A step-by-step guide is included in the competition datasets. This guide walks you through accessing the dataset, using the Jupyter notebook, and submitting your results on Kaggle.
All required files, including the dataset, notebook, and guide, are available in the Competition Datasets section on Kaggle.
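To give a sense of the workflow the notebook covers (load the cases, apply your prompt, run the model, and write a submission file), here is a minimal sketch. The file names, model identifier, and generation settings are assumptions; refer to the provided notebook and guide for the official setup.

```python
import pandas as pd
from transformers import pipeline

PROMPT = (
    "You are an experienced clinician. Read the case report below and reply "
    "with a single sentence stating the most likely diagnosis.\n\nCase report:\n"
)

# Load the full set of case reports (columns: id, Cases, Diagnosis).
cases = pd.read_csv("Case_Report.csv")

# Hypothetical model identifier; substitute the designated competition model.
generator = pipeline("text-generation", model="designated-small-llm")

def diagnose(case_text: str) -> str:
    """Run one case through the model and keep the first line of its reply."""
    result = generator(PROMPT + case_text, max_new_tokens=64, return_full_text=False)
    text = result[0]["generated_text"].strip()
    return text.splitlines()[0] if text else ""

# Fill in the Diagnosis column and write the submission file.
cases["Diagnosis"] = cases["Cases"].apply(diagnose)
cases[["id", "Diagnosis"]].to_csv("submission.csv", index=False)
```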
Evaluation
Submissions will be evaluated based on how closely your predicted diagnosis matches the ground truth for each case report. For each case, if your answer is sufficiently similar to the correct diagnosis, it will be considered a match.
Scoring Details
Each case is marked as a match or no match depending on the similarity between your answer and the ground truth.
Your final score is based on the match rate: the proportion of case reports where your answer is considered a match.
If multiple teams have similar match rates, minor differences in answer similarity will be used to further differentiate their scores.
Answers are compared in a case-insensitive manner.
Leading/trailing whitespace is ignored.
Blank or missing answers will not be counted as a match.
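The exact similarity measure used for scoring is not disclosed. Purely as an illustration of the normalization rules above (case-insensitive comparison, whitespace stripped, blank answers never matching), a sketch with a placeholder similarity check might look like this:

```python
from difflib import SequenceMatcher

def normalize(answer: str) -> str:
    """Lowercase and strip leading/trailing whitespace, per the rules above."""
    return answer.strip().lower()

def is_match(predicted: str, truth: str, threshold: float = 0.8) -> bool:
    """Placeholder check: blank answers never match; the real similarity metric
    and threshold used for scoring are not specified by the competition."""
    pred = normalize(predicted)
    if not pred:
        return False
    return SequenceMatcher(None, pred, normalize(truth)).ratio() >= threshold
```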