The ClinicalBench benchmark was created manually by three senior clinicians and two AI researchers. As shown in the figure below, the creation process consists of four key steps. (1) The Data collection step focuses on authenticity, diversity, and privacy. Based on department divisions and the common disease types in each department, the medical team selects representative real cases for each disease from the hospital case database, with permission for research use. Because these clinical case data are private to the hospitals, the risk of data leakage to any LLM is eliminated. (2) The Professional knowledge review step ensures the accuracy of the data. The team of doctors conducts a detailed professional review of the diagnostic information, treatment process, and outcome of each case to ensure the medical accuracy and professional quality of the data. (3) The Privacy protection and de-identification step protects patient privacy. The team of doctors conducts two rounds of independent review to identify and remove any content that could reveal patient identities, treatment regions, or other sensitive information. (4) The Data integrity and compliance check step ensures completeness and ethical compliance. The two AI researchers review the data to ensure that each record is complete and meets the requirements of the defined medical tasks. They also reconfirm that the dataset contains no sensitive information and strictly complies with the ethical guidelines.
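The completeness part of step (4) could be supported by a small script along the following lines. This is only a minimal sketch, not the procedure the team actually used: the field names in `REQUIRED_FIELDS` and the helper `find_incomplete_records` are hypothetical, since the ClinicalBench record schema is not described here.

```python
# Hypothetical field names: the real ClinicalBench schema may differ.
REQUIRED_FIELDS = ("department", "disease", "case_text", "diagnosis")


def find_incomplete_records(records, required_fields=REQUIRED_FIELDS):
    """Return (index, missing_fields) for each record lacking a required field.

    A field counts as missing if it is absent, None, or only whitespace.
    """
    problems = []
    for index, record in enumerate(records):
        missing = [
            field
            for field in required_fields
            if not str(record.get(field) or "").strip()
        ]
        if missing:
            problems.append((index, missing))
    return problems


# Placeholder records for illustration only (not real benchmark data).
sample = [
    {"department": "pediatrics", "disease": "placeholder_a",
     "case_text": "placeholder text", "diagnosis": "placeholder"},
    {"department": "orthopedics", "disease": "placeholder_b",
     "case_text": "  ", "diagnosis": None},
]
incomplete = find_incomplete_records(sample)
```

A script like this can only flag structurally incomplete records; the sensitive-information and ethics checks described above still require human review.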
ClinicalBench is a fine-grained evaluation benchmark designed specifically for multi-departmental clinical diagnosis, covering 24 departments such as pediatrics, orthopedics, and neurosurgery. The figure below presents detailed information about the departments covered by ClinicalBench. The benchmark comprises 150 diseases with 10 cases each, for a total of 1,500 samples averaging about 1,000 tokens per case. To the best of our knowledge, ClinicalBench is the most comprehensive clinical diagnosis evaluation benchmark to date, covering the widest range of departments and diseases.
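After receiving the data, the stated composition (150 diseases, 10 cases each, 1,500 samples) can be sanity-checked with a few lines of Python. The sketch below assumes each record is a dictionary with a `disease` key; that key name and the helper `check_composition` are assumptions for illustration, not part of the released format.

```python
from collections import Counter

# Figures stated in the benchmark description.
EXPECTED_DISEASES = 150
CASES_PER_DISEASE = 10


def check_composition(records, disease_key="disease"):
    """Count cases per disease and compare against the stated composition.

    `disease_key` is a hypothetical field name; adjust it to the real schema.
    """
    counts = Counter(record[disease_key] for record in records)
    # Diseases whose case count differs from the expected 10.
    uneven = {d: n for d, n in counts.items() if n != CASES_PER_DISEASE}
    return {
        "n_samples": sum(counts.values()),
        "n_diseases": len(counts),
        "uneven": uneven,
        "ok": len(counts) == EXPECTED_DISEASES and not uneven,
    }


# Synthetic stand-in data with the stated shape (not real benchmark records).
synthetic = [
    {"disease": f"disease_{i}"}
    for i in range(EXPECTED_DISEASES)
    for _ in range(CASES_PER_DISEASE)
]
report = check_composition(synthetic)
```

If `report["ok"]` is false on the real data, the `uneven` entry lists which diseases deviate from 10 cases.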
In the appendix of our paper, we present examples from the ClinicalBench dataset, including both Chinese and English versions. Please note that accessing the ClinicalBench dataset requires an application. If you wish to access the full dataset, please read the licensing documentation and submit an access request. We will send the data to your specified email address within 48 hours.