Large language models (LLMs) are increasingly evaluated for potential applications in medical education and clinical decision support, yet rigorous benchmarking remains essential to assess reliability and consistency in high-stakes medical knowledge domains. Dendritic Health AI developed Neural Consult, a proprietary LLM system designed to support medical education. This study reports the performance of Neural Consult on a benchmark set of U.S. Medical Licensing Examination (USMLE) questions, with the objective of assessing its reliability and accuracy across multiple examination levels.
Neural Consult was evaluated using questions sourced from the published USMLE examination dataset of Kung et al. The evaluation included 94 questions for Step 1, 109 for Step 2 CK, and 122 for Step 3.
All questions were provided to the model exactly as published, without modification; questions containing images were excluded. Responses were generated using Dendritic Health AI's proprietary Neural Consult LLM system, which incorporates both proprietary and curated public medical datasets to improve accuracy beyond publicly available models. Minor formatting adjustments were made to Neural Consult's publicly accessible models to optimize multiple-choice question answering. Leading LLMs were run against the same question set with default settings for comparison. Model outputs were scored by human reviewers for correctness against the official answer key.
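As a minimal illustration of how per-Step accuracy can be tallied from human-adjudicated correctness judgments, a short sketch is given below; the record layout and field names are assumptions for illustration, not the study's actual scoring pipeline.

    # Minimal sketch of per-Step accuracy tallying (illustrative only).
    # Record layout and field names are hypothetical, not the study's pipeline.
    from collections import defaultdict

    def score_by_step(records):
        """records: iterable of dicts like {"step": "Step 1", "correct": True}."""
        totals = defaultdict(lambda: [0, 0])  # step -> [n_correct, n_total]
        for r in records:
            totals[r["step"]][0] += int(bool(r["correct"]))
            totals[r["step"]][1] += 1
        return {step: (c / n, c, n) for step, (c, n) in totals.items()}

    # Example with set sizes matching those reported in this evaluation.
    example = (
        [{"step": "Step 1", "correct": True}] * 94
        + [{"step": "Step 2 CK", "correct": True}] * 109
        + [{"step": "Step 3", "correct": True}] * 122
    )
    for step, (acc, c, n) in score_by_step(example).items():
        print(f"{step}: {acc:.0%} ({c}/{n})")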
In this evaluation, Neural Consult achieved a score of 100% on the benchmark questions included in each examination set. Performance outcomes were as follows:
Step 1 — 100% accuracy (94/94 questions); see "Neural Consult Step 1 with explanations"
Step 2 CK — 100% accuracy (109/109 questions); see "Neural Consult Step 2 with explanations"
Step 3 — 100% accuracy (122/122 questions); see "Neural Consult Step 3 with explanations"

Neural Consult achieved perfect accuracy on this USMLE benchmark question set across the Step 1, Step 2 CK, and Step 3 examinations in our evaluation, supporting the reliability of the model's reasoning and best-answer selection framework under benchmark testing conditions. These findings provide evidence of strong performance in structured medical knowledge assessment contexts and contribute to the growing body of benchmarking efforts for medically oriented LLM systems. LLM outputs are not fully deterministic, however, and accuracy can vary across repeated trials; further study is needed to quantify this variance and to inform methods for reducing it. Continued evaluation on additional datasets and in additional testing environments will support a more comprehensive assessment of reliability and external validity.
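As one simple way to characterize such run-to-run variability (a sketch under assumed inputs, not a method used in this study), the benchmark could be re-run several times and per-trial accuracy summarized by its mean and standard deviation:

    # Sketch: summarizing accuracy across repeated trials of the same question set.
    # The trial_scores values are hypothetical placeholders, not reported results.
    from statistics import mean, stdev

    trial_scores = [1.00, 0.99, 1.00, 0.98, 1.00]  # hypothetical per-trial accuracy
    print(f"mean accuracy: {mean(trial_scores):.3f}")
    print(f"std dev across trials: {stdev(trial_scores):.3f}")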
Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, Madriaga M, Aggabao R, Diaz-Candido G, Maningo J, Tseng V. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023 Feb 9;2(2):e0000198. doi: 10.1371/journal.pdig.0000198. PMID: 36812645; PMCID: PMC9931230.