After 5 months, the same medical vignette is back, and 12 out of 13 LLMs we tested still can't solve it. So why are we calling this a success? You have to read the whole study to find out.
This was a super interesting article! But I suspect the differences you're seeing are mostly from specific product decisions in the chat app, rather than any differences between individual SOTA models.
Which especially makes sense in the context of healthcare! Sure, each individual doctor is smart and well trained, but the "intelligence" of the system lies in the org structure, context management, and processes:
1/ The triage nurse: the initial prompt engineering and intent routing.
2/ The chart and patient history: the context window, plus maybe a RAG pipeline.
3/ The differential diagnosis: the explicit agentic loop mapping out multi-step reasoning, as well as the system prompt.
4/ Lab techs and specialists: the external tool use and continuous verification.
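The mapping above can be sketched as a minimal pipeline. This is purely illustrative of the analogy, not any real product's architecture; every function name here is a hypothetical stand-in.

```python
# Illustrative sketch of the "hospital as an LLM system" analogy.
# All names are hypothetical stand-ins, not a real chat app's API.

def triage(question: str) -> str:
    """The triage nurse: route the raw question to an intent."""
    return "diagnosis" if "symptom" in question.lower() else "general"

def retrieve_chart(patient_id: str) -> list[str]:
    """The chart and history: pull prior records into the context window
    (a stand-in for a RAG retrieval step)."""
    return [f"history record for {patient_id}"]

def differential(question: str, context: list[str]) -> list[str]:
    """The differential diagnosis: an explicit multi-step reasoning loop
    that generates one hypothesis per piece of retrieved context."""
    return [f"hypothesis from: {c}" for c in context]

def verify_with_tools(hypotheses: list[str]) -> str:
    """Lab techs and specialists: external tools confirm or rule out.
    Here we pretend the first hypothesis survives verification."""
    return hypotheses[0]

def answer(question: str, patient_id: str) -> str:
    intent = triage(question)
    context = retrieve_chart(patient_id) if intent == "diagnosis" else []
    hypotheses = differential(question, context) or ["no hypothesis"]
    return verify_with_tools(hypotheses)

print(answer("Patient reports a new symptom.", "p-42"))
```

The point of the sketch: no single step is "smart" on its own, but the routing, retrieval, looping, and verification around the model are where the system-level intelligence lives.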
Exactly, it's about the right product success eval criteria (in this case, both accuracy and how the model arrives at its conclusion matter a lot in medical diagnosis). Training, evals, and architecture are all important when it comes to a scalable solution.
Thank you both for sharing the thorough testing and thought-provoking analysis. Perhaps you could create a medical diagnosis benchmark arena built on such medical first principles/reasoning, with cases where causality should be prioritized over other metrics. Otherwise, LLMs' working mechanism is basically pattern matching (probability). As you pointed out, MAD may not work if the right success criteria are not selected for the agents' different use purposes. Not surprised Google Gemini now has hardcoded temporal reasoning; they may have worked with many healthcare practitioners and recognized the proper success criteria for medical diagnosis.
I agree. We need a solid benchmark for medical diagnosis / clinical reasoning. As our exercise showed, strong pattern recognition alone may not be enough.
Btw, I'm in AI product for a different industry vertical (retail) and have done quite a bit of causal inference reasoning ML dev in addition to traditional pattern-learning ML. Really interested in the care space and would like to explore a potential partnership as a side project. If you're interested, let me know!
Glad I use Gemini LOL
Personally, I don't think the hard-coded override Gemini used in this particular case is a long-term solution. But hey, it worked here.
I don't ask it for a medical diagnosis, tbh
Thanks for your constructive vigilance in pushing AI to be a more meaningful tool to improve patient care.
🔥