After 5 months, the same medical vignette is back, and 12 out of 13 LLMs we tested still can't solve it. So why are we calling this a success? You have to read the whole study to find out. 😉
This was a super interesting article! But I suspect the differences you're seeing are mostly from specific product decisions in the chat app, rather than any differences between individual SOTA models.
Which especially makes sense in the context of healthcare! Sure, each individual doctor is smart and well trained, but the "intelligence" of the system lies in the org structure, context management, and processes:
1/ The triage nurse: The initial prompt engineering and intent routing.
2/ The chart and patient history: The context window + maybe RAG pipeline.
3/ The differential diagnosis: The explicit agentic loop mapping out multi-step reasoning, as well as the system prompt.
4/ Lab techs and specialists: The external tool use and continuous verification.
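The four-role mapping above can be sketched as a minimal pipeline. Everything here is illustrative (function names, routing logic, and the placeholder "hypotheses" are all hypothetical, not from the article or any real product):

```python
# Hypothetical sketch of the clinic-as-system analogy. Each stage mirrors
# one role from the list above; none of this is a real chat-app API.

def triage(query: str) -> str:
    """1/ The triage nurse: crude intent routing via keyword match."""
    return "diagnosis" if ("pain" in query or "fever" in query) else "general"

def retrieve_chart(patient_id: str, history: dict) -> list:
    """2/ The chart: pull prior notes into the context window (stand-in for RAG)."""
    return history.get(patient_id, [])

def differential_loop(query: str, context: list, max_steps: int = 3) -> list:
    """3/ The differential: an explicit multi-step reasoning loop.
    Each 'step' here just accumulates a placeholder hypothesis."""
    hypotheses = []
    for step in range(max_steps):
        hypotheses.append(f"hypothesis-{step} given {len(context)} prior notes")
    return hypotheses

def verify_with_tools(hypotheses: list) -> str:
    """4/ Lab techs and specialists: external verification picks a winner.
    A real system would call labs or calculators; here we take the last refinement."""
    return hypotheses[-1]

def clinic_pipeline(patient_id: str, query: str, history: dict) -> str:
    intent = triage(query)
    context = retrieve_chart(patient_id, history)
    hypotheses = differential_loop(query, context)
    return f"[{intent}] {verify_with_tools(hypotheses)}"

print(clinic_pipeline("p1", "RUQ pain and fever", {"p1": ["prior cholecystitis note"]}))
```

The point of the sketch is that none of these stages is the model itself: swap in a different SOTA model and the pipeline, not the weights, still determines most of the observed behavior.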
tldr: meh.
Overall: interesting. Big fan of the stack, and a solid effort with Dr. Farag that makes a few valid points. However, the article was difficult to get through. The main issue I see is that you fall into a few medical-student-level fallacies. The "right answers" from the clinical case studies ("emphysematous cholecystitis because RUQ pain and fever; PAD because AKI and cold fingers") are fine as board-prep pearls, but that is not how pulmonary and critical care medicine is practiced, and it is not a fair LLM test. Trash in, trash out. It reminds me of poorly written USMLE questions that reflexively insist "if Bubba and his dog boss have a cold, the right answer can only be blastomycosis." Not real life, and not a useful model test. The ACEi entrapment on timing also falls short. Keep up the efforts, but validation modeling would be best served by complex differential development, not single reflexive answers on a poorly written multiple-choice shelf exam.
This is fantastic, thank you!
Thank you. I'll share something soon, focused more on the larger implication, which is a bit concerning.
Glad I use Gemini LOL
Personally, I don’t think the hard-coded override Gemini used in this particular case is a long-term solution. But hey, it worked here.
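A hard-coded override of the kind described could look like the guard layer below. This is purely illustrative logic for the timing trap discussed in the thread (a drug started after symptom onset cannot be the cause); it does not reflect Gemini's actual internals:

```python
from datetime import date

def temporal_plausibility(drug_start: date, symptom_onset: date) -> bool:
    """Deterministic rule: a drug started after the symptoms appeared
    cannot have caused them. Hypothetical, simplified logic."""
    return drug_start <= symptom_onset

def answer_with_override(model_answer: str, suspect_drug: str,
                         drug_start: date, symptom_onset: date) -> str:
    """Post-hoc guard: if the model blames a drug whose start date fails the
    timeline check, overrule the free-text answer with the hard-coded rule."""
    if suspect_drug in model_answer and not temporal_plausibility(drug_start, symptom_onset):
        return f"override: {suspect_drug} was started after symptom onset, so it cannot be causal"
    return model_answer

print(answer_with_override("likely ACEi-induced angioedema", "ACEi",
                           date(2024, 6, 1), date(2024, 5, 1)))
```

This kind of guard fixes exactly one failure mode, which is why it works on this vignette but is not a long-term substitute for general temporal reasoning.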
Exactly, it’s about the right product success eval criteria (in this case, both accuracy and how the model arrives at its conclusion matter a lot in medical diagnosis). Training, evals, and architecture are all important when it comes to a scalable solution.
I don't ask it for a medical diagnosis, tbh
Thanks for your constructive vigilance in pushing AI to be a more meaningful tool to improve patient care.
🙏
Thank you both for sharing the thorough testing and thought-provoking analysis. Perhaps you could create a medical diagnosis benchmark arena built on such medical first principles/reasoning, with cases where causality should be prioritized over other metrics. Otherwise, LLMs’ working mechanism is basically pattern matching (probability). As you pointed out, MAD may not work if the right success criteria are not selected for the agents for different use purposes. Not surprised Google Gemini now has hardcoded temporal reasoning; they may have worked with many healthcare practitioners and recognized the proper success criteria for medical diagnosis.
I agree. We need a solid benchmark for medical diagnosis / clinical reasoning. As our exercise showed, strong pattern recognition alone may not be enough.
Btw, I’m in AI product for a different industry vertical (retail) and have done quite a bit of causal inference ML development in addition to traditional pattern-learning ML. Really interested in the care space and would like to explore a potential partnership as a side project. If you’re interested, let me know!
🔥