14 Comments
Josh

This was a super interesting article! But I suspect the differences you're seeing are mostly from specific product decisions in the chat app, rather than any differences between individual SOTA models.

Which especially makes sense in the context of healthcare! Sure, each individual doctor is smart and well trained, but the "intelligence" of the system lies in the org structure, context management, and processes:

1/ The triage nurse: The initial prompt engineering and intent routing.

2/ The chart and patient history: The context window + maybe a RAG pipeline.

3/ The differential diagnosis: The explicit agentic loop mapping out multi-step reasoning, plus the system prompt.

4/ Lab techs and specialists: The external tool use and continuous verification.
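
To make the analogy above concrete, here is a minimal, purely illustrative sketch of such a four-stage pipeline. It is not the article's setup or any real product: the call_model stub, the retrieval and lab placeholders, and all function names are assumptions.

```python
# Hypothetical sketch of the "system intelligence" analogy above: routing,
# context, an explicit reasoning loop, and verification wired around a model.
# call_model and the retrieval/lab stand-ins are placeholders, not a real API.

def call_model(prompt: str) -> str:
    """Placeholder for a real chat-model API call."""
    return f"<model response to: {prompt[:60]}...>"

def triage(question: str) -> str:
    """1/ Triage nurse: initial prompt engineering + intent routing."""
    return call_model(f"Route this question to a specialty: {question}")

def gather_context(question: str) -> str:
    """2/ Chart and history: context window, possibly a RAG lookup."""
    retrieved = ["prior note: ...", "medication list: ..."]  # stand-in for retrieval
    return "\n".join(retrieved)

def differential(question: str, context: str, rounds: int = 3) -> str:
    """3/ Differential diagnosis: an explicit multi-step reasoning loop."""
    working = ""
    for _ in range(rounds):
        working = call_model(
            f"Context:\n{context}\nCurrent differential:\n{working}\n"
            f"Refine the differential for: {question}"
        )
    return working

def verify(hypothesis: str) -> str:
    """4/ Lab techs and specialists: external tool use + continuous verification."""
    lab_result = "lab panel: ..."  # stand-in for an external tool call
    return call_model(f"Given {lab_result}, confirm or revise: {hypothesis}")

def answer(question: str) -> str:
    """Wire the four stages together; the 'intelligence' lives in the pipeline."""
    specialty = triage(question)
    context = gather_context(question)
    hypothesis = differential(f"[{specialty}] {question}", context)
    return verify(hypothesis)

if __name__ == "__main__":
    print(answer("72M with RUQ pain and fever"))
```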

Aliaks Ramaniuk

tldr: meh.

Overall, interesting. Big fan of the stack, and a solid effort with Dr. Farag that makes a few valid points. However, the article was difficult to get through. The main issue I see is that you fall into a few medical-student-level fallacies. The "right answers" from clinical case studies ("emphysematous chole because RUQ pain and fever, and PAD because AKI and cold fingers") are fine as med student board-prep pearls, but they are not how pulmonary and critical care medicine is practiced, and they are not a fair LLM test. Trash in, trash out. It reminds me of poorly written USMLE questions that reflexively say "if Bubba and his dog Boss have a cold, the right answer can only be blastomycosis." Not real life, and not a useful model test. The ACEi entrapment on timing also falls short. Overall, keep up the effort, but validation modeling would be best served by complex differential development, not single reflexive answers on a poorly written multiple-choice shelf exam.

Ruby Wang

This is fantastic, thank you.

Peter Farag

Thank you. I'll share something soon, focused more on the larger implication, which is a bit concerning.

Thomas W. Dinsmore

Glad I use Gemini LOL

Sergei Polevikov

Personally, I don’t think the hard-coded override Gemini used in this particular case is a long-term solution. But hey, it worked here.

Martian

Exactly, it’s about the right product success eval criteria (in this case, both accuracy and how the model arrives at its conclusion matter a lot in medical diagnosis). Training, evals, and architecture are all important when it comes to a scalable solution.

Thomas W. Dinsmore

I don't ask it for a medical diagnosis, tbh

Peter Frishauf

Thanks for your constructive vigilance in pushing AI to be a more meaningful tool to improve patient care.

Martian

Thank you both for sharing the thorough testing and thought-provoking analysis. Perhaps you could create a medical diagnosis benchmark arena built on such medical first principles/reasoning, with cases where causality should be prioritized over other metrics. Otherwise, LLMs’ working mechanism is basically pattern matching (probability). As you pointed out, MAD may not work if the right success criteria are not selected for agents with different purposes. Not surprised that Google Gemini has now hardcoded temporal reasoning, as they may have worked with many healthcare practitioners and recognized the proper success criteria for medical diagnosis.

Sergei Polevikov

I agree. We need a solid benchmark for medical diagnosis / clinical reasoning. As our exercise showed, strong pattern recognition alone may not be enough.

Martian

Btw, I’m in AI product for a different industry vertical (retail) and have done quite a bit of causal inference/reasoning ML development in addition to traditional pattern-learning ML. I’m really interested in the care space and would like to explore a potential partnership here as a side project. If you’re interested, let me know!

Andrew Harrison

🔥