11 Comments
Josh:

This was a super interesting article! But I suspect the differences you're seeing are mostly from specific product decisions in the chat app, rather than any differences between individual SOTA models.

Which especially makes sense in the context of healthcare! Sure, each individual doctor is smart and well trained, but the "intelligence" of the system lies in the org structure, context management, and processes:

1/ The triage nurse: The initial prompt engineering and intent routing.

2/ The chart and patient history: The context window + maybe RAG pipeline

3/ The differential diagnosis: The explicit agentic loop mapping out multi-step reasoning as well as the system prompt

4/ Lab techs and specialists: The external tool use and continuous verification.
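The four-part analogy above can be sketched as a toy pipeline. This is a hypothetical illustration only (the function names, routing rule, and retrieval logic are all invented stand-ins, not anything from the article's actual setup): triage routes intent, the chart becomes context retrieval, the differential is a reasoning step, and the specialist is a verification pass.

```python
from dataclasses import dataclass, field

@dataclass
class PatientQuery:
    text: str
    history: list[str] = field(default_factory=list)

def triage(query: PatientQuery) -> str:
    """Step 1 (triage nurse): route the query to an intent category."""
    return "diagnosis" if "symptom" in query.text.lower() else "general"

def retrieve_context(query: PatientQuery) -> list[str]:
    """Step 2 (chart/history): pull relevant history into the context window.
    Naive recency-based slice as a stand-in for a real RAG pipeline."""
    return query.history[-5:]

def reason(query: PatientQuery, context: list[str]) -> list[str]:
    """Step 3 (differential diagnosis): stubbed multi-step reasoning loop."""
    return [f"hypothesis from: {c}" for c in context] or ["no prior data"]

def verify(hypotheses: list[str]) -> str:
    """Step 4 (labs/specialists): external verification before answering.
    Placeholder: just picks the first hypothesis."""
    return hypotheses[0]

def run_pipeline(query: PatientQuery) -> str:
    if triage(query) != "diagnosis":
        return "routed to general assistant"
    return verify(reason(query, retrieve_context(query)))
```

The point of the sketch is that "intelligence" lives in the wiring between the stages, exactly as in the hospital analogy: swapping the model inside `reason` changes less than swapping the triage or verification logic around it.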

Thomas W. Dinsmore:

Glad I use Gemini LOL

Sergei Polevikov:

Personally, I don’t think the hard-coded override Gemini used in this particular case is a long-term solution. But hey, it worked here.

Martian:

Exactly. It's about choosing the right success criteria for the product (in medical diagnosis, both accuracy and how the model arrives at its conclusion matter a lot). Training, evaluation, and architecture are all important for a scalable solution.

Thomas W. Dinsmore:

I don't ask it for a medical diagnosis, tbh

Peter Frishauf:

Thanks for your constructive vigilance in pushing AI to be a more meaningful tool to improve patient care.

Martian:

Thank you both for sharing the thorough testing and thought-provoking analysis. Perhaps you could create a medical diagnosis benchmark arena built on medical first principles/reasoning, with cases where causality should be prioritized over other metrics. Otherwise, LLMs' working mechanisms are basically pattern matching (probability). As you pointed out, MAD may not work if the right success criteria are not selected for the agents for different use purposes. Not surprised Google Gemini now has hardcoded temporal reasoning, as they may have worked with many healthcare practitioners and recognized the proper success criteria for medical diagnosis.

Sergei Polevikov:

I agree. We need a solid benchmark for medical diagnosis / clinical reasoning. As our exercise showed, strong pattern recognition alone may not be enough.
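A benchmark like the one proposed above would need to score more than final-answer accuracy. Here is a minimal, purely hypothetical sketch of what a per-case scorer could look like, where each case carries both a correct diagnosis and the reasoning steps (e.g. temporal or causal links) the model's explanation must contain. All names, fields, and the 50/50 weighting are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Case:
    correct_diagnosis: str
    required_reasoning_steps: list[str]  # causal/temporal links that must appear

@dataclass
class ModelOutput:
    diagnosis: str
    reasoning: str

def score(case: Case, out: ModelOutput) -> dict:
    """Score final-answer accuracy and reasoning-process quality separately."""
    accuracy = 1.0 if out.diagnosis.lower() == case.correct_diagnosis.lower() else 0.0
    steps = case.required_reasoning_steps
    hits = sum(step.lower() in out.reasoning.lower() for step in steps)
    process = hits / len(steps) if steps else 1.0
    # Equal weighting is an arbitrary illustrative choice.
    return {"accuracy": accuracy, "process": process,
            "combined": 0.5 * accuracy + 0.5 * process}
```

Substring matching on reasoning steps is obviously too crude for a real benchmark, but it makes the key design point concrete: a model that pattern-matches its way to the right label with the wrong causal story should not get full credit.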

Martian:

Btw, I'm in AI product for a different industry vertical (retail) and have done a fair amount of causal inference reasoning ML development in addition to traditional pattern-learning ML. I'm really interested in the care space and would like to explore a potential partnership here as a side project. If you're interested, let me know!

Andrew Harrison:

🔥