
Journal Review in Artificial Intelligence: Four Times Better Than Us

EP. 908 | 22 min 33 s
Artificial Intelligence
You have probably seen recent headlines that Microsoft has developed an AI model that is 4x more accurate than humans at difficult diagnoses. It's been published everywhere: the AI was 80% accurate compared to a measly 20% rate for humans, and the AI was cheaper too! Does this signal the end of the human physician? Is the title nothing more than clickbait? Or is the truth somewhere in between? Join Behind the Knife fellow Ayman Ali and Dr. Adam Rodman from Beth Israel Deaconess/Harvard Medical School to discuss what this study means for our future.

Studies:
Sequential Diagnosis with Large Language Models: https://arxiv.org/abs/2506.22405v1
METR study: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

Hosts:
Ayman Ali, MD
Ayman Ali is a Behind the Knife fellow and general surgery PGY-4 at Duke Hospital, currently in his academic development time, where he focuses on applications of data science and artificial intelligence to surgery.

Adam Rodman, MD, MPH, FACP, @AdamRodmanMD
Dr. Rodman is an Assistant Professor and a practicing hospitalist at Beth Israel Deaconess Medical Center. He’s the Beth Israel Deaconess Medical Center Director of AI Programs. In addition, he’s the co-director of the Beth Israel Deaconess Medical Center iMED Initiative.
Podcast Link: http://bedside-rounds.org/

Please visit https://behindtheknife.org to access other high-yield surgical education podcasts, videos and more.  

If you liked this episode, check out our recent episodes here: https://app.behindtheknife.org/listen

AI Journal Club #3

Ayman:

[00:00:00]

Imagine, Behind the Knife listeners, that you're sitting in the ICU on your 19th night in a row, and your patient is post-op day five from a Hartmann's

procedure. He goes from a one to a four liter per minute oxygen requirement. You don't think anything of it; he's a big guy, and he's sleeping without a CPAP. But then you get a notification from the hospital AI to consider a stat chest x-ray. Fine, you order it. The next thing you see is a massive right-sided hemopneumothorax.

Somehow the AI caught the iatrogenic injury from the central line placed yesterday, which developed slower than you would've guessed. Is this possible? Welcome, everyone. I'm Ayman, one of the surgical education fellows at Behind the Knife. Welcome back to our AI Journal Club, and today's episode is about a really hot paper from Microsoft that went viral.

It's titled Sequential Diagnosis with Large Language Models, and the typical news headlines are about a revolutionary medical AI that outperforms physicians, with a diagnostic accuracy of 80% compared to, unfortunately, a physician average of 20%. So today we're going to discuss what that means, what the implications are for us as physicians, and give a brief overview of the paper itself.

Now, luckily, I'm

[00:01:00]

joined by Dr. Adam Rodman to help dissect the study and tell you what you need to know. Dr. Rodman is an assistant professor, practicing hospitalist, and AI researcher at Beth Israel Deaconess Medical Center. Dr. Rodman, thank you for taking the time to join today.

Adam: It is my true pleasure.

Thank you for having me.

Ayman: Thank you again. I'll start with a brief overview of the study. The title is Sequential Diagnosis with Large Language Models, and when they say sequential diagnosis, they're referring to what we do every day, which is just a workup.

So how do you refine your differential with an iterative process and get to the most likely ultimate diagnosis? Now, to quantitatively score both the physicians and the AI, the authors developed a sequential diagnosis benchmark, and I think that's worth a minute to explain, because that's how they're grading everybody.

Now, in this benchmark, they have multiple LLMs that talk to each other, and each of them has a specific role. We call those different LLMs that talk to each other agents. So when people

[00:02:00]

say multi-agent modeling, they're talking about multiple LLMs, each with a specific role. So they ask each other questions.

One reveals information to the clinician or the AI taking the test, and the other discusses the clinical findings. The goal is to reach a final diagnosis, at which point the correctness, measured by closeness to the real-world diagnosis, is calculated along with the cost. Now, the LLM responsible for navigating this will also generate synthetic findings if the user goes off track and requests something the case file doesn't cover, and that's meant to avoid biasing them or giving them hints.

Anything you'd add to that about how the system works, Dr. Rodman?

Adam: No, I think that's a very good description.
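(For listeners who think in code, here is a minimal sketch of how a benchmark loop like the one Ayman just described could be wired up. To be clear, this is not the paper's implementation: the three roles, a gatekeeper that reveals findings, a diagnostic agent that orders tests, and a judge that scores the final answer, follow the description above, but call_llm, run_case, and every prompt and number below are hypothetical stand-ins.)

```python
# Minimal sketch (not the paper's code) of a sequential-diagnosis
# benchmark loop with three roles: gatekeeper, diagnostician, judge.

def call_llm(system_prompt: str, transcript: list[str]) -> str:
    """Hypothetical stand-in for any chat-completion API call."""
    raise NotImplementedError  # swap in a real provider to run this

def run_case(case_file: str, test_costs: dict[str, float], max_turns: int = 10):
    gatekeeper_prompt = (
        "You hold the full case file below. Reveal only findings that are "
        "explicitly requested. If a requested test is not in the file, "
        "synthesize a plausible result so the requester gets no hints.\n\n"
        + case_file
    )
    transcript = ["Initial presentation: " + case_file.splitlines()[0]]
    total_cost = 0.0

    for _ in range(max_turns):
        # The diagnostic agent only ever sees what the gatekeeper revealed.
        action = call_llm(
            "You are a diagnostician. Request ONE test or question per "
            "turn, or answer 'FINAL: <diagnosis>' when you are confident.",
            transcript,
        )
        if action.startswith("FINAL:"):
            diagnosis = action.removeprefix("FINAL:").strip()
            break
        total_cost += test_costs.get(action, 100.0)  # default price per order
        transcript.append("Request: " + action)
        transcript.append("Result: " + call_llm(gatekeeper_prompt, transcript))
    else:
        diagnosis = "no diagnosis reached"

    # A judge model grades closeness to the case's ground-truth diagnosis.
    score = call_llm(
        "On a 0-5 scale, grade how closely the candidate diagnosis matches "
        "the ground truth in this case file:\n\n" + case_file,
        ["Candidate diagnosis: " + diagnosis],
    )
    return diagnosis, score, total_cost
```

(The design point to notice is that accuracy and cumulative test cost fall out of the loop together, which is how a study like this can report a diagnostic accuracy and a dollar cost per case side by side.)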

Ayman: Well, these cases are the next point that I want to talk about in this paper. They're all based on 304 NEJM clinicopathological conference (CPC) cases. These cases are extremely complex, and in the paper they mention that they can be common conditions, but, you know, looking at the first couple,

[00:03:00]

I see neonatal hypoglycemia from a biologically active teratoma, which is a diagnosis I would never have even heard of.

Adam: Yeah, I see that all the time, you know. Right, right.

Ayman: Exactly. It's one of those where I have no clue where that comes from, and they're rare.

And the other one I saw was embryonal rhabdomyosarcoma of the retroperitoneal region. You know, just trying to piece together those words is difficult for us.

Adam: The sad thing is, I remember reading that CPC.

Ayman: Yeah, and they're memorable for that reason. But the paper does say that these are common diagnoses, and I think that's one point that I would argue. I don't know what you would say about that.

Adam: They're not. They're not. I mean, I am but a general internist, but they are not diabetes-common, right? Embryonal rhabdomyosarcoma or teratoma? Uncommon.

Ayman: Very uncommon. So what they did was they tested humans, and these humans were US and UK physicians with a median of 12 years of experience, and they achieved an accuracy of 20% at a cost of

[00:04:00]

$3,000 per case.

So there are a couple of caveats here. Humans were not allowed any external resources, meaning no Google, no large language models, and you can't phone a friend. Each human did about 36 cases, and there were about 21 physicians, varying from hospitalists to primary care physicians. The LLM, however, functions very differently.

The LLM-based model works with multiple simulated doctors who talk to each other and argue the pros and cons of each next step in the diagnostic iteration. And each of these LLMs has a unique and distinct role that we'll get into later. That's a model; that's something I wish we had when talking about challenging cases.

It's what consultants or senior faculty members and mentors are for. Again, in this study, physicians were not allowed anything external. So I'll pause there and ask Dr. Rodman: what do you think about that?

Adam: Oh, you mean the fact that humans were solving the cases unaided? So, the justification for doing this, they had a pretty decent

[00:05:00]

justification, right?

Because these are published CPCs, and they were worried that if they allowed humans to use resources, they could just look up the CPC. I mean, it's necessary to take that context into account, right? So the humans didn't get to use them. I don't think it's necessarily a knock against the study.

We always make design choices. What I will say is that Google recently had a study published in Nature. It's an older study, actually on a PaLM 2-based model, where they compared their own system against people, but they also gave baselines: humans alone, humans using Google search, and humans using the AI.

And I think that methodology gives you perhaps a better comparison of what would happen if you start to add different resources. But to me that doesn't necessarily invalidate the study. It's just a design choice that I think was made for an important reason, because they're using CPCs.

But we do have to keep that in mind when we're talking about the magnitude of the difference.

Ayman: Yeah, and I think it's important for how the conclusions are portrayed. And I think that's something that gets

[00:06:00]

lost in some of these headlines. Because if you take most PCPs and ask them about one-in-a-million pediatric cases, I think it's fair that they get 20%.

Adam: Well, the other thing that I would say is that the numbers and the headlines really stripped out the larger context: the history of differential generators. I know you know this, but AI is not a new technology. We have been talking about differential generation, depending on how you define it,

usually going back to 1959 with Ledley and Lusted, and you can go back to 1982 and look at INTERNIST-1. And if you want to use this "solving CPCs X times better than a doctor" framing, we surpassed the AI being better than a doctor back in 1982, right? So it's one of those things. It doesn't invalidate the study.

I actually think it's a very important study, and it does something interesting. But for that headline value, it is not surprising that any sort of AI system, not even an LLM, even a

[00:07:00]

simple Bayesian system or an expert system, could outperform us. And in fact, if you look at Isabel, if you look at DXplain, they've been outperforming humans for decades.

Ayman: Yeah, and I think that's a very important point. When I was reading it, I thought, if anything, it didn't invalidate the need for humans; it made it more important. Because one of the points when you read this study is that these are very, very beautiful cases. I mean, they're in the NEJM for a reason.

They're structured very well. They have the information that you need. But I think real life is much different.

Adam: Well, and I think this is the important thing to talk about: the data set. It's not so much that they're very hard, and for benchmarking, hard cases are actually good, right?

You want to be able to see the difference. But one thing that makes CPCs unique is that they all have a pathological ground truth. In my field, in internal medicine, we often do not have a pathological ground truth, especially in primary care. But even in hospital medicine, we often

[00:08:00]

don't know. We have a working diagnosis, a clinical diagnosis, but we don't have a definitive ground truth.

And I think that's why this is actually important: it's important to study sequential diagnosis, and I think the CPCs are very good for that. But most of what you and I do in the real world doesn't have a pathological ground truth at the end to compare against. Or even when we get it, I mean, for surgery, you get it weeks later, right?

It doesn't affect what you're doing at the time.

Ayman: Right. And I think surgery is a little unique, and sometimes we have a little more certainty when you look inside, but a lot of the time you just don't know. We don't necessarily know the origin of the disease. We don't know why. And when you make the decision to

Adam: not operate, you don't know, right?

Exactly. Yeah. You only know on the ones where you make the decision to operate.

Ayman: Exactly. And I think the other thing that's important is that extracting this information from patients is tough in real life. You know, I don't think it's as easy as it's presented in these scenarios.

And I think anybody who's talked to a patient in the ED will get that. The stories

[00:09:00]

are hard. They make no sense sometimes. You have conflicting stories from family members, and trying to piece all that together is a challenge in and of itself.

And I guess one thing I wanted to note on that scenario is that I think the human aspect is still ultimately extremely important. Have you ever seen that box from medical school training, the one with common presentations of common diseases? Yeah.

And I think about that box. The other parts of it are rare presentations of common diseases and rare presentations of rare diseases, and a lot of the time in real life, as soon as we leave common-of-common, it's tough. I think it's tough.

Adam: Well, LLMs are good at common presentations of common diseases, and they're also good at common presentations of rare ones, right?

They're not good at rare presentations of common or rare presentations of rare. So they have their own sort of biases.

Ayman: Exactly. And I guess one of my questions for you is: how do

[00:10:00]

you think we should test that in the future?

Adam: Yeah, that's a great question. How do we test that, right?

So the answer is real clinical data, right? The real case mix. Now, sequential diagnosis is going to be really hard to test on real case mixes. Why? Because it's, what do you call it, like a choose-your-own-adventure game. And in real life you don't know what the tests you didn't order would have shown, right?

So sequential diagnosis is really, really important. If you're actually talking about real diagnostic decision support, you have to test sequential diagnosis, which is why I think this paper is so important. It is important to understand how LLMs actually go about selecting tests. But it's going to be really hard to test this even in real patient data without running a clinical trial.

Because, look, I do studies with my own clinical data. We recently had one in our emergency room, and you don't know what you don't know. I don't know what would've happened if you had gotten a CTA on this patient. If somebody wanted to do this, it's impossible to simulate, right?

Ayman: Yeah, and

[00:11:00]

that's a great point. When we make a conscious decision not to do something, we have to live with the fact that we just don't have that information.

Adam: I mean, I will say, we just got a paper accepted in NEJM AI with one of my brilliant grad students, Aliya McCoy, where one of the things that we found is that these very powerful reasoning models,

so o3 is the model behind the Microsoft multi-agent system, as a trade-off for their improved performance, lose the ability in the middle to have nuance. They become very extreme in their decisions. And you and I both know, in real medical care, the decisions you and I probably agonize over most are not when we did something, but when we chose not to do something: when we a hundred percent don't do the test,

when we don't do the surgery, when we say, you know, I think doing this will cause you more harm. And again, we need to test this; these are testable hypotheses. But from some of my own research, that seems to be the area where these reasoners in particular suffer.

Ayman: Yeah. And I guess, can you expand on that a

[00:12:00]

little bit? How do we approach that particular problem in the future?

Adam: With LLMs or with humans?

Ayman: With LLMs.

Adam: Yeah. So we're entering a really strange phase. First of all, I would say in general, I think we've got the context of this paper now: it does not mean that Microsoft's system is four times better than us.

It does mean that Microsoft's system can do sequential diagnosis well, at least on the CPCs. And the question is how that translates into the real world. Now, this is where the human-computer interaction matters so much, because I would make a guess, and we'll have to test it, but I would guess a priori that this thing is going to over-order tests and have false certainty about the next test it needs, which will benefit some patients but could actually cause harm to others.

And then what does that mean for the human user? Do we just lose our critical thinking skills? Does it make humans run away from nuance? We don't know

[00:13:00]

the answer to that, but human-computer interaction is tricky, and these models have weird impacts on us.

So, yeah, I'm excited about the system. I want to see much more work done in sequential diagnosis. I want to see how this works on real clinical data. But at the end of the day, something like this is not going to end up as a doctor in a box. It's going to be a human using it.

Right. And what does it mean for you as a surgeon or me as an internist working with the system? How does it change our decisions, and does it change them in a way that gets better care for our patients?

Ayman: Yeah, exactly. And you know, one thing that I think would be interesting, and I'm sort of thinking out loud here, but I wonder:

what if you allowed physicians to use LLMs for this sequential diagnosis when you're comparing them? I'm really interested in that performance. Can a human using an LLM as a tool outperform a purely LLM-based model?

Adam: You've designed a great study, because that would be a super interesting addition to this one. What does the human do? How does the human change the performance of the system?

[00:14:00]

I mean, this is why, you know, I've been a researcher for a while and I never got hate mail until the last year and a half, but I've gotten some, because a lot of my studies have shown that high-performing LLM systems don't necessarily improve physicians.

But I wonder, in that sequential diagnostic setup, would the human improve the LLM? Would it make it worse? We don't know, but it's a really interesting question.

Ayman: Yeah, I think that would be very fun. And on that point, the other thing I wonder about is that these LLMs are so sensitive to what you put into them and the way you phrase a question.

And we see that with prompt engineering. Perhaps you saw the Grok news recently, where one line in the system prompt caused some, uh, extremist behavior.

Adam: Well, uh, we'll just say that.

Ayman: Yeah. Very unfortunate extremist behavior. Now, that being said, it's going to be interesting to see whether there are differences in how

physicians use it and how they phrase the question to the model. But that's just a future thing, I think.

Adam: I mean, this is like some of the things that my research group is looking into. So one of

[00:15:00]

the challenges, I think, as you know, with language models is that they're very sycophantic.

And physicians, when they use them, have a tendency to love it when it agrees with them and disregard it when it disagrees with them. And the challenge is, when you put information into an LLM, you're putting in the orders and tests and information you have collected, and the LLM can grok what you're thinking, right?

If it sees the tests, it knows what I'm thinking, and I may love it because it just tells me what I'm already thinking. I suspect that effective systems are going to take the human out of that loop: they'll either collect information directly from patients, or they'll have some sort of knowledge graph that extracts and organizes information in the chart and feeds it into the LLM.

Because we also have to think about the effects on us, and LLMs are weird. LLMs change the way that humans think.
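(A toy illustration of that input-bias point, with invented prompts, nothing from any study: the same chart findings can be packaged in a way that leaks the clinician's working hypothesis, or in a neutral way closer to the "take the human out of the loop" setups Adam describes.)

```python
# Toy illustration (invented): the same findings, packaged two ways.
# The first prompt leaks the clinician's working hypothesis through the
# orders they chose, which a sycophantic model will tend to echo back.
# The second feeds only extracted findings, with no hint of a suspicion.

findings = "58M, fever 38.9 C, productive cough, WBC 14k"

biased_prompt = (
    f"{findings}. I ordered a chest CT and sputum culture and started "
    "empiric antibiotics for pneumonia. What do you think?"
)  # the order set telegraphs 'pneumonia' before the model says a word

neutral_prompt = (
    f"Findings extracted from the chart: {findings}. "
    "Give a differential diagnosis and the single most useful next test."
)

print(biased_prompt)
print(neutral_prompt)
```

(Whether a knowledge-graph style extraction layer actually removes that bias is, as Adam says, a testable question.)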

Ayman: Oh, absolutely. Absolutely. And I think that we are gonna see a lot of evidence of that

[00:16:00]

soon. Very soon.

Adam: Well, did you see the METR study today on coders?

Ayman: I didn't. What's that?

Adam: Yeah, so METR is a wonderful research organization, and they released a study looking at whether AI-based coding tools actually make people more efficient. They ran a randomized controlled trial similar to the one that we did last year. And what they found is that,

even though the experienced coders all expected that they would be much more efficient with AI tools, they forecast they'd be about 24% faster, and even after the study they estimated they had been about 20% faster. In reality, they were almost 20% slower: using the AI tools actually made them less efficient, even though they had the perception that it was making them more efficient.

So the human-computer interaction component is so important to actually seeing benefits, and it's hard to study outside of an RCT, because benchmarks without a person

[00:17:00]

don't tell you what it's gonna do to people.

Ayman: Exactly. Yeah. That's an amazing finding. I'm going to read that.

Adam: I'll send it to you.

Ayman: No, I appreciate it. And, you know, as someone who uses LLMs frequently to code, I can see where you get that finding. Especially for someone working in an environment they're already extremely familiar with, I can see how you get falsely led in a direction by AI-generated code, and that just ends up costing you time.

So I can see that.

Adam: Yeah, and it's consistent with all of the studies done in medicine right now, which have unfortunately all shown a very similar thing: even with a high-performing algorithm, you might make the human a little bit better, but you never get the human as good as the algorithm itself.

And in some of the imaging algorithms, you really see degradation from using the AI.

Ayman: Okay, right. And you know, we could probably talk for an hour on that topic alone, but what should we

[00:18:00]

say as final comments to wrap up the discussion of this study?

Adam: Yeah. What I would want doctors to know is: do not read the study and freak out about the headline. Microsoft has not made a system that is four times better than them. The numbers have meaning, they're not meaningless, but they need to be taken in the context of what we're talking about, and this system has nothing to do with replacing your jobs.

But studying sequential diagnosis is really important. I think this is a cool system. You know, this multi-agent combination of experts, we'll have to see how it works. I don't know if it works any better than just the base model, which is o3, a very powerful model by itself, but maybe it does.

And studying sequential diagnosis is really one of the next things we need to start looking at when we're talking about clinical decision support. Traditionally, we never got to this area, because no one ever had AI algorithms that performed this well; naive Bayesian algorithms and expert systems never did.

So we never had to worry about this. So we're kind of in new territory, and I think it's really cool that we're in this

[00:19:00]

territory. But the general public should not read this, and doctors should not read this study and say, oh God, Microsoft has invented something that is going to replace us. It is a cool and important research study.

And doctors are probably never going to be replaced; surgeons are definitely never going to be replaced, for our surgical audience. But even good old-fashioned internists like me are a very long way from being replaced.

Ayman: Yeah. And I think, when I first read this and saw the multi-agent approach, the first thought that came to my head was: just imagine if we had five consultants who could actually talk to each other, going through every CPC together.

Adam: Yeah, exactly. But again, it's fine. It's fine. I think the problem when you have a human baseline is that there's a temptation to say, okay, the model is this much better than the human, right? But this is not a task that we would ever expect a human clinician to do, let alone a generalist.

Ayman: Right, exactly, and I couldn't put it better. But that's all we have for today, and I can't thank you enough for your time. I learned a lot and I have some great papers to

[00:20:00]

read, which is amazing.

Adam: Oh, sorry. I have a tendency to do that.

Ayman: Well, thank you again, Dr. Rodman. And from Behind the Knife: dominate the day.
