

Ayman: Hey, everybody. I'm Ayman, one of the Behind the Knife fellows. Welcome back to our series on all things artificial intelligence, where we go over just what you need to know as a clinician.
In our last episode, Dr. Geoff, Dr. Aggarwal, and myself discussed some basics of AI, what it is, common misconceptions, and an overview of machine learning, neural networks, and deep learning. Today we'll dive into a domain of AI that's transforming the way we interpret medical images, among other things. This is computer vision.
With so much more minimally invasive surgery, monitors, and recordings, there's a large potential for vision models in surgery and elsewhere. So in this episode, we'll discuss the basics of how these vision models work, the impact they're already having, especially in fields like radiology, and what the future holds, including multimodal models that can combine vision and language, like GPT-4's image understanding capabilities.
For this and our future AI episodes, I'm honored to be joined by a team out of Oregon Health and Science University. Today I'm with Dr. Julie Doberne and Dr. Song. Dr. Doberne
is a cardiothoracic surgeon and board-certified informatician at OHSU who is a digital health expert and uses AI techniques in her research.
Hey, Dr. Doberne.
Doberne: Hey there. Happy to be here today.
Ayman: And we also have Dr. Song, who's a professor in biomedical data science with research interests in machine learning, and specifically in biomedical image computing. Her recent publications include using complex machine learning algorithms to detect differences in images of eye diseases, such as infectious keratitis, for example fungal keratitis, and for early cancer detection.
Welcome, Dr. Song
Song: Hi there.
Ayman: And with that, let's get started. So I'll start with the definition of computer vision. Computer vision is a field within artificial intelligence that enables machines to interpret and understand the visual world. For example, by processing images like x-rays or MRIs, or videos like real-time endoscopic footage, computers can perform tasks such as image classification, object
detection, or segmentation.
Song: Yeah, I just want to emphasize that this analysis can be done across different scales, from microscopy to whole-anatomy imaging; across imaging modalities, from CT and MRI to microscopy and OCT; and across time, such as in videos and longitudinal data.
Ayman: Perfect, and let's go back through some of the terms that we had just mentioned, and we can start with image classification here.
We're trying to determine what category an image may fall into in the context of medicine. Take, for example, a chest x-ray. This classification may be as simple as normal or abnormal. Of course, you can classify in more depth, but that's a typical starting point.
When we talk about normal and abnormal, that's a binary classification. But the classification can be more granular, including multiple classes, for instance for different subtypes of cancer or different risk phenotypes.
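To make the idea of classification concrete, here's a minimal sketch in Python, assuming PyTorch; the class names and score numbers are invented for illustration, not from any real chest x-ray model.

```python
# A tiny sketch of what "classification" means computationally: the model outputs
# a score per class, and softmax turns those scores into probabilities.
import torch

# Binary case: normal vs. abnormal chest x-ray.
binary_scores = torch.tensor([0.3, 2.1])                 # [normal, abnormal]
print(torch.softmax(binary_scores, dim=0))               # ~[0.14, 0.86] -> "abnormal"

# Multiclass case: more granular labels.
classes = ["normal", "pneumonia", "effusion", "mass"]
multi_scores = torch.tensor([0.2, 1.8, 0.1, -0.5])
probs = torch.softmax(multi_scores, dim=0)
print(dict(zip(classes, probs.tolist())))                # highest probability wins
```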
Doberne: And in radiology and in surgery, knowing
abnormal versus normal is really important.
And this is where a large body of radiology AI programs have focused: helping triage reading lists for radiologists. For example, when reading a head CT, it's high priority to determine quickly whether it is a normal head CT or an abnormal one that warrants further imaging and workup. There are several off-the-shelf programs available now that can assist with this.
And they do everything from analyzing the image to prioritizing the physician's task list, such as that abnormal head CT, to ensure it gets read first, prior to other studies that are waiting to be read.
Ayman: Perfect. And next we'll talk about segmentation, which is a term that comes up frequently and refers to dividing an image into regions.
In other words, imagine a pathology slide of melanoma among normal tissue. Can you divide the image into what parts may be
melanoma and what is normal? That's really all segmentation refers to, but it is a term that arises quite frequently.
Doberne: Yeah, exactly. And there's a significant body of work out there looking at image classification and segmentation.
In laparoscopic surgery, researchers have trained computer programs to identify different tissue types and organs from a laparoscopic image, for example, knowing that this area of the image is physiologically distinct from that area of the image. So segmenting this tan area from this dark red area, classifying the latter, for example, as the liver and the former as the large bowel, et cetera.
And this type of analysis has proven useful in quantifiably assessing the conduct of an operation, such as a lap chole, and how long it took to dissect out the
critical view of safety.
Song: I want to echo what Dr. Doberne said. In essence, segmentation tells you where the object or region of interest is: its boundary, its shape, and its morphology.
Classification, on the other hand, tells you what it is. Sometimes segmentation and classification can be done together in one unified model.
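As a rough sketch of that distinction, the snippet below, assuming PyTorch and a made-up 4 by 4 image, shows what a segmentation output looks like: one label per pixel rather than one label per image.

```python
# Segmentation sketch: the model produces per-pixel class scores, and the
# predicted mask is the highest-scoring class at each pixel.
import torch

height, width = 4, 4
# Fake per-pixel scores for two classes: 0 = normal tissue, 1 = melanoma.
pixel_scores = torch.randn(2, height, width)

mask = pixel_scores.argmax(dim=0)   # a 4x4 grid of 0s and 1s
print(mask)                         # which pixels the model calls melanoma
print((mask == 1).sum())            # how many pixels were segmented as melanoma
```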
Ayman: This is all great, and those are some of the terms that come up quite frequently, so it's good to kind of get through those. I think now we'll talk about some of the more specific methods, and first we'll talk about the deep learning ones.
They're complex, but most are built on neural networks, which we briefly went over in the last episode. For a quick recall, neural networks are a machine learning design where different nodes connect to others to produce an output. The strength of the connections between nodes is weighted, and essentially those weights are what you're training. With respect to vision, the first category of models is convolutional neural networks. These are the backbone
of most image-processing AI models, and they use convolutional layers to detect features like edges, shapes, and textures at increasing levels of complexity. Dr. Doberne, do you mind explaining a convolutional layer?
Doberne: Absolutely. Let's bridge this back to what you mentioned about neural networks. Traditional neural networks process data as flat lists of numbers, but images aren't flat in meaning. They have spatial relationships. That's where convolutional layers shine. Imagine holding a tiny flashlight that scans across an image looking for specific patterns like edges, curves, or textures.
Each convolutional layer uses dozens of these flashlights, we call them filters or kernels, each trained to detect a unique feature. Mathematically, a convolutional layer is essentially a function that slides a window (the filter or kernel) across the input data and computes a weighted sum, which is then passed through a non-linear function.
What this
gives you is a map of the detected features in the input, and again, when we say features, we mean an edge, a shape, a texture, or higher-level relationships among them that build up to something about the image. Deep learning is basically having many such convolutional layers stacked together to give the model the power to learn highly complex features.
Importantly, these filters are learned during the training process, and each one is meant to detect one feature. And when we say slides, we mean the filter moves across the image a certain number of pixels at a time, and after that it gets complicated.
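For readers who want to see that in code, here is a minimal sketch of a single convolutional layer, assuming PyTorch; the input size and filter count are illustrative, not from any particular radiology model.

```python
# A minimal sketch of one convolutional layer (illustrative only; real models
# stack many such layers and are trained on large datasets).
import torch
import torch.nn as nn

# Pretend input: one grayscale CT slice, 512 x 512 pixels
# (batch size 1, 1 channel, height 512, width 512).
ct_slice = torch.randn(1, 1, 512, 512)

# One convolutional layer: 16 learnable 3x3 filters (kernels) slide across the
# image one pixel at a time (stride=1), and each produces a feature map.
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=1)

# Pass through a non-linear function (ReLU), as described above.
feature_maps = torch.relu(conv(ct_slice))
print(feature_maps.shape)  # torch.Size([1, 16, 512, 512]) -- one map per filter
```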
Yeah, so just to give a clinical example of how that works, imagine that you have a patient with right lower quadrant pain and you're suspicious of some appendicitis. You look at the CT scan and as you quickly go up and down through it, you notice that there's some funniness in the right lower quadrant.
As you zoom in, you call it fat stranding, and you zoom in further and find the appendix and notice that it looks quite inflamed.
At the end of the day, a CT slice is a two-dimensional grid of numbers, and each number is like the intensity of the image at that point. A filter slides across this, and when it finds that fuzziness you noticed as you scrolled through the CT, it'll light up and say, hey, there's something here, and it matches what you would otherwise call inflammation.
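To make that grid-of-numbers-plus-sliding-filter picture concrete, here is a toy, hand-rolled convolution in Python with NumPy; the tiny image and the edge-detecting kernel are invented for illustration, not how production models are implemented.

```python
# A toy convolution: slide a filter across a grid of numbers and take a
# weighted sum at each position.
import numpy as np

# Tiny fake "CT slice": a 2D grid of intensity numbers with a bright square.
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 9, 9, 9, 0],
    [0, 9, 9, 9, 0],
    [0, 9, 9, 9, 0],
    [0, 0, 0, 0, 0],
], dtype=float)

# A 3x3 filter that responds strongly to edges (bright next to dark).
kernel = np.array([
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
], dtype=float)

# Slide the filter one pixel at a time; the result is a feature map that
# "lights up" where the bright square meets the dark background.
h, w = image.shape
feature_map = np.zeros((h - 2, w - 2))
for i in range(h - 2):
    for j in range(w - 2):
        patch = image[i:i + 3, j:j + 3]
        feature_map[i, j] = np.sum(patch * kernel)

print(feature_map)
```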
And that's essentially what a convolutional neural network is doing. Now, the next type of model we'll talk about is vision transformers, which apply the transformer architecture to images. Some people may have heard of transformers mentioned in the context of large language models; in the vision world they're just as important, but a little bit different. In contrast to a convolutional neural network, instead of using a filter, a vision transformer looks at patches.
Song: Yes, vision transformers are indeed shaking up the field, and they are a fascinating pivot from
CNNs. Let's unpack this step by step. You are spot on about transformer architecture. It first revolutionized language models like GPTs by focusing on relationships between words in a sentence. Vision Transformers apply that same idea to images, but instead of words, they treat patches of the image as the tokens to analyze.
Imagine slicing a chest x-ray into a grid of smaller squares, say 16 by 16 pixels each. Each patch is fed into the model, which then uses self-attention to weigh how important each patch is relative to the others. So how do transformers differ from CNNs? CNNs use local filters to detect edges and textures within a sliding window.
They're great at capturing local patterns but can miss global context. Vision transformers, on the other hand, analyze all patches at once.
They learn long-range dependencies at the very beginning of the network, from layer one.
So that gives them a unique advantage in capturing global context. But what's the catch? Vision transformers are data hungry. Training them from scratch often requires massive datasets, which is challenging in specialty medicine. That is why many researchers pre-train vision transformers on nonmedical images, such as natural images, and then fine-tune these pre-trained models on smaller medical datasets.
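As a rough sketch of the patching and self-attention ideas described above, the snippet below, assuming PyTorch and illustrative sizes, splits a fake image into 16 by 16 patches and lets every patch attend to every other patch; it is not a full vision transformer or any model from the episode.

```python
# Vision-transformer idea in miniature: cut an image into patches, turn each
# patch into a token, and let self-attention relate every patch to every other.
import torch
import torch.nn as nn

patch_size = 16
embed_dim = 64

# Fake "chest x-ray": 1 image, 1 channel, 224 x 224 pixels.
xray = torch.randn(1, 1, 224, 224)

# 1) Patchify: a stride-16, 16x16 convolution maps each non-overlapping
#    16x16 patch to one embedding vector (one "token").
to_patches = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)
tokens = to_patches(xray)                      # (1, 64, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)     # (1, 196, 64): 196 patch tokens

# 2) Self-attention: every patch token attends to every other patch token,
#    so a finding in one corner is weighed against the rest of the image.
attention = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=4, batch_first=True)
attended, weights = attention(tokens, tokens, tokens)

print(attended.shape)  # torch.Size([1, 196, 64])
print(weights.shape)   # torch.Size([1, 196, 196]): how much each patch looks at each other patch
```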
Ayman: That's great, and I'm just going to relate back to that CT scan and appendicitis to explain the difference a little bit between a convolutional neural network and a vision transformer. A CNN is kind of like we took that magnifying glass, went right down to the fat stranding, found the edge of the appendix, and said, hey, this looks like inflammation.
A transformer in this context may step back, look at the whole image and say, Hey, wait a second. The appendix is
inflamed, but so is the entire ileum. And it may call this something like a terminal ileitis instead, and look at the entire scan in context and say, given the entire set of findings that we have, this is more likely to be a Crohn's type picture.
And we're not seeing appendicitis, but rather an acute flare of Crohn's disease. So a convolutional neural network may miss that by taking a magnifying glass and looking just at that border around the appendix where you have some fat stranding, but a vision transformer will be able to take that piece of fat stranding and relate it, at the same time, to the inflammation around the ileum that's far away from the appendix.
So that's a little bit of how they differ clinically.
However, that being said, clinically we typically want both. We want the precision of a CNN with the reasoning of a transformer. In other words, you have to be able to find that tiny detail on a CT scan
and yet also put it into the entire clinical context. That's what hybrid models are for, and they do exactly that.
They combine these two approaches in order to catch both the tiniest bit of fat stranding by the appendix and the contextual picture of what the rest of the small bowel looks like.
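To illustrate the hybrid idea, here is a small sketch, assuming PyTorch: a tiny CNN front end extracts local features (the magnifying glass), and a transformer encoder then relates those features across the whole image (the global context). The layer sizes and two-class output are invented for illustration, not a real clinical model.

```python
import torch
import torch.nn as nn

class TinyHybrid(nn.Module):
    def __init__(self, embed_dim=64, num_classes=2):
        super().__init__()
        # CNN front end: local edges/textures, downsampling the image 4x.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer back end: every location attends to every other location.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        features = self.cnn(x)                        # (B, 64, H/4, W/4)
        tokens = features.flatten(2).transpose(1, 2)  # (B, num_locations, 64)
        tokens = self.transformer(tokens)
        return self.classifier(tokens.mean(dim=1))    # pool over locations, then classify

model = TinyHybrid()
scores = model(torch.randn(1, 1, 224, 224))
print(scores.shape)  # torch.Size([1, 2]) e.g. appendicitis vs. not
```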
Ayman: So those are some basic fundamentals, but let's talk applications for just a moment to wrap up the episode. First, we'll talk about some nuances, starting with findings in medicine that are extremely rare, which happens to be a lot of the things we look at. For example, cancer may be only 1% or less of the data in a typical problem, like looking at chest x-rays.
This creates a problem called class imbalance, and it's important to discuss because it makes a difference for how you evaluate models. Take for example that you're trying to train a model that classifies chest x-rays, and there are five cancer images out of 5,000. If the model says negative for every single image and never detects a cancer, it's still right 4,995 out of 5,000 times.
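Here's a quick back-of-the-envelope version of that arithmetic in Python, using the same made-up numbers, to show why accuracy alone is misleading with class imbalance.

```python
# Numbers from the example above: 5 cancers out of 5,000 chest x-rays.
total = 5000
cancers = 5

# A "model" that calls everything negative:
true_negatives = total - cancers   # 4,995 studies correctly called normal
false_negatives = cancers          # all 5 cancers missed

accuracy = true_negatives / total  # 0.999 -> looks excellent
sensitivity = 0 / cancers          # 0.0   -> catches no cancers at all
print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.1f}")
```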
And I wanted to point this out here because it's a good time to reiterate that how results are reported matters. Another subtle point is that you always have to take into account what a model was trained to do. For example, if your chest x-ray model is only trained to detect pneumonia and nothing else, then a person with cancer on their chest x-ray will be missed by the model, because that was not the intention of the model's design or training.
So Dr. Doberne, do you wanna talk about a couple of solutions to some of these problems, like class imbalance?
Doberne: Yeah, and this is also a really good time to just reemphasize that an AI model is only able to do what it is trained to do, and it should only be relied on to perform that task. This is a great opportunity to just reiterate how important it is to keep the human in the loop to ensure that a holistic, clinically
relevant interpretation is ultimately delivered by the clinician.
And a little bit more about how to deal with the class imbalance problem. One solution is to upsample rare cases and/or downsample common cases. By doing that, we generate a training set in which the rare cases are better represented, and that can be used to train the model. We can also use a cost function that penalizes the model more for the errors it makes on the rare cases.
For example, making a missed rare case count 10 times worse than a mistake on a common, negative case.
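Here is a minimal sketch of those two fixes, assuming PyTorch; the class counts, sampling weights, and the 10x penalty are illustrative, not from any specific study.

```python
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

# (1) Upsampling rare cases: draw rare examples more often when building batches
#     (the sampler would be passed to a DataLoader in real training).
labels = torch.tensor([0] * 4995 + [1] * 5)   # class 1 = the rare "cancer" class
sample_weights = torch.where(labels == 1, torch.tensor(999.0), torch.tensor(1.0))
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels))

# (2) Cost-sensitive loss: a missed cancer is penalized 10x more than a
#     mistake on a normal study.
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 10.0]))

# Toy usage: two predictions (logits) and their true labels.
logits = torch.tensor([[2.0, -1.0], [1.5, 0.5]])  # model leans "normal" for both
targets = torch.tensor([0, 1])                     # but the second one is a cancer
print(loss_fn(logits, targets))                    # the missed cancer dominates the loss
```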
Ayman: Pivoting back to applications, radiology does have good examples. There are models out there that can read chest x-rays quite well and detect pathologies, like CheXNet, or models that can identify large vessel occlusions in stroke faster than clinician review, which can bring important studies to the top of a reading queue.
Dr. Song, any examples from
your own research that you'd like to talk about?
Song: Sure, happy to share some of the projects I am working on or have worked on. One project I worked on before was using dynamic contrast-enhanced MRI to predict treatment response in breast cancer patients, in how they responded to neoadjuvant chemotherapy.
Typically the therapy takes several months, four or five months. But by looking at the dynamic contrast-enhanced MRI and analyzing the heterogeneity of the tumor, we can extract markers that allow us to predict who will respond to this treatment very early on, either at baseline or one week into treatment.
This will allow us to switch patients off the therapy if they're not going to be responders. Another example is precision oncology, where we
integrate multiparametric MRI with histopathology and molecular imaging, specifically multiplex immunofluorescence, which is a type of spatial proteomics. We use those combined multimodal data to risk-stratify prostate cancer patients
if they have early disease and to predict outcomes. So those are two examples.
Doberne: That's really exciting research. Some of the projects that I'm working on involve training a computer vision model to analyze a chest x-ray to see if there is abnormal aortic pathology, and to evaluate whether infusing expert clinician data, such as the reading patterns of clinicians, can augment the output of the computer vision model. Overall, I'm really excited to see how, as data capture and
integration improve, computer vision can evolve and grow into other spaces, for example, the open procedural space. As a cardiac surgeon, I'm really interested in seeing how computer vision can identify and measure intraoperative conduct in an open heart surgery case, for example, or retrospectively analyze the workflow of a team during a code.
So lots of exciting stuff on the horizon.
Ayman: Yeah, definitely. This is all incredible work and I would love to hear more. But today we've only briefly touched the surface of vision models, and they are much more complex than we've laid out here. There are also models that can handle multiple sources of data, like text, images, and other sources.
Also, just a final point is that there are challenges that are specific to vision models, like how to acquire and label data, but there are also those that are common to all AI models, like generalizability or
ethical and legal implications. Nonetheless, these models are exciting and they're here today and for the future.
And on our next episode, we'll be talking about natural language processing and large language models specifically. And that's all from Behind the Knife. Dominate the day.
Dominate the day.