Artificial Intelligence Faces Challenges Differentiating Between Anatomical Directions in Medical Imaging
In a recent study, researchers found that state-of-the-art vision-language models, including GPT-4o and Pixtral, struggle to identify the precise anatomical position of organs in medical scans, particularly when images are flipped, rotated, or otherwise deviate from standard anatomical presentation.
The study, which was conducted using the Beyond the Cranial Vault (BTCV) and Abdominal Multi-Organ Segmentation (AMOS) datasets, tested four vision-language models: GPT-4o, Llama3.2, Pixtral, and DeepSeek's JanusPro.
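The paper's actual querying harness is not reproduced here, but a probe of this kind can be sketched in a few lines. In the hedged example below, the `query_vlm` helper, the prompt wording, and the use of the OpenAI client are illustrative assumptions; the open-source models in the study would be queried through their own interfaces.

```python
# Minimal sketch of a relative-position probe, assuming axial slices from
# BTCV/AMOS have been exported as PNGs. `query_vlm` and the prompt wording
# are illustrative assumptions, not the study's actual protocol.
import base64
from openai import OpenAI

client = OpenAI()

def query_vlm(image_path: str, question: str) -> str:
    """Send one image plus a yes/no question to a vision-language model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question + " Answer only yes or no."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().lower()

# Probe the same slice in its original and horizontally flipped form.
# A model reading the pixels should flip its answer; one reciting
# textbook anatomy ("the liver is on the right") should not.
question = "Is the liver to the left of the spleen in this image?"
for path in ["slice_axial.png", "slice_axial_flipped.png"]:
    print(path, "->", query_vlm(path, question))
```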
The findings suggest that these models rely on prior anatomical knowledge rather than visual inspection, which undermines their ability to detect unusual or aberrant anatomy, a crucial skill in diagnostic medicine. The models also struggle with image orientation and contextual cues, producing inaccuracies when images lack clear anatomical markers.
The study also found that the models' performance varies by body region and image type. GPT-4o, for instance, reads some regions, such as the abdomen, more accurately than others, such as the pelvis. The models' effectiveness also depends heavily on sufficient context and task-specific training for medical imaging.
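Stratified results like these are typically produced by grouping per-question outcomes over metadata. A minimal sketch with pandas, assuming a hypothetical results file with `model`, `region`, `marker`, and `correct` columns:

```python
# Sketch: stratify accuracy by model, body region, and marker type.
# The file name and column names are assumptions for illustration.
import pandas as pd

results_df = pd.read_csv("mirp_results.csv")
by_region = (
    results_df
    .groupby(["model", "region", "marker"])["correct"]
    .mean()
    .rename("accuracy")
    .reset_index()
)
print(by_region.sort_values("accuracy", ascending=False))
```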
Another concern is that these models can produce highly accurate outputs under ideal conditions yet fail abruptly when contextual cues are missing or cases are uncommon, limiting their reliability in clinical decision-making.
Despite these limitations, the study highlights ongoing improvements in diagnostic accuracy and the expanding utility of AI in early detection and patient screening contexts. However, it is crucial to note that AI still cannot fully replace expert human interpretation and oversight.
To test whether AI models rely on prior knowledge instead of image content, the researchers developed a dataset called Medical Imaging Relative Positioning (MIRP). The dataset contains an equal number of questions with "yes" and "no" answers, with the anatomical structures in question optionally marked for clarity.
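One plausible way to build such a balanced question set, sketched below under the assumption that organ segmentation masks (as ship with BTCV and AMOS) are available, is to compare organ centroids and phrase each question so the ground-truth answer comes out "yes" or "no" as needed. The templates and helper names are illustrative, not the paper's exact protocol.

```python
# Sketch of generating balanced yes/no relative-position questions from a
# labeled CT slice, in the spirit of MIRP. Label values and question
# templates are assumptions for illustration.
import numpy as np

def centroid(mask: np.ndarray) -> tuple[float, float]:
    """Return the (row, column) centroid of a binary organ mask."""
    ys, xs = np.nonzero(mask)
    return float(ys.mean()), float(xs.mean())

def make_question(label_map: np.ndarray, a: int, b: int,
                  name_a: str, name_b: str, want_yes: bool) -> tuple[str, str]:
    """Phrase one question whose ground-truth answer is forced to yes or no."""
    _, xa = centroid(label_map == a)
    _, xb = centroid(label_map == b)
    a_is_left = xa < xb
    # Choose the asked direction so the ground truth matches want_yes,
    # letting the final set be split evenly between yes and no answers.
    direction = "left" if a_is_left == want_yes else "right"
    question = f"Is the {name_a} to the {direction} of the {name_b} in this image?"
    return question, "yes" if want_yes else "no"

# Example (BTCV-style labels, spleen = 1, liver = 6):
# q, a = make_question(label_map, a=6, b=1, name_a="liver",
#                      name_b="spleen", want_yes=False)
```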
The researchers repeated the experiments using CT slices annotated with letters, numbers, or red and blue dots, adjusting the question format to reference these markers. Without such markers, every model averaged near 50 percent accuracy, the chance level, indicating an inability to reliably judge relative positions from the image alone.
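Both pieces of this setup are simple to sketch: stamping visual markers on a slice with Pillow, and checking whether a measured accuracy is distinguishable from coin-flipping with a binomial test. The coordinates, marker style, and counts below are invented for illustration.

```python
# Sketch: stamp letter markers at two organ centroids, then test whether
# a model's accuracy differs from the 50% chance baseline.
from PIL import Image, ImageDraw
from scipy.stats import binomtest

def add_markers(png_path: str, points: dict[str, tuple[int, int]]) -> Image.Image:
    """Draw a circled letter at each (x, y) point so questions can say 'A'/'B'."""
    img = Image.open(png_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for letter, (x, y) in points.items():
        draw.ellipse([x - 12, y - 12, x + 12, y + 12], outline="red", width=2)
        draw.text((x - 5, y - 8), letter, fill="red")
    return img

marked = add_markers("slice_axial.png", {"A": (110, 180), "B": (260, 175)})
marked.save("slice_axial_marked.png")

# Is 52 correct out of 100 distinguishable from guessing?
print(binomtest(52, n=100, p=0.5).pvalue)  # large p-value -> chance level
```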
GPT-4o was the overall top performer, and Pixtral led among the open-source models when letter or number markers were used. JanusPro and Llama3.2, by contrast, saw little to no benefit from the visual markers.
The findings also suggest that, in some cases, these models may not be reading uploaded PDFs or examining the images at all, instead answering from assumptions triggered by the prompt. This raises concerns about the accuracy of their responses and underscores the need for careful clinical validation and human-AI collaboration in medical imaging applications.