Why GPT-4 Is Not the Robo-Nurse You’ve Been Praying For

Ziv Gidron Head of Content, Hyro

Here’s what you probably don’t need right now: another article covering every single reason GPT-4 is better than its predecessor GPT-3.5 (which serves as the LLM foundation for the “basic” version of ChatGPT), its exact capabilities, how to access it, or why this is important. You can find all of this information with a quick Google (or Bing AI) search. 

But just to swiftly get this out of the way:

  • GPT-4 is reported to have roughly 1.2 trillion parameters, several times GPT-3.5’s 175 billion, although OpenAI has not published an official figure. More parameters give a model greater capacity to learn complex patterns and generate diverse texts.

  • GPT-4 is a multimodal model, meaning it can process image inputs as well as text inputs. It can produce texts based on visual cues such as photos or drawings, describe what is happening in an image, and even write captions for memes (and what could possibly be more important than that?).
  • GPT-4 outperforms GPT-3.5 on various benchmarks that measure its ability to generate accurate and relevant texts for different tasks, such as question answering, natural language inference, and sentiment analysis. Impressively, GPT-4 even scored in the 90th percentile when tested on the US Bar Exam.


At this point, you may be rubbing your palms together, imagining the countless scenarios in which GPT-4 will swoop in to bridge your workforce gaps, chatting with your patients and refilling their prescriptions while your overextended team takes a much-deserved breather.

But, as much as I hate to be the bearer of bad news, the fact remains that until a myriad of core flaws in the very structure of Large Language Models (LLMs) are addressed, GPT-4 (or any other LLM, for that matter) is still a long stretch away from the robo-nurse of your graveyard shift’s dreams.

GPT-4’s Healthcare Pitfalls

Despite its genuinely jaw-dropping capabilities, GPT-4 suffers from the same shortcomings that rendered ChatGPT, and even its medical-focused “cousin” Med-PaLM, inadequate for patient-facing duties.

GPT-4 lacks three core components that would allow it to serve patients directly without risk: 

1. Real-Time Custom Knowledge

Although GPT-4 is a far larger model than GPT-3.5, its knowledge cutoff date is September 2021. To perform patient-facing tasks like appointment scheduling or prescription refills, GPT-4 would require access to health system-specific data (which would be incredibly unsafe), and even that data would become “stale” the second any change takes place. So, for example, if even one of your physicians changes their availability or location, the model’s answers become inaccurate and therefore unreliable. Given GPT-4’s enormous size and architecture, retraining it every time even the most minute adjustment takes place is essentially impossible.
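To make the staleness problem concrete, here is a minimal Python sketch. It assumes, purely for illustration, that a model's training data behaves like a frozen copy of a scheduling table; the physician name, the record fields, and the `stale_entries` helper are all hypothetical and not part of any real system.

```python
import copy

# Hypothetical live scheduling data, standing in for a health system's source of truth.
live_schedule = {
    "Dr. Lee": {"location": "Main Campus", "next_slot": "Mon 09:00"},
    "Dr. Patel": {"location": "North Clinic", "next_slot": "Tue 14:30"},
}

# Assumption: training a model on this data is like freezing a copy of it.
training_snapshot = copy.deepcopy(live_schedule)

# Later, a single physician changes location in the live system.
live_schedule["Dr. Lee"]["location"] = "North Clinic"


def stale_entries(snapshot: dict, live: dict) -> list:
    """Return the names whose frozen record no longer matches the live record."""
    return sorted(name for name, record in snapshot.items() if live.get(name) != record)


print(stale_entries(training_snapshot, live_schedule))
```

The frozen copy is out of date the moment the live record changes, and nothing short of re-copying it (for a model, retraining) brings the two back in sync.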

OpenAI released some “initial support” plug-ins last week that are supposed to fill this gap somewhat, but their security, effectiveness, and accuracy remain to be seen. That said, even in a best-case scenario where these APIs communicate perfectly with your system, connecting your EHR to an enormous LLM owned by a private company is a guaranteed recipe for personally identifiable information (PII) breaches and leaks.

To summarize: No real-time, constantly updated domain-specific knowledge + no viable way of training GPT-4 on new information = no go for patient-facing assignments. 

2. Safety and Predictability

OpenAI claims it spent six months making GPT-4 safer and more accurate than its predecessor, GPT-3.5. According to the company’s internal evaluations, GPT-4 is 82% less likely than GPT-3.5 to respond to requests for content that OpenAI does not allow and 40% more likely to produce factual responses. Although these are laudable improvements, GPT-4’s precision is still a far cry from the demands of patient-facing medical care.

So even in a hypothetical future where GPT-4 is accurate 99% of the time, a single error could have disastrous results within the life-or-death stakes of medical care. One incorrect answer to a patient’s query (e.g., what drug should be taken to address a specific condition) could lead to hospitalization and even death. 
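A rough back-of-the-envelope calculation shows why 99% is not enough at scale. The sketch below assumes each answer is independently correct with the same probability, which is a simplification, and the query volumes are hypothetical.

```python
def prob_at_least_one_error(accuracy: float, num_queries: int) -> float:
    """Chance that at least one of `num_queries` answers is wrong.

    Assumes every answer is independently correct with probability `accuracy`.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if num_queries < 0:
        raise ValueError("num_queries must be non-negative")
    # P(no errors) = accuracy ** n, so P(at least one error) is its complement.
    return 1.0 - accuracy ** num_queries


# Hypothetical patient query volumes at the article's 99% accuracy figure.
for n in (1, 100, 1000):
    print(f"{n:>5} queries: {prob_at_least_one_error(0.99, n):.1%} chance of an error")
```

Under these assumptions, the chance of at least one wrong answer is about 63% after 100 queries and is all but certain after 1,000.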

Currently (and as mentioned above), any use cases involving PII subject to HIPAA compliance are years away for GPT-4 as a foundation for a standalone solution. And let’s just say it is very unlikely that OpenAI would be open to exposing itself to the stringent requirements of the HIPAA Business Associate Agreement (BAA).

3. Explainability

LLMs like GPT-4 are essentially enormous black boxes; you know what went in (inputs) and what came out (outputs), but you have no idea what happened in between. When GPT-4 falls short, there is no way of debugging it or pinpointing the source of the error to fix it locally.

Solving an error starts with knowing how to find the problem, and as spectacularly capable as GPT-4 may be, that’s just one of the many things it can’t do.

[Figure: LLMs are essentially enormous black boxes]

GPT-4 Use Cases for Healthcare

To indulge the robo-nurse metaphor a little further, GPT-4 can be described most aptly as a robo-secretary.

If that sounds slightly underwhelming, keep in mind that in 2019, US physicians spent a total of 125 million hours completing documentation outside work hours. During a typical 11.4-hour workday, primary care doctors spend 4.5 hours on EHR duties while in the office and an additional 1.4 hours per day outside of clinic hours, in the early morning or after 6 p.m., including 51 minutes on the weekend. 
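Adding those figures up gives a sense of scale. The short calculation below uses only the numbers quoted above and assumes the 1.4 hours of after-hours EHR time counts toward the 11.4-hour workday; the constant and function names are illustrative.

```python
# Figures quoted in the paragraph above (hours per day).
WORKDAY_HOURS = 11.4
EHR_IN_CLINIC_HOURS = 4.5
EHR_AFTER_HOURS = 1.4


def ehr_share(workday: float, in_clinic: float, after_hours: float) -> float:
    """Fraction of the workday spent on EHR duties.

    Assumes the after-hours time is included in the total workday.
    """
    return (in_clinic + after_hours) / workday


total_ehr_hours = EHR_IN_CLINIC_HOURS + EHR_AFTER_HOURS
share = ehr_share(WORKDAY_HOURS, EHR_IN_CLINIC_HOURS, EHR_AFTER_HOURS)
print(f"EHR time: {total_ehr_hours:.1f} hours per day ({share:.0%} of the workday)")
```

That works out to 5.9 hours, or roughly 52% of the workday, spent in the EHR.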

Health systems across the country are buckling under the load of (digital) paperwork, and much of this pressure can be offset by GPT-4 today. 

Here’s just a partial list of medical administrative tasks GPT-4 can tackle:


  1. Summarizing and analyzing research papers: GPT-4 is incredibly effective in summarizing, analyzing, and presenting key points from long-form text, drastically reducing the time needed to consume the latest research.

  2. Improving clinical documentation: GPT-4 can be used to improve the quality of clinical documentation by identifying missing or incomplete information in medical records.

  3. Coding and billing: 85% of physicians agree that documentation done solely for billing increases their total documentation time. GPT-4 can generate appropriate billing codes for procedures and treatments, helping to reduce errors and ensure clinicians are adequately compensated for their work.

  4. Clinical research: By analyzing large amounts of data and identifying patterns and trends that may not be easily traceable with traditional statistical methods, GPT-4 can be employed to help facilitate clinical research at increased velocity and scale.

  5. Patient satisfaction analysis: You can leverage GPT-4 to sift through and analyze patient feedback and sentiment across thousands of surveys and social media interactions to identify critical areas for improvement in patient satisfaction, access, and experience.

  6. Crafting ongoing patient communication and emails: Primary care physicians spend an average of 49.5 minutes per day on electronic communication, including both email and secure messaging. With a few simple prompts, GPT-4 can draft well-written, personalized messages for you.

An Ingredient, Not the Whole Dish

In a December 2022 blog post, just a month after ChatGPT was released and as people around the world were still trying to decipher what it actually meant, or, more importantly, whether or not it would take their jobs, Hyro’s CEO Israel Krush wrote the following:

ChatGPT (and/or any other LLM-based chatbot) is one major ingredient in the recipe, not the whole dish. While barriers have certainly been broken in natural language understanding (NLU), ChatGPT’s conversational prowess can only go so far. Statistically accurate isn’t enough – close to correct, for sensitive verticals such as healthcare and government, is a non-starter for enterprise implementation.

Krush added:

Michelin Star chefs don’t grow their own tomatoes, but they sure as hell know how to find the best around – similar to how most conversational AI solutions don’t reinvent the wheel with STT (Speech-to-Text) and TTS (Text-to-Speech), instead opting to source the strongest already in existence from Google or Microsoft, so too will conversational AI companies embed GPT-3.5, now the top large language model in the world, within their conversational stack.

GPT-4’s true potential for healthcare lies not in what it can do as a standalone offering or as the knowledge framework for ChatGPT Plus but in the incredible value it can add to much more dynamic, elastic, flexible, and perhaps most crucially, secure conversational AI solutions.

About the author
Ziv Gidron Head of Content, Hyro

Ziv is Hyro’s Head of Content and a passionate storyteller devoted to providing his audiences with insights that matter when they matter most. When he’s not obsessively consuming or creating content, you can find him taking walks with his son in the orchards and fields that surround their home.