What Is MedPaLM and Why You Need to Know About It
On December 27th, 2022—just a couple of days after Christmas and less than a month post-ChatGPT’s grand reveal to the world—a team of Google and DeepMind researchers published their findings on a new large language model they named MedPaLM and its “encouraging results” in a medical and clinical setting.
And while the globe is still firmly in the grip of a ChatGPT frenzy, its virtually unknown “cousin” MedPaLM could herald a not-too-distant revolution in how we administer care.
So What Exactly Is MedPaLM?
To understand what MedPaLM is, let’s first break down what each letter in its abbreviated name means (a bit of an acrostic poem, if you will):
Med stands for medical. MedPaLM is based on Google’s 540-billion-parameter PaLM model and was specially adapted for, and tested on, MultiMedQA, a newly formulated benchmark for testing large language models (LLMs) on medical and clinical applications. MultiMedQA combines six existing open question-answering datasets spanning professional medical exams, research, and consumer queries (MedQA, MedMCQA, PubMedQA, LiveQA, MedicationQA, and MMLU) with a new dataset named HealthSearchQA, focused on general medical knowledge searched for by consumers.
Pa stands for Pathways, a new AI architecture launched by Google in October 2021, designed to handle a multitude of tasks simultaneously and learn new tasks at hyper-speed. Breaking away from the conventional method of training machine learning models to perform a single task well, Pathways ushered in a new era of AI models capable of performing multiple tasks at once while also drawing upon and combining their existing skills to learn new tasks faster and more effectively.
“We want [this] model to have different capabilities that can be called upon as needed and stitched together to perform new, more complex tasks – a bit closer to the way the mammalian brain generalizes across tasks.” - Jeff Dean, Google Senior Fellow and SVP, Google Research
LM: Language Model
LM is short for Language Model (in this instance, a Large Language Model), a computational model designed to understand and generate human language. Or, in simpler terms, a statistical model that predicts the probability of the next word in a given sequence of words.
Given a sentence such as "The cat sat on the ____," a language model can predict that the most likely word to follow is "mat," "rug," or "chair," based on the probabilities of these words appearing in the context of the sentence.
Large language models are trained on vast text corpora using machine learning algorithms to learn the patterns and structure of language.
Once trained, these models can be utilized for a range of natural language processing tasks, including text generation, translation, sentiment analysis, and more.
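To make the "predict the next word" idea concrete, here is a minimal sketch in Python. It is a simple count-based model over a tiny invented corpus, not a neural network like PaLM, and the corpus, function names, and two-word context are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Tiny invented corpus for illustration; a real LLM learns from vastly more text.
CORPUS = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat slept on the chair",
]


def train_trigram_counts(sentences):
    """Count which word follows each pair of consecutive words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for first, second, following in zip(words, words[1:], words[2:]):
            counts[(first, second)][following] += 1
    return counts


def next_word_probabilities(counts, context):
    """Turn raw follower counts for a two-word context into probabilities."""
    followers = counts.get(context, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # context never seen in the corpus
    return {word: n / total for word, n in followers.items()}


counts = train_trigram_counts(CORPUS)
probabilities = next_word_probabilities(counts, ("on", "the"))

# Print candidate next words for "... on the ____", most likely first.
for word, p in sorted(probabilities.items(), key=lambda item: -item[1]):
    print(f"{word}: {p:.2f}")
```

With this toy corpus, "mat" comes out as the most likely continuation of "on the", followed by "rug" and "chair". A large language model does the same kind of thing at a vastly larger scale, with learned parameters in place of a lookup table of counts.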
How Was MedPaLM Developed?
MedPaLM is the product of an extensive iterative experiment conducted by a team of Google and DeepMind researchers to assess the efficiency and accuracy of large language models for medical and clinical applications.
What does that mean?
Imagine you’re trying to perfectly tune a violin. To hit the perfect note, you need to continuously increase and decrease string tension and turn the knobs ever so gently to reach the desired outcome. In many ways, this is precisely what MedPaLM’s creators did to achieve their objective. So rather than thinking about it as an entirely new language model, think about it as Google’s PaLM model deliberately and meticulously calibrated for maximum precision in a medical setting.
Ok, so how was it “tuned”?
To tune or, more precisely, train MedPaLM, its researchers used prompts. Prompts for large language models are pieces of text used to guide or direct the generation of new text by the model. They are intended to help the model produce output that is relevant to a particular task or topic and can take various forms, such as fill-in-the-blank statements, questions, or context-setting phrases.
So, for example:
"Complete the following sentence: The capital of France is ____."
"What is the best way to prepare a grilled cheese sandwich?"
"Write a paragraph describing the causes of the American Civil War."
"In this passage from a novel, a character is introduced: 'She was tall and thin, with piercing blue eyes and a serious expression on her face.' What is the character's name and what is she doing in this scene?"
Why do I need a healthcare-purposed language model to tell me the best way to prepare a grilled cheese sandwich?
You don’t. Which is exactly why MedPaLM’s developers employed a technique called Instruction Prompt Tuning (IPT). Instruction Prompt Tuning is a method of fine-tuning a large language model for a specific task by providing examples of desired input and output pairs in the form of prompts.
So rather than construct an LLM with general answers to general topics, MedPaLM’s team used guidelines and exemplars from a panel of qualified clinicians for each consumer medical question-answering dataset to prompt-tune Flan-PaLM (MedPaLM’s previous iteration) and form a hyper-healthcare-focused LLM: MedPaLM.
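As a rough illustration of what "instructions plus exemplars" can look like in practice, the sketch below assembles a prompt from an instruction, a few example question-answer pairs, and a new question. It covers only the text side of the idea. Any learned components of the actual technique are out of scope. The instruction wording, the exemplars, and the names `Exemplar` and `build_prompt` are invented placeholders, not material from the MedPaLM work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    """One example of a desired input (question) and output (answer)."""
    question: str
    answer: str


def build_prompt(instruction, exemplars, new_question):
    """Join the instruction, the worked examples, and the new question."""
    parts = [instruction.strip(), ""]
    for exemplar in exemplars:
        parts.append(f"Question: {exemplar.question}")
        parts.append(f"Answer: {exemplar.answer}")
        parts.append("")  # blank line between examples
    parts.append(f"Question: {new_question}")
    parts.append("Answer:")  # the model is expected to continue from here
    return "\n".join(parts)


# Hypothetical instruction and exemplars, written only for this illustration.
INSTRUCTION = (
    "You are answering consumer health questions. "
    "Be accurate, be clear, and say so when a clinician should be consulted."
)
EXEMPLARS = [
    Exemplar(
        question="Can antibiotics cure a common cold?",
        answer="No. Colds are caused by viruses, and antibiotics do not work against viruses.",
    ),
    Exemplar(
        question="Is it safe to stop a prescribed medication early?",
        answer="Check with the prescribing clinician before stopping any prescribed medication.",
    ),
]

prompt = build_prompt(INSTRUCTION, EXEMPLARS, "How much water should I drink per day?")
print(prompt)
```

The design point is that the examples show the model the tone and format of a good answer before it sees the new question, so the output is steered toward the medical task rather than toward grilled cheese recipes.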
So How’s MedPaLM Performing?
Well, that really depends on what you want it to do.
In terms of providing answers to consumer medical questions, 94.4% of MedPaLM’s answers were judged as directly addressing the user’s question intent, only 1.5 percentage points below a human clinician. 80.3% of its answers were found to be helpful, about 11 percentage points below a human clinician but 30% better than its predecessor Flan-PaLM.
Impressive? You bet.
But on the other measures tested, MedPaLM leaves much to be desired.
In 16.9% of its answers, MedPaLM retrieved incorrect information as opposed to 3.6% for human clinicians. Incorrect reasoning was seen in 10.1% of MedPaLM’s answers and in just 2.1% of clinician answers. Incorrect comprehension occurred in 18.7% of cases for MedPaLM and in 2.2% for clinicians.
What Does This Mean for Healthcare?
Trying to definitively predict how MedPaLM and other LLMs will be harnessed for clinical and medical purposes is a bit of a fool's game at this stage. As impressive as recent advancements in this field are, a single error could have disastrous results given the life-or-death stakes of medical care. One incorrect answer to a patient's query (e.g., which drug should be taken to address a specific condition) could lead to hospitalization and even death.
This issue is further exacerbated by the fact that large language models are essentially enormous black boxes: they lack explainability, the ability to pinpoint exactly why they gave a certain output.
Fixing an error starts with knowing how to find the problem, and as spectacularly capable as MedPaLM may be, that's just one of many things it can't do.
So, What Can LLMs be Used for in Healthcare?
- Summarizing and analyzing research papers: Approximately 7,287 articles (with an average of 5.16 minutes reading time) are published monthly in medical journals. To theoretically keep up with the literature, a physician would need to dedicate 627.5 hours per month, or about 29 hours per weekday, or 3.6 full-time equivalents of physician effort to the task. And though this isn't realistic in any scenario, LLMs are incredibly effective in summarizing, analyzing, and presenting key points from long-form text, drastically cutting down the time needed to stay up to date with the latest research.
- Improving clinical documentation: Large language models can be used to improve the quality of clinical documentation by identifying missing or incomplete information in medical records.
- Coding and billing: By generating appropriate codes for procedures and treatments, LLMs could help reduce errors and ensure clinicians are adequately compensated for their work.
- Population health management: Using LLMs, clinicians can identify trends and patterns in patient health and develop targeted interventions to improve population health outcomes.
- Clinical research: By analyzing large amounts of data and identifying patterns and trends that may not be easily traceable with traditional statistical methods, LLMs can be employed to facilitate clinical research at increased velocity and scale.
- Patient satisfaction analysis: LLMs can be leveraged to sift through and analyze patient feedback and sentiment through thousands of surveys and social media interactions to identify critical areas for improvement in patient satisfaction, access, and experience.
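The reading-load figures in the first bullet above can be reproduced with a few lines of arithmetic. The sketch below uses the article count and reading time quoted there. The number of weekdays per month and the hours in a full-time month are assumptions (260 weekdays and 52 forty-hour weeks per year), so the results match the quoted figures only approximately.

```python
# Figures quoted in the text.
ARTICLES_PER_MONTH = 7_287
MINUTES_PER_ARTICLE = 5.16

# Assumptions made for this sketch, not stated in the text.
WEEKDAYS_PER_MONTH = 260 / 12        # 260 weekdays per year
FTE_HOURS_PER_MONTH = 40 * 52 / 12   # 40-hour weeks, 52 weeks per year


def reading_load(articles, minutes_per_article):
    """Return (hours per month, hours per weekday, full-time equivalents)."""
    hours_per_month = articles * minutes_per_article / 60
    hours_per_weekday = hours_per_month / WEEKDAYS_PER_MONTH
    full_time_equivalents = hours_per_month / FTE_HOURS_PER_MONTH
    return hours_per_month, hours_per_weekday, full_time_equivalents


hours, per_weekday, fte = reading_load(ARTICLES_PER_MONTH, MINUTES_PER_ARTICLE)
print(f"Hours per month:   {hours:.1f}")
print(f"Hours per weekday: {per_weekday:.1f}")
print(f"Full-time equivalents: {fte:.1f}")
```

Under these assumptions the result is roughly 627 hours per month, about 29 hours per weekday, and about 3.6 full-time equivalents. The small gap from the quoted 627.5 hours is presumably due to rounding in the quoted inputs.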
What About Appointment Scheduling?
LLMs are only as good as the data they’ve been trained on.
ChatGPT, for example, despite its objectively impressive knowledge retrieval capabilities, was only trained on data dating up to 2021. Any event that took place after its cut-off date of September 2021 is entirely out of its scope.
So even though Boris Johnson resigned as Prime Minister of the United Kingdom in September 2022 and has since been succeeded by two prime ministers (Liz Truss and currently Rishi Sunak), ChatGPT is (self-admittedly) completely unaware.
But what if I train an LLM on a list of all of my health system’s physicians? Wouldn't that work?
It would. But only until the moment even the most minute change takes place. So, for instance, if just one of your physicians moves to a different clinic, adds another specialty to their care offerings, updates their working hours, or even resigns, your appointment-scheduling LLM will provide your patients with wrong and misleading information.
If even a single piece of data is no longer current, the model's answers can no longer be trusted, and the model is effectively obsolete.
The same can be said for any use case requiring real-time data to perform accurately, such as site search, prescription refills, and up-to-date FAQs.
Oh, so LLMs can really only help me with administrative tasks?
On their own, yes, which is nothing to sneeze at and will save healthcare practitioners innumerable hours otherwise spent tackling bureaucratic office work. But for any use cases that demand real-time, up-to-date data, LLMs can only serve as an ingredient of a much more adaptive and flexible conversational platform, not the entire dish.
In this in-depth analysis, Hyro’s CEO Israel Krush lays out the ingredients needed to mix LLMs with enterprise-level conversational AI platforms to create truly elastic patient interactions that are driven by continuously updated organizational knowledge (proprietary data) and dynamic business logic (ever-changing internal and external processes).
Big Hero 6
When asked in a recent Becker's Hospital Review interview what he thought of ChatGPT and other LLMs, Andy Chu, SVP of Product and Technology at Providence Digital Innovation Group, said the industry is still a "long way off from the days of Big Hero 6," a 2014 Disney film depicting a futuristic healthcare robot, but added that he envisions this technology being used to answer patient-related administrative, "decision-tree", or general health-education questions.
Chu's thoughts echo a broader sentiment shared by most digital health leaders that LLMs such as MedPaLM should be greeted with equal parts excitement and caution. While there's no question these extraordinary "machines" can automate and accelerate a vast spectrum of primarily administrative applications, there are still glaring red flags when it comes to medical and clinical use. Furthermore, LLMs' inability to adapt to changing data renders them ineffective for any goal that calls for current and evolving internal organizational knowledge.
Chu put it best by saying: "It's important to remember that healthcare is very personal, and generative AI technologies are as good as the data accessed. Data sources within healthcare are rightfully secure and protected, so we must be careful not to overhype the promise — especially when it comes to areas such as symptom detections, triage, and clinical treatment suggestions."