November 26, 2020

GPT-3 vs. Existing Conversational AI Solutions

Guy Ling

Senior NLP Engineer at Hyro

Earlier this year, OpenAI, the artificial intelligence laboratory backed by Elon Musk, released its latest, much-anticipated autoregressive language model, the Generative Pre-trained Transformer 3 (GPT-3). It emerged to much fanfare, billed as the harbinger of a new age of artificial intelligence, and the number of articles, blog posts, and news pieces about it is perhaps matched only by the number of parameters GPT-3 learned: 175 billion (OK, this may be an exaggeration, but you get my point).

This blog post will not present "cool" conversations I had with GPT-3, nor will it review the countless (commendable) poems, scripts, and essays authored by this highly advanced robo-Hemingway. 

So what will I be covering?

  • What is GPT-3?
  • Its limitations 
  • Why GPT-3 will not replace existing, sophisticated conversational AI solutions

What Is GPT-3?

To understand what GPT-3 is, we must first explain what a neural network is. Neural networks (or artificial neural networks) are simplified mathematical models replicating the functionalities of neurons in the human brain.  

An actual neuron consists of (among other components) incoming dendrites, a cell body, and an outgoing axon, which correspond to inputs, an activation, and an output. 

Almost every significant milestone in artificial intelligence in the past decade, from computer vision to speech recognition and generation, machine translation, and text generation, can be attributed to artificial neural networks. 

An artificial neuron features incoming weighted inputs, a cell body activated when the inputs cross a certain threshold, and an output. To learn a new task, the artificial neural network is exposed to a vast number of examples. 
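The description above can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the weights, and the threshold are made-up values chosen for the example, not anything taken from GPT-3 or a real network.

```python
def artificial_neuron(inputs, weights, threshold):
    """Return 1 (fire) if the weighted sum of the inputs reaches the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0


# Hypothetical inputs and hand-picked weights, for illustration only.
fires = artificial_neuron([1.0, 0.5], [0.6, 0.4], threshold=0.7)        # weighted sum is about 0.8
stays_quiet = artificial_neuron([0.2, 0.5], [0.6, 0.4], threshold=0.7)  # weighted sum is about 0.32
```

The first call crosses the threshold and fires; the second does not.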

For instance, if a neural network is tasked with recognizing images of cats and dogs, we would need to expose it to a multitude of images in order to train it to ascertain correctly which of the images is of a dog and which is of a cat, continually updating the weights (parameters) until the desired output is produced.  
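To make "continually updating the weights" concrete, here is a toy sketch using the classic perceptron learning rule on a single artificial neuron. Everything in it is an assumption made for illustration: real image classifiers use many layers and gradient-based training, and the two-number "images" below are invented stand-ins for actual pictures.

```python
def predict(weights, bias, features):
    """Output 1 (dog) when the weighted sum plus bias is positive, else 0 (cat)."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else 0


def train(examples, learning_rate=0.1, max_epochs=1000):
    """Nudge the weights after every wrong answer until all examples are correct."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in examples:
            error = label - predict(weights, bias, features)
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
        if mistakes == 0:  # every training example is now classified correctly
            break
    return weights, bias


# Invented data: each "image" is reduced to two made-up features scaled to 0..1.
# Label 0 = cat, label 1 = dog.
TRAINING_EXAMPLES = [
    ((0.1, 0.2), 0), ((0.8, 0.9), 1), ((0.2, 0.1), 0),
    ((0.9, 0.7), 1), ((0.15, 0.3), 0), ((0.7, 0.8), 1),
]

weights, bias = train(TRAINING_EXAMPLES)
```

After training, `predict(weights, bias, features)` returns the right label for every training example, and also for nearby points it has never seen, which is the behavior we actually want from a network.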

Although these algorithms have catapulted us to new heights in artificial intelligence, they do have some crucial shortcomings. 

By and large, neural networks are colossal in size, which means that rather than learning anything, a network can simply use its weights to store the data it was shown. If we showed the network only a few examples of cats and dogs, it could hold them in its weights and always retrieve the correct answer for those examples. This can become a significant problem, because we want the network not only to memorize the data but also to generalize from it to new data, as humans do with ease.
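The difference between memorizing and generalizing can be caricatured with a "model" that is nothing more than a lookup table. This is a deliberately extreme sketch (real networks do not literally store a dictionary, and the class and feature values here are hypothetical), but it shows why perfect answers on the training data prove very little.

```python
class MemorizingModel:
    """A caricature of pure memorization: it stores every example verbatim."""

    def __init__(self):
        self.memory = {}

    def train(self, examples):
        for features, label in examples:
            self.memory[features] = label

    def predict(self, features):
        # Returns None for anything it has not seen before.
        return self.memory.get(features)


# Made-up examples: (features) -> label.
seen_examples = [((0.1, 0.2), "cat"), ((0.8, 0.9), "dog")]

model = MemorizingModel()
model.train(seen_examples)

answer_for_seen = model.predict((0.1, 0.2))     # an example from the training data
answer_for_unseen = model.predict((0.12, 0.2))  # a slightly different, unseen example
```

The model is flawless on what it has seen and has no answer at all for an input that differs only slightly.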

Neural Networks

The second term we must establish before diving into GPT-3 is the language model.

In short, a language model is a model trained to read a sentence in natural language and output a probability for each possible next word or character in the sentence.

Let's take a look at the example below. This language model is tasked with predicting the probability of each possible next word, in this sample: dog, mouse, squirrel, boy, and house. As simple as this task is for us humans, for a language model to complete the sentence correctly, it would need to undergo rigorous training and iteration.
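As a rough sketch of what "outputting probabilities" means, the snippet below turns a raw score for each candidate word into a probability using the softmax function, a standard way of doing this in neural language models. The scores are invented for illustration; a trained model would compute them from the sentence it has read so far.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(s - max_score) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


# Hypothetical raw scores for the five candidate next words.
candidate_scores = {
    "dog": 1.2, "mouse": 2.5, "squirrel": 1.0, "boy": 0.3, "house": -1.0,
}

probabilities = softmax(candidate_scores)
most_likely = max(probabilities, key=probabilities.get)
```

With these made-up scores, the word with the highest score also ends up with the highest probability, and the five probabilities add up to 1.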

Language Models

What makes GPT-3 so unique as a language model powered by neural networks is its sheer size. The chart below compares different language models by the number of parameters (roughly, weights) that they have learned.

OpenAI's GPT-3

As you can see, GPT-3 learned roughly ten times as many parameters as runner-up Turing NLG: 175 billion compared with 17 billion, or about 929% more. Conservative estimates place the cost of one training run of GPT-3 at $4.6 million.

The architecture of GPT-3 is mostly identical to that of its predecessor, GPT-2, but training this gargantuan model is an engineering feat for the history books. To say that OpenAI used the whole internet to train the model is a slight exaggeration, but not too far removed from reality. The data used to train GPT-3 comprises several corpora, including Common Crawl (a repository of web pages, filtered for quality), English-language Wikipedia, and several other collections of web text and books.

So what can GPT-3 do? Well, for one thing, it can answer, in easily understood natural language, an expansive array of questions on any topic while retaining the context of previous questions asked. Every single item in the snippet below was answered correctly by the language model, and it was able to make the connection between the individual referred to in one answer (Dwight D. Eisenhower was president of the United States in 1955) and the following question (He belonged to the Republican Party).

GPT-3 Knowledge Retrieval
Source: Kevin Lacker’s blog

GPT-3 is also pretty fantastic at unsupervised machine translation. This is quite astonishing given that 93% of the words in its training data were English; the rest (still 21 billion words, the length of roughly 100,000 copies of Crime and Punishment) covered all other languages together. And GPT-3 excels not only at natural languages but at programming languages as well. In the image below, GPT-3 does an excellent job of translating a line of code from Java to Python.

Source: Twitter @Yoavgos

GPT-3’s Limitations 

Despite the exorbitant resources and brainpower invested in GPT-3, it is not without its fatal flaws. Let's examine the Q&A session below. GPT-3 answers all questions correctly except for one: "Which is heavier, a toaster or a pencil?" to which it wrongly replied: "A pencil is heavier than a toaster." Now, this may seem like a small chink in GPT-3's armor, but in fact, it's incredibly revealing. If you recall, I previously stated that one of the critical shortcomings of any neural network is that rather than learning, generalizing, and inferring the same way humans do, it will often just memorize, storing the data in its weights.

In the case study below, we can deduce that GPT-3 was able to provide answers that were saved somewhere on the internet, but when faced with a question that, it would seem, has no answer online, GPT-3 provides an incorrect output.

GPT-3 may write like a human, but it cannot yet reason like one.

GPT-3 Inferred Knowledge
Source: Kevin Lacker's blog

Next, take a good look at the image below. GPT-3 gets every single answer wrong. But why? 

Source: Kevin Lacker's blog

Well, for starters, the independent United States did not exist until the year 1776. As humans, when we do not know the answer to a question or, in this example, a plausible answer does not exist, we can communicate that we do not know the answer. In the case of GPT-3 (and many other AI language models for that matter), it doesn't know that it doesn't know, and will still opt to retrieve an answer even if it’s incorrect. 

One could argue that Elizabeth I did indeed rule the United States in 1600, as she was the rightful monarch of what was then a British colony, but in no way was she the president of the United States. This again hints at the fact that GPT-3, with its massive cache of data, is able to memorize an amount of information that would put any encyclopedia to shame. Still, it cannot generalize or infer the way the average person can.

Why GPT-3 Will Not Replace Existing Conversational AI Solutions

A Question of Dynamics

Businesses are living, breathing organisms, forever developing, evolving, and renewing. If nothing else, 2020 has demonstrated quite exquisitely how quickly things can change in the 21st century and how rapidly new information becomes stale. GPT-3 was trained on data that was current up to October 2019; thus, it can name any dinosaur from the Mesozoic era, but it cannot tell you who the newly elected president of the United States is. 

To demonstrate how problematic this can become, let's use a real-world example. Weill Cornell Medicine, one of our first and most cherished clients, uses Hyro's conversational AI platform to empower its patients to easily find physicians by attributes such as location, insurance, and specialty, schedule appointments online, troubleshoot portal issues, and get the latest updates on COVID-19.

These variables change all the time. 

Physicians move practices and retire, accept or reject insurances, and add new skills to their arsenals. The information we have on COVID-19 changes by the day, if not by the minute; further studies are published, new policies and recommendations are released by the CDC, and testing locations open and close. In the context of healthcare, irrelevant and outdated information can often, as this pandemic has exemplified time after time, entail dire, life-threatening repercussions. 

Hyro's conversational AI solution on Weill Cornell Medicine's website

But even if we examine this issue through the less weighty lens of e-commerce, it's easy to see how this absence of dynamics can be bad for business. Let's use another real-world example of one of our other clients, a leading designer, marketer, and distributor of branded aftermarket wheels (currently under NDA). To remain competitive, this company has to retain its position on the bleeding edge of aftermarket wheels. One of its attractive selling points is its ever-changing, ever-current selection of wheels and insider knowledge of the newest trends and innovations in the field. Providing its clientele with "antique" information can unquestionably damage its reputation and bottom line. 

In stark contrast, existing conversational AI solutions, regularly updated either manually or automatically, provide users, whether concerned patients or discerning motorists, with fresh, relevant, and helpful answers to their queries. As a business continues to grow and expand, conversational AI solutions grow and scale with it, serving as the first point of contact for new and existing customers.

A Black Box

GPT-3, like most neural networks, is a black box, meaning that we humans (even GPT-3's own creators) can control the inputs (the data that goes in) and witness the outputs that come out but can't understand how variables are being combined to produce these outputs.

GPT-3 is so captivating because it can answer such a vast array of questions correctly, but it also gets quite a few of them wrong, as we've already seen. The crux is that when GPT-3 falls short, there is no way of debugging it or pinpointing the source of the error. It goes without saying that any customer-facing interface that cannot be iterated on and revised is neither sustainable nor scalable in a business environment.

This is another aspect in which existing conversational AI solutions are superior to GPT-3. Even bottom-shelf DIY chatbots allow their users to alter and improve their conversational flows as needed. In the case of more sophisticated conversational AI interfaces, users not only have a clear snapshot of the error but can track down, diagnose, and remedy the issue instantly. 

Prohibitive Pricing

Ultimately, there’s no such thing as a free lunch. While OpenAI may have launched a complimentary version of GPT-3 for a two-month private beta on July 11, October saw early adopters having to choose among a steep scale of pricing plans: four different offers based on a token system. That’s in line with OpenAI’s overall transformation from beloved non-profit to revenue-producing startup. One of the first researchers to gain access to the beta version, Gwern Branwen, shared the pricing tiers with the Reddit community.


While tokens might sound friendly, they actually represent a tricky pricing structure that could potentially break the bank. Let’s dig into tokens as a unit of measurement: a token is one of the smaller units, such as word fragments or characters, into which a sequence of text is split, and the tokens in both the prompt and the completion count toward usage. Robust NLP needs to be fed an alarmingly high number of these tokens. Take the GPT-3 model itself, whose training data totaled roughly 499 billion tokens in order to achieve its current quality threshold. According to Branwen, a GPT-3 enthusiast, “a 2 million token of the 2nd tier will correspond to 3000 pages of text. And to relate it better: Shakespeare’s entire work consists of ~900,000 words or 1.2 million tokens.”
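The figures in Branwen's quote imply a rule of thumb of about four tokens for every three words. The sketch below applies that ratio to convert between words and tokens. It is a back-of-the-envelope estimate only: the function names are hypothetical, the ratio comes solely from the Shakespeare figures quoted above, and real token counts depend on the tokenizer and the text.

```python
# Figures taken from the quote above: ~900,000 words correspond to ~1.2 million tokens.
REFERENCE_WORDS = 900_000
REFERENCE_TOKENS = 1_200_000


def estimate_tokens(word_count):
    """Estimate how many tokens a text of `word_count` words would use."""
    return word_count * REFERENCE_TOKENS // REFERENCE_WORDS


def estimate_words(token_count):
    """Estimate how many words fit into a budget of `token_count` tokens."""
    return token_count * REFERENCE_WORDS // REFERENCE_TOKENS


shakespeare_tokens = estimate_tokens(900_000)    # the reference point itself
words_in_two_million = estimate_words(2_000_000)  # a 2-million-token allowance
```

By this estimate, a 2-million-token allowance covers about 1.5 million words, which squares with the quoted 3,000 pages at roughly 500 words per page.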

Murat Ayfer, creator of a website that generates philosophical strings from user queries, claims that the token system will rack up an unaffordable bill for his business. With an average of 750,000 prompts per month that generate 400 million tokens in 2 or 3 weeks, Ayfer is on track to face charges of at least $4,000 per month. Keep in mind that his use case is not considered particularly extensive.

Outside of tech juggernauts and larger players, it’s hard to believe that startups and smaller entities will be able to offer the NLP experience that OpenAI originally promised under the current pricing structure. Within the natural language community, one fear is that OpenAI customers will be forced to offload the costs onto their users, mixing ads with conversational AI.

So What’s Next?

Allow me to reiterate: GPT-3 is a landmark achievement that will pave the road and serve as a stepping stone for the awe-inspiring language models of the future.

As it stands today, GPT-3 is not on track to replace the leading conversational interfaces on the market, but it can be utilized as a powerful tool to improve them. At Hyro, we employ different language models to fine-tune our knowledge graphs.  

At the end of the day, the "holy grail" of artificial intelligence is explainability and control. There may be a time in the not so distant future when language models can explain themselves and even argue or dissect their internal processes; but at present, businesses looking to provide their audiences with engaging, timely, and helpful conversational experiences will continue to rely on existing conversational AI solutions.