What We’re Building at Hyro
We just raised our Series A, so maybe this is a good time to talk a bit about Hyro’s tech?
Hyro builds voice/chat interfaces (we call it “adaptive communications”) for a variety of verticals and use cases, mostly revolving around knowledge retrieval and performing actions on behalf of users. You can think of these solutions as somewhat like Google Assistant (or Siri, Alexa, Cortana… depending on your identity politics).
Unlike these assistants, Hyro’s bot is focused on one business and it can interact on the phone as well as other channels such as text messaging.
For the tech audience I would also add that we have a lot to do with linguistics, declarative / functional programming and knowledge representation.
By now you might be asking, “So what? Isn’t that what a lot of companies are trying to do?” Well, yes, but while others focus on different flavours of this thing, our focus is first and foremost on scale. We want to build technology that commoditizes the creation of bots with sophisticated behaviours. A kind of assembly line for bots.
The problem: we feel the way bots have been built in the last few years won’t cut it. Things like intents and conversation trees are hard to scale and maintain. I’ll try to explain why, and what we’re doing instead.
As this might get a bit technical, I’ll start with some intuition. When you hire a call center representative, you don’t usually tell them how to talk on the phone. You don’t explain that it’s polite to say “Hi” and “Goodbye”, and you don’t really care about the order or types of sentences they should expect. Instead you give them some knowledge (“read this manual”) and maybe some abilities (“here’s a system you can operate”). They already speak English and know how to interact on the phone, so they don’t need anyone to teach them that. We think it’s possible to reach a similar situation with bots.
I’ll mainly talk about two areas — intent recognition and conversation management.
The typical way people do intent recognition is by taking a large language model (e.g. a large transformer model like the GPT family or Google’s BERT) and training it to recognize a set of intents. Intents are abstractions over sentences, a kind of equivalence class. For example “hi” and “good morning” might be instances of the “greeting” intent. In theory there can be dozens or even hundreds of intents in a conversation, each requiring a set of training examples. These days language models have evolved considerably (well, mainly grown in size, but also somewhat evolved), so this task is not too hard for them to tackle. However, deciding on the taxonomy of intents requires human expertise and a lot of tweaking based on what happens after launch. Furthermore, if something changes in the problem definition or users’ expectations, this taxonomy might change as well, at times requiring new tagging and training.
Another problem with intent modeling is that intents are flat. As any linguist will tell you (and as you might remember from high school), sentences behave like trees. They are characterized by nesting: a sentence might have a noun phrase and a verb phrase; inside the noun phrase there can be a noun and an adjective, and so on. This is very different from a concatenation of intents, and flat intent modeling will struggle to capture this structure.
So we decided to avoid intent recognition in our system altogether. Instead, we treat sentences in a more human-like fashion, by turning them into trees. We handle the parts individually and only then compose them together to create new meaning.
An example would be how we understand a sentence like “5 days after tomorrow”. We treat its components individually, mapping “5” to a number, “days” to a time unit, “after” to itself (a special preposition), and “tomorrow” to a date. We then apply rules to combine elements together, a kind of algebra of the language if you will: number + time unit = duration (in our case 5 * 24 hours). Similarly, duration + after + date = date, so we take tomorrow’s date and add a duration of 5 * 24 hours.
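To make the idea concrete, here is a minimal sketch of such a “language algebra” in Python. The token names, rules and structure here are illustrative assumptions, not Hyro’s actual code; the point is only that each token gets a meaning on its own, and generic rules combine adjacent meanings:

```python
from datetime import date, timedelta

def parse_token(token, today):
    """Map each token to a semantic value, independently of its neighbours."""
    if token.isdigit():
        return ("number", int(token))
    if token in ("day", "days"):
        return ("time_unit", timedelta(days=1))
    if token == "after":
        return ("preposition", "after")
    if token == "tomorrow":
        return ("date", today + timedelta(days=1))
    raise ValueError(f"unknown token: {token}")

def combine(values):
    """Repeatedly apply combination rules until one value remains."""
    values = list(values)
    changed = True
    while changed and len(values) > 1:
        changed = False
        for i in range(len(values) - 1):
            a, b = values[i], values[i + 1]
            # Rule: number + time unit = duration
            if a[0] == "number" and b[0] == "time_unit":
                values[i:i + 2] = [("duration", a[1] * b[1])]
                changed = True
                break
            # Rule: duration + "after" + date = date
            if (a[0] == "duration" and b[0] == "preposition"
                    and i + 2 < len(values) and values[i + 2][0] == "date"):
                values[i:i + 3] = [("date", values[i + 2][1] + a[1])]
                changed = True
                break
    return values[0]

today = date(2022, 1, 1)
kind, value = combine(parse_token(t, today) for t in "5 days after tomorrow".split())
print(kind, value)  # date 2022-01-07
```

Note that neither rule mentions dates specifically being about appointments, deliveries or anything else: the same rules apply in any vertical.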
Silly as this example is, it shows that these kinds of rules govern sentences regardless of the vertical or use case, which is a boon for reusability. We call this approach modeling the language, not the use case, and it’s not really that new, just a bit overshadowed by machine learning these days.
This approach relies on having a knowledge source, because even though the rules stay the same, each domain has its own entities. For this reason we invest a lot in knowledge representation and acquisition. We build knowledge graphs which serve as a pluggable resource for our bots. These knowledge graphs serve multiple purposes: they allow us to understand entities in a sentence, and when the user is seeking information, we can query that knowledge and construct a response from it. This is how we can understand and answer something like “I’m looking for a cardiologist that speaks Spanish.”
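As a toy illustration (the entity names and relations below are made up, and real knowledge graphs are of course far richer), answering that question boils down to intersecting constraints over the graph:

```python
# A knowledge graph as a set of (subject, relation, object) triples.
# All names here are hypothetical, purely for illustration.
triples = {
    ("dr_garcia", "specialty", "cardiology"),
    ("dr_garcia", "speaks", "spanish"),
    ("dr_smith", "specialty", "cardiology"),
    ("dr_smith", "speaks", "english"),
    ("dr_lopez", "specialty", "dermatology"),
    ("dr_lopez", "speaks", "spanish"),
}

def entities_with(relation, obj):
    """All subjects connected to `obj` via `relation`."""
    return {s for s, r, o in triples if r == relation and o == obj}

# Parsing the sentence yields two constraints; intersect them over the graph.
matches = entities_with("specialty", "cardiology") & entities_with("speaks", "spanish")
print(matches)  # {'dr_garcia'}
```

The same query machinery works for any domain, which is what makes the graph a pluggable resource.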
To ease the knowledge acquisition stage, we’ve developed ways to crawl databases and websites and turn them into knowledge graphs automatically. So when a website updates, the bot already has fresh knowledge. This is nice, because it’s one less thing for our clients to maintain.
Beyond the understanding part, other problems we tackle are conversation management and context handling. Today this is usually done with either:
- general code
- end to end machine learning
- conversation trees — usually through conversation design tools with states and intents being transitions between them (aka state machines)
General code is much too powerful and expressive to scale, and often results in “spaghetti code”, especially when working with languages that encourage mutability and state (like Python).
End-to-end machine learning suffers from problems similar to those of intent modeling: low flexibility, dependence on tagged data and a combinatorial explosion of paths.
Lastly, conversation design tools require highly verbose descriptions that capture only a small set of possibilities, and to us they are a maintenance nightmare.
To solve this problem we draw inspiration from the world of UI, which was revolutionized in the last 10 years by frameworks like AngularJS and React. These frameworks taught us how to think about state and effects. In a nutshell: immutability, composability, binding and a single direction of information flow. These are principles of the functional programming paradigm, which we found useful for the problem of building bots as well. We developed a framework for composing graphs of functions, and we use it to get sophisticated behaviours from simple behaviours. For example, if we need to build a bot that collects some details from the user in order to perform an action, we create a bunch of little bots, e.g. one for each detail, and compose them together.
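A minimal sketch of this idea (not Hyro’s actual framework, and far simpler than it; all names are invented for illustration): each little bot is a pure function from (state, message) to (new state, optional prompt), and composition asks each sub-bot in turn until every detail is collected:

```python
def slot_bot(slot_name, question):
    """A tiny bot whose only interest is filling one slot. State is never mutated."""
    def bot(state, message):
        if state.get(slot_name) is not None:
            return state, None  # already satisfied, nothing to say
        if state.get("asked") == slot_name and message:
            # We asked for this slot last turn, so the message answers it.
            return {**state, slot_name: message, "asked": None}, None
        return {**state, "asked": slot_name}, question
    return bot

def compose(*bots):
    """Run sub-bots in order; the first unsatisfied one produces the prompt.
    The `asked` marker ensures a message is consumed only by the bot that asked."""
    def composed(state, message):
        for bot in bots:
            state, prompt = bot(state, message)
            if prompt is not None:
                return state, prompt
        return state, "Thanks, I have everything I need."
    return composed

appointment_bot = compose(
    slot_bot("name", "What's your name?"),
    slot_bot("date", "What date works for you?"),
)

state, prompt = appointment_bot({}, None)
print(prompt)  # What's your name?
state, prompt = appointment_bot(state, "Alice")
print(prompt)  # What date works for you?
state, prompt = appointment_bot(state, "next Tuesday")
print(prompt)  # Thanks, I have everything I need.
```

Notice there is no conversation tree anywhere: the order of questions, re-asking after a missed answer, and termination all fall out of composing self-contained parts over immutable state.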
We think this approach is more scalable and reusable, because each part is self-sufficient and robust, which allows us to focus on the rules of the conversation instead of how it plays out turn by turn. Through abstractions we are able to take business logic, knowledge and abilities (stuff that our clients are experts in, and that has nothing to do with conversations) and translate them into bots. No training data required. This is how we are able to arrive at the sales pitch with a working demo, at times with the knowledge already baked in.
Our approach may sound unorthodox in times when AI software writes essays and draws artistic pictures all by itself! But although machine learning can perform miracles (given enough data), we can’t sacrifice composability for it; combinatorics will doom us if we do. Neural networks are great for many things, but are not (yet) easily composable. Aside from transfer learning, you can’t really take them apart and put them together to get expected behaviours. This means it’s hard to build reusable parts.
But with code we can do better. We can construct small bot particles that have their own memory and interests, and compose them together to get rich behaviours. When we do use ML, we use it in a way that doesn’t require training per client, and that fits in with all the other parts.
Consequently our system doesn’t have a single magical component. Instead it’s a collection of, for lack of a better word, idiotic things that become powerful only because they work together.
As we mature our system (admittedly it’s still rough around some edges) we are learning quite a bit about programming in general, and we hope to share some of these insights through open source. You are welcome to visit our GitHub repos (e.g. gamla or computation-graph) or reach out with questions.
Thanks for listening!