Executive Summary
On June 3rd, 2022, at 11:55 AM EST, some of Hyro’s virtual assistants started to fail. At 01:03 PM EST, our monitoring system showed that all of Hyro’s virtual assistants, along with certain services, were unavailable. Four years earlier, when Hyro was still named Airbud, the company bought the domain http://airbud.io.
The domain was purchased by Hyro’s CEO on June 3rd, 2018, and on June 3rd, 2022, our auto-renewal mechanisms failed to renew it. Once the domain expired, our production systems started to fail because the domain was unavailable. Once we identified this, we renewed the domain and restarted the DNS servers so that the update would propagate across the globe. As a mitigation step, we connected our call center assistants to another domain, http://hyro.ai, which allowed them to recover fully.
Our call center assistants recovered at 01:48 PM EST.
The incident was fully resolved at 02:25 PM EST.
At Hyro, we take the reliability of our services very seriously. We apologize for this unexpected disruption in our services.
We conducted a post-mortem early this week and have begun work on action items designed to make us more resilient to this type of event in the future. The remainder of this post provides details of our findings and plans to mitigate future incidents.
Lead-Up
Hyro’s domain name for its production environment, airbud.io, expired, causing Hyro’s systems to become inaccessible.
What Is DNS (Domain Name System)?
All devices with internet connectivity, from smartphones to laptops to the servers that provide content for massive retail websites, find and communicate with one another by using IP addresses. DNS translates human-readable domain names into those IP addresses. This ensures that when you visit a website, you don't have to remember and enter a long set of digits. Instead, you can enter a domain name like http://example.com and still end up at the desired destination.
It is important to note that DNS relies on caching mechanisms. This means that if you visit your domain on your computer, update the DNS records to point the domain to a new website with a different IP address, and then visit the domain again on the same computer, you are likely to see the old website because of this cache.
So even though we renewed our domain relatively quickly, it still took time for caches across the network to expire and for resolvers to retrieve the updated records.
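The caching behavior described above can be illustrated with a toy resolver. This is only a sketch under stated assumptions, not a description of any real resolver or of Hyro's infrastructure: the `CachingResolver` class, the one-hour TTL, and the IP address (taken from a range reserved for documentation) are all hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple


class CachingResolver:
    """Toy resolver that caches every answer (including failures) for a fixed TTL."""

    def __init__(self, upstream: Callable[[str], Optional[str]], ttl_seconds: int) -> None:
        self._upstream = upstream
        self._ttl = ttl_seconds
        # Maps a domain name to (answer, time at which the cached answer expires).
        self._cache: Dict[str, Tuple[Optional[str], int]] = {}

    def resolve(self, name: str, now: int) -> Optional[str]:
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]  # Still fresh: serve from cache without asking upstream.
        answer = self._upstream(name)
        self._cache[name] = (answer, now + self._ttl)
        return answer


# Hypothetical authoritative data: None stands for "domain expired, no record".
records: Dict[str, Optional[str]] = {"airbud.io": None}
resolver = CachingResolver(lambda name: records.get(name), ttl_seconds=3600)

before = resolver.resolve("airbud.io", now=0)     # Failure is looked up and cached.
records["airbud.io"] = "203.0.113.10"             # The domain is renewed upstream.
during = resolver.resolve("airbud.io", now=1800)  # Cache is still fresh: stale failure.
after = resolver.resolve("airbud.io", now=3600)   # TTL elapsed: fresh lookup succeeds.
```

Each resolver on the network runs its own clock for each cached entry, which is consistent with assistants failing, and later recovering, at different times rather than all at once.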
Fault
All of Hyro’s domains are managed in one main account. Domains in this account are set to auto-renew, and the CIO is notified a week before they expire.
The http://airbud.io domain was bought using Hyro’s CEO’s personal account before the company was incorporated, and as a result, a notification wasn’t sent to anyone at Hyro.
Detection
Hyro’s on-call engineer noticed a large number of connection errors at 12:02 PM, and an incident was opened at 12:14 PM.
Due to the nature of the incident (a DNS failure), not all assistants stopped working at once, and it took time for the issue to gradually appear across all deployed assistants. At 12:49 PM, one of Hyro’s customer success managers noticed that the chat assistants weren’t showing up on customers' websites. At 1:03 PM, the team handling the incident concluded that both issues had the same origin and identified the root cause (the http://airbud.io domain had expired). Because the team believed an auto-renew mechanism was in place, it took some time to identify the expired domain.
Mitigation and Resolution
As a first step, we updated our status page, hyro.status.ai, to publicly report the incident.
When we determined that the incident had a much larger impact than we originally thought, the status page was updated to alert that all of Hyro’s virtual assistants were unavailable.
In parallel, once the root cause was identified, the CIO called the domain provider and renewed the domain.
While the DNS servers were updating, another domain, http://hyro.ai, which had not expired, was enabled, and the call center assistants' endpoints were updated to use it. This brought all the call center assistants back online. Thirty minutes to an hour later, region by region, Hyro’s web-based assistants came back online.
It is important to note that, for a period of time, some web-based assistants appeared on our customers' websites while our APIs were still not operational (the DNS update had propagated for one service but not for the other), which caused our web-based assistants to automatically respond with a general error message.
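During this incident the switch to the second domain was made by hand. As an illustration only, the same idea can be sketched as a client-side fallback that picks the first hostname that still resolves. The hostnames `api.airbud.io` and `api.hyro.ai` and both function names are hypothetical and do not describe Hyro's actual endpoints or tooling.

```python
import socket
from typing import Callable, Iterable, Optional


def resolves(host: str) -> bool:
    """Return True if the system resolver can find an address for host."""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except socket.gaierror:
        return False


def pick_endpoint(
    candidates: Iterable[str],
    is_resolvable: Callable[[str], bool] = resolves,
) -> Optional[str]:
    """Return the first candidate hostname that resolves, or None if none do."""
    for host in candidates:
        if is_resolvable(host):
            return host
    return None


# Stubbed resolver for illustration: pretend only names under hyro.ai still resolve.
chosen = pick_endpoint(
    ["api.airbud.io", "api.hyro.ai"],
    is_resolvable=lambda host: host.endswith("hyro.ai"),
)
```

Passing the resolver in as a parameter keeps the selection logic testable without network access. A real deployment would also need to confirm that the fallback host serves the same API.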
What Went Well
1. Recently, our engineering team upgraded our monitoring capabilities and incident management tools and protocols. We were happy to see that our new on-call system allowed us to act quickly. Our new incident management procedure drove us to assemble the team swiftly, activate our new tools, and involve the engineering management team (VP Engineering, Chief Architect, and CIO) in a matter of minutes.
2. Once the severity of the incident was understood, the R&D management team handled it quickly and resolved it thoroughly.
Action Items
1. We will give two additional people from the engineering management team access to our domain vendor and ensure they receive reminders about domain expiry, especially for domains that cannot be auto-renewed.
2. We will add an alert that continuously checks the validity of our domains, as well as of our certificates (HTTPS).
3. We will make sure all domains are managed in one location, and that all are set to auto-renew.
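The expiry alert described in the action items could take roughly the following shape. This is a minimal sketch, not the monitoring Hyro actually deployed: the function names, the 30-day threshold, and every date other than the June 3rd, 2022 expiry of airbud.io are assumptions made for illustration. Domain expiry dates would in practice come from the registrar; this sketch takes them as given.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def cert_expiry(not_after: str) -> datetime:
    """Parse a certificate 'notAfter' string such as 'Jun 10 12:00:00 2022 GMT'."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


def fetch_cert_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the 'notAfter' field of a live certificate (requires network access)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def expiring_soon(
    expiries: Dict[str, datetime], now: datetime, threshold_days: int = 30
) -> List[str]:
    """Return the names whose expiry falls within threshold_days of now, sorted."""
    limit = timedelta(days=threshold_days)
    return sorted(name for name, expiry in expiries.items() if expiry - now <= limit)


# Hypothetical inventory, evaluated two weeks before the airbud.io expiry date.
now = datetime(2022, 5, 20, tzinfo=timezone.utc)
expiries = {
    "airbud.io (domain)": datetime(2022, 6, 3, tzinfo=timezone.utc),
    "hyro.ai (domain)": datetime(2023, 1, 15, tzinfo=timezone.utc),  # assumed date
    "hyro.ai (certificate)": cert_expiry("Jun 10 12:00:00 2022 GMT"),  # assumed date
}
alerts = expiring_soon(expiries, now)
```

Run on a schedule, a check like this would flag a domain weeks before it lapses, regardless of which account it was purchased under.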
Conclusion
This incident has been a less-than-pleasant experience for our customers and does not reflect the standards to which we hold ourselves here at Hyro. We are fully aware of our responsibility to you, our partners in conversational excellence, as you trust us to consistently power meaningful conversations with your customers, 24/7. We are committed to learning from this experience in order to continue delivering value-driven improvements to our services.