Mailgun API Outage: Post mortem August 2016
A brief overview of the API outage that occurred in August 2016.
This was originally posted on August 12, 2016.
On August 4th at 22:20 UTC, Mailgun was alerted to numerous reports of DNS resolution failures for our mailgun.net domain name, which is the primary domain name we use for our API. As we began investigating, we discovered that our domain had been placed in a “client hold” state by our domain registrar, Dynadot.
At approximately 22:47 UTC, Dynadot re-enabled the domain and we began seeing traffic increase to our api.mailgun.net domain. Traffic levels didn’t return to normal until 23:17 UTC because resolvers had cached negative DNS responses while the domain was on hold.
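As a rough illustration of why recovery lagged behind the re-enablement: under RFC 2308, a resolver may keep a cached negative answer for the lesser of the SOA record's TTL and the SOA MINIMUM field. The sketch below computes when one such cached answer would expire. The SOA values (900 and 86400 seconds) and the time the answer was cached are assumptions chosen for illustration, not figures from this incident, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def negative_cache_expiry(cached_at, soa_ttl, soa_minimum):
    """Return when a cached negative DNS answer expires.

    Per RFC 2308, the negative-caching TTL is the lesser of the SOA
    record's own TTL and the SOA MINIMUM field (both in seconds).
    """
    negative_ttl = min(soa_ttl, soa_minimum)
    return cached_at + timedelta(seconds=negative_ttl)


# Time the domain was re-enabled, taken from the timeline above.
reenabled = datetime(2016, 8, 4, 22, 47, tzinfo=timezone.utc)

# Assumption: a resolver cached a failure one minute before re-enablement.
cached_at = reenabled - timedelta(minutes=1)

# Assumption: illustrative SOA values, not measured during the incident.
expiry = negative_cache_expiry(cached_at, soa_ttl=900, soa_minimum=86400)
print(expiry.isoformat())
```

With these assumed values, that resolver would keep returning the failure for 14 minutes after the domain came back. Resolvers that cached the failure at different times, or that apply their own TTL limits, would recover on different schedules, which is consistent with traffic returning gradually rather than all at once.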
Once our team was alerted at 22:20 UTC, we began troubleshooting by confirming that our authoritative name servers were still responding properly to DNS requests. We confirmed that our backend DNS infrastructure was both functional and properly configured, then began reviewing our domain configuration with our registrar, Dynadot.
We found that our domain had been placed into a “client hold” state, which prevents normal DNS resolution. We had received no communication from Dynadot before it took this punitive action against our domain.
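A hold of this kind shows up in the domain's registration status: clientHold (set by the registrar) and serverHold (set by the registry) are the EPP status codes under which a domain's delegation is not published in DNS. A minimal monitoring sketch follows, assuming the status strings have already been fetched from WHOIS or RDAP by some other means. The function names are hypothetical, and the input is assumed to be a plain list of status values with no trailing URLs.

```python
# EPP statuses under which a domain's delegation is not published in DNS.
HOLD_STATUSES = {"clienthold", "serverhold"}


def normalize_status(status):
    """Lower-case a status and drop spaces and punctuation.

    This lets the WHOIS/EPP spelling ("clientHold") and the RDAP spelling
    ("client hold") compare equal.
    """
    return "".join(ch for ch in status.lower() if ch.isalnum())


def hold_statuses(statuses):
    """Return the statuses from the input that indicate a hold."""
    return [s for s in statuses if normalize_status(s) in HOLD_STATUSES]


# Hypothetical status list, not taken from the actual registration record.
found = hold_statuses(["clientTransferProhibited", "client hold"])
if found:
    print("Domain is on hold:", ", ".join(found))
```

Run on a schedule against the real registration record, a check like this could flag a hold directly instead of leaving it to be inferred from resolution failures.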
Our engineering team immediately attempted to reach Dynadot support through both their phone system and their live chat. For the entire duration of the incident, we were unable to reach their technical support team by phone; a pre-recorded message told callers that all support agents were busy and to try again later. We did reach an agent through their chat system, who informed us that they were unable to resolve the issue and that, for the block to be lifted, we needed to send an e-mail to their support team. The support agent insisted that this was the only means available for us to resolve the issue. During this time, we also requested a call from a manager and attempted to escalate, but were not offered any additional options.
At 22:37 UTC, we sent an e-mail to their support team, and our domain was re-enabled approximately ten minutes later at 22:47 UTC. Although the domain was re-enabled, we received no further communication from Dynadot until Friday at 00:12 UTC, when they stated that the domain had been disabled for spam complaints. They did not provide any information supporting this claim.
We requested that the issue be escalated to someone on their leadership team and that we receive details about the complaints. We have not received any additional details and, so far, have not spoken with anyone on their senior leadership team about this incident.
Actions and lessons learned
Our relationship with Dynadot has existed since 2010, which predates the acquisition of Mailgun by Rackspace. While we’ve historically had few issues managing our domains, this incident, and most importantly our inability to get a timely response from Dynadot, made it necessary to change service providers. As of today, we’ve completed the transition to Rackspace’s corporate domain registration service. They provide round-the-clock service, which will help ensure similar issues do not recur. Although we’ve moved our business elsewhere, we still welcome discussions with Dynadot about this issue, in the hope that their policies and procedures can be updated to avoid causing problems for other legitimate customers in the future.