Mailgun post mortem September 2014

The full report on the connectivity issues that afflicted some Mailgun customers in September of 2014. Read more...



This was reported on October 2, 2014.

We want to provide you a full report on the connectivity issues that have afflicted some of our Mailgun customers these past few days.

The situation

Rackspace, our parent company and hosting provider, communicated with us on Friday, September 26th that it would be performing reboots on cloud instances in the Chicago data center (ORD) that could affect Mailgun’s infrastructure, between Sunday, September 28, 11:00 a.m. UTC, and Monday, September 29, 11:00 a.m. UTC. You can read Rackspace’s explanation of the reboots here:

Mailgun operates several environments in multiple data centers using the Rackspace hybrid cloud setup that employs load balancers from F5 and are used as the primary gateway for all incoming and outgoing connections in our data centers. In the event of outages or maintenances in these environments, Mailgun gracefully reroutes traffic from the affected environments using F5 virtual IP pools, without customer impact or DNS changes. This gives us more control over the traffic, allowing us to gradually increase/decrease the load balancing ratios on the virtual IPs and distribute traffic across environments.

The plan

To prevent customer impact as a result of the ORD cloud instance reboots, we prepared to switch the traffic from this environment and made all necessary changes to do so. On Saturday, September 27, at 6:50 p.m. UTC, we noticed that all database replicas as well as internal traffic flow between environments between all regions was broken and started reporting timeouts.

We were unable to determine the root cause of the timeouts before the cloud instance reboots in ORD began and consequently could not switch the traffic in time to avoid customer impact.

What actually happened

Cloud instance reboots in ORD began at 1:10 a.m. UTC on Monday, September 30, and resulted in the following problems between the hours of 1:10 a.m. UTC and 11:29 a.m. UTC:

  • Intermittent connection failures

  • Lost events between 2:30 a.m. UTC and 7:00 a.m. UTC

  • Approximately 100,000 duplicate emails sent

All messages reported as accepted during this outage have been delivered.

A portion of Mailgun customers continued to experience intermittent connection loss and failure rates through the afternoon on Tuesday, September 30.

We worked with the Rackspace enterprise networking team on the investigation and were finally able to identify the source of the problem.

The cloud maintenance triggered the error condition on Mailgun F5 servers that started reporting the Path MTU of the value 296 for the IPs on Mailgun networks and was cached by all the F5s of Rackspace dedicated customers using Mailgun.

The F5s of the Rackspace dedicated customers cached MTU of the 296 but the servers behind the F5s were only capable of sending packets with the minimum of the 512 MTU which triggered the packet loss as Mailgun’s F5 enforced the MTU of the smaller value.

The Rackspace networking team cleared the caches on Mailgun’s F5s and some Rackspace dedicated customers’ F5s, and set up the new virtual IP for Mailgun services. Mailgun changed DNS settings, which forced clearing of the caches on the remote F5s. This workaround solved the remaining issues for the Rackspace dedicated customers using Mailgun.

We continue to work with Rackspace to investigate the root cause of the problem. At this point, we can say that it is related to an unusually low path MTU value that was originally cached and enforced by Mailgun’s F5 load balancers.

Going forward

We take uptime seriously and apologize to all Mailgun customers that were affected by this issue. Once we have completed our root cause analysis on the F5 issue, we will take steps to ensure that when our cloud infrastructure undergoes maintenance, there will be no impact on Mailgun customers.

Related readings

What is single sign-on and how does it work?

Single sign-on is the “one ring to rule them all” of internet security. One ring to find them, one ring to bring them all, and bind the other rings through Identity and Access...

Read more

Security matters: Say hello to two-factor authentication (2FA)

It’s that kind of Monday: You show up to work, and you’ve been logged out of your Mailgun account. You input your password, and it prompts you to enter a verification code sent via...

Read more

Security success story: How Mailgun stopped a large-scale phishing attack

When you support companies around the world by facilitating the sending of billions of emails, you’ve got a pretty big target on your back. So, we expect threat actors to take...

Read more

Popular posts

Mailgun iconSee what you can accomplish with the world's best email delivery platform. It's easy to get started.Let's get sending
CTA icon Mailgun Icon