
Mailgun Post Mortem September 2014

Mailgun Team

This was reported on October 2, 2014.

We want to provide you with a full report on the connectivity issues that have affected some of our Mailgun customers over the past few days.

The Situation

Rackspace, our parent company and hosting provider, notified us on Friday, September 26, that it would be rebooting cloud instances in the Chicago data center (ORD) between Sunday, September 28, 11:00 a.m. UTC, and Monday, September 29, 11:00 a.m. UTC, and that the reboots could affect Mailgun’s infrastructure. You can read Rackspace’s explanation of the reboots here: http://www.rackspace.com/blog/an-apology/

Mailgun operates several environments in multiple data centers using the Rackspace hybrid cloud setup. Load balancers from F5 serve as the primary gateway for all incoming and outgoing connections in our data centers. In the event of outages or maintenance in these environments, Mailgun gracefully reroutes traffic away from the affected environments using F5 virtual IP pools, without customer impact or DNS changes. This gives us more control over the traffic, allowing us to gradually increase or decrease the load balancing ratios on the virtual IPs and distribute traffic across environments.
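To illustrate what gradually changing load balancing ratios means, here is a minimal Python sketch. It is not Mailgun's F5 configuration: the function names, the environment name `standby`, and the percentage steps are all hypothetical, and a real load balancer applies ratios per connection rather than through application code.

```python
import random


def drain_schedule(steps):
    """Return (draining_pct, standby_pct) pairs that move traffic
    off one environment in equal increments (hypothetical helper)."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    schedule = []
    for i in range(steps + 1):
        standby_pct = round(100 * i / steps)
        schedule.append((100 - standby_pct, standby_pct))
    return schedule


def pick_environment(ratios, rng):
    """Pick an environment name with probability proportional to its ratio."""
    names = list(ratios)
    weights = [ratios[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for ord_pct, standby_pct in drain_schedule(4):
        # "ord" is the environment being drained; "standby" is an assumed name.
        ratios = {"ord": ord_pct, "standby": standby_pct}
        sample = [pick_environment(ratios, rng) for _ in range(1000)]
        print(ratios, "-> share sent to ord:", sample.count("ord") / 1000)
```

Stepping the ratio rather than flipping it in one move lets operators watch error rates at each stage and stop if the receiving environment struggles.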

The Plan

To prevent customer impact from the ORD cloud instance reboots, we prepared to switch traffic away from this environment and made all the changes necessary to do so. On Saturday, September 27, at 6:50 p.m. UTC, we noticed that all database replicas, as well as internal traffic between environments across all regions, were broken and had started reporting timeouts.

We were unable to determine the root cause of the timeouts before the cloud instance reboots in ORD began and consequently could not switch the traffic in time to avoid customer impact.

What Actually Happened

Cloud instance reboots in ORD began at 1:10 a.m. UTC on Monday, September 29, and caused the following problems between 1:10 a.m. UTC and 11:29 a.m. UTC:

  • Intermittent connection failures

  • Lost events between 2:30 a.m. UTC and 7:00 a.m. UTC

  • Approximately 100,000 duplicate emails sent

All messages reported as accepted during this outage have been delivered.

A portion of Mailgun customers continued to experience intermittent connection loss and failure rates through the afternoon on Tuesday, September 30.

We worked with the Rackspace enterprise networking team on the investigation and were finally able to identify the source of the problem.

The cloud maintenance triggered an error condition on Mailgun’s F5 servers, which started reporting a path MTU of 296 for the IPs on Mailgun networks. That value was then cached by the F5s of all Rackspace dedicated customers using Mailgun.

Those customers’ F5s cached the MTU of 296, but the servers behind them could not send packets with an MTU below 512. Because Mailgun’s F5s enforced the smaller value, the result was packet loss.
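The mismatch can be modeled in a few lines of Python. This is a simplified sketch rather than a description of how the F5s behave internally: the function names are hypothetical, and it assumes that a packet larger than the enforced MTU is dropped instead of being fragmented. Only the values 296 and 512 come from this report.

```python
CACHED_PATH_MTU = 296  # path MTU reported by Mailgun's F5s (from this report)
SENDER_MIN_MTU = 512   # smallest MTU the customer servers could use (from this report)


def effective_send_mtu(cached_path_mtu, sender_min_mtu):
    """The sender tries to honor the cached path MTU but cannot go
    below its own minimum, so it uses the larger of the two values."""
    return max(cached_path_mtu, sender_min_mtu)


def packet_survives(packet_size, enforced_mtu):
    """Assumed model: a packet is dropped, not fragmented, when it
    exceeds the MTU enforced on the receiving side."""
    return packet_size <= enforced_mtu


if __name__ == "__main__":
    send_mtu = effective_send_mtu(CACHED_PATH_MTU, SENDER_MIN_MTU)
    for size in (200, 296, send_mtu):
        status = "delivered" if packet_survives(size, CACHED_PATH_MTU) else "dropped"
        print(f"packet of {size} bytes: {status}")
```

Under these assumptions, packets that fit within 296 bytes pass while full-size packets from the sender do not, so the same connection can succeed or fail depending on how much data it carries.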

The Rackspace networking team cleared the caches on Mailgun’s F5s and on some Rackspace dedicated customers’ F5s, and set up a new virtual IP for Mailgun services. Mailgun then changed its DNS settings, which forced the caches on the remote F5s to clear. This workaround solved the remaining issues for the Rackspace dedicated customers using Mailgun.

We continue to work with Rackspace to investigate the root cause of the problem. At this point, we can say that it is related to an unusually low path MTU value that was originally cached and enforced by Mailgun’s F5 load balancers.

Going Forward

We take uptime seriously and apologize to all Mailgun customers who were affected by this issue. Once we have completed our root cause analysis of the F5 issue, we will take steps to ensure that maintenance on our cloud infrastructure has no impact on Mailgun customers.

Tags: Security

Last updated on August 27, 2019
