Back to main menu

Product

Mailgun post mortem September 2014

The full report on the connectivity issues that afflicted some Mailgun customers in September of 2014. Read more...

PUBLISHED ON

PUBLISHED ON

This was reported on October 2, 2014.

We want to provide you a full report on the connectivity issues that have afflicted some of our Mailgun customers these past few days.

The situation

Rackspace, our parent company and hosting provider, communicated with us on Friday, September 26th that it would be performing reboots on cloud instances in the Chicago data center (ORD) that could affect Mailgun’s infrastructure, between Sunday, September 28, 11:00 a.m. UTC, and Monday, September 29, 11:00 a.m. UTC. You can read Rackspace’s explanation of the reboots here: http://www.rackspace.com/blog/an-apology/

Mailgun operates several environments in multiple data centers using the Rackspace hybrid cloud setup that employs load balancers from F5 and are used as the primary gateway for all incoming and outgoing connections in our data centers. In the event of outages or maintenances in these environments, Mailgun gracefully reroutes traffic from the affected environments using F5 virtual IP pools, without customer impact or DNS changes. This gives us more control over the traffic, allowing us to gradually increase/decrease the load balancing ratios on the virtual IPs and distribute traffic across environments.

The plan

To prevent customer impact as a result of the ORD cloud instance reboots, we prepared to switch the traffic from this environment and made all necessary changes to do so. On Saturday, September 27, at 6:50 p.m. UTC, we noticed that all database replicas as well as internal traffic flow between environments between all regions was broken and started reporting timeouts.

We were unable to determine the root cause of the timeouts before the cloud instance reboots in ORD began and consequently could not switch the traffic in time to avoid customer impact.

What actually happened

Cloud instance reboots in ORD began at 1:10 a.m. UTC on Monday, September 30, and resulted in the following problems between the hours of 1:10 a.m. UTC and 11:29 a.m. UTC:

  • Intermittent connection failures

  • Lost events between 2:30 a.m. UTC and 7:00 a.m. UTC

  • Approximately 100,000 duplicate emails sent

All messages reported as accepted during this outage have been delivered.

A portion of Mailgun customers continued to experience intermittent connection loss and failure rates through the afternoon on Tuesday, September 30.

We worked with the Rackspace enterprise networking team on the investigation and were finally able to identify the source of the problem.

The cloud maintenance triggered the error condition on Mailgun F5 servers that started reporting the Path MTU of the value 296 for the IPs on Mailgun networks and was cached by all the F5s of Rackspace dedicated customers using Mailgun.

The F5s of the Rackspace dedicated customers cached MTU of the 296 but the servers behind the F5s were only capable of sending packets with the minimum of the 512 MTU which triggered the packet loss as Mailgun’s F5 enforced the MTU of the smaller value.

The Rackspace networking team cleared the caches on Mailgun’s F5s and some Rackspace dedicated customers’ F5s, and set up the new virtual IP for Mailgun services. Mailgun changed DNS settings, which forced clearing of the caches on the remote F5s. This workaround solved the remaining issues for the Rackspace dedicated customers using Mailgun.

We continue to work with Rackspace to investigate the root cause of the problem. At this point, we can say that it is related to an unusually low path MTU value that was originally cached and enforced by Mailgun’s F5 load balancers.

Going forward

We take uptime seriously and apologize to all Mailgun customers that were affected by this issue. Once we have completed our root cause analysis on the F5 issue, we will take steps to ensure that when our cloud infrastructure undergoes maintenance, there will be no impact on Mailgun customers.

Related readings

BFCM 2024: How email and omnichannel communications drove holiday success

Black Friday and Cyber Monday represent peak email sending periods that can make or break retailers' holiday success. During November...

Read More

How to improve your email deliverability for the future of email

If your customers aren’t getting your emails, then there’s a good chance that your email program needs some refreshing with these email deliverability tips taken from Email Camp: MessageMania speaker and industry pro, Laura Atkins.

Read More

Bulk validations: Accurate and fast AF

Cleaning an old email list is one of those must needed chores that nobody really wants to do...

Read More

Popular posts

Email inbox.

Email

5 min

Build Laravel 11 email authentication with Mailgun and Digital Ocean

Read More

Mailgun statistics.

Product

4 min

Sending email using the Mailgun PHP API

Read More

Statistics on deliverability.

Deliverability

5 min

Here’s everything you need to know about DNS blocklists

Read More

See what you can accomplish with the world's best email delivery platform. It's easy to get started.Let's get sending
CTA icon