Mailgun post mortem September 2014

The full report on the connectivity issues that afflicted some Mailgun customers in September of 2014.

PUBLISHED ON

This was reported on October 2, 2014.

We want to provide you with a full report on the connectivity issues that have afflicted some of our Mailgun customers these past few days.

The situation

Rackspace, our parent company and hosting provider, informed us on Friday, September 26, that it would be rebooting cloud instances in the Chicago data center (ORD) between Sunday, September 28, 11:00 a.m. UTC, and Monday, September 29, 11:00 a.m. UTC, and that these reboots could affect Mailgun’s infrastructure. You can read Rackspace’s explanation of the reboots here: http://www.rackspace.com/blog/an-apology/

Mailgun operates several environments in multiple data centers using the Rackspace hybrid cloud setup. Load balancers from F5 serve as the primary gateway for all incoming and outgoing connections in our data centers. In the event of an outage or maintenance in one of these environments, Mailgun gracefully reroutes traffic away from the affected environment using F5 virtual IP pools, without customer impact or DNS changes. This gives us more control over the traffic, allowing us to gradually increase or decrease the load balancing ratios on the virtual IPs and distribute traffic across environments.
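To illustrate the idea of gradually shifting load balancing ratios, here is a minimal sketch of the arithmetic only. It is not F5 configuration and not Mailgun's actual tooling. The function name `drain_schedule`, the environment names `dfw` and `iad`, and the starting ratios are all hypothetical, chosen just for the example.

```python
from fractions import Fraction


def drain_schedule(ratios, draining, steps):
    """Return a list of per-step ratios that move traffic off `draining`.

    `ratios` maps environment name -> integer weight. The draining
    environment's weight falls linearly to zero over `steps` steps, and the
    traffic it gives up is shared among the other environments in proportion
    to their original weights. Purely illustrative; hypothetical names.
    """
    if draining not in ratios:
        raise KeyError(draining)
    if steps < 1:
        raise ValueError("steps must be at least 1")

    others = {name: w for name, w in ratios.items() if name != draining}
    others_total = sum(others.values())
    if others_total == 0:
        raise ValueError("no other environment can absorb the traffic")

    schedule = []
    for i in range(1, steps + 1):
        # Weight still served by the draining environment after step i.
        remaining = Fraction(ratios[draining]) * (steps - i) / steps
        moved = ratios[draining] - remaining
        step = {draining: remaining}
        for name, weight in others.items():
            # Each remaining environment absorbs a proportional share.
            step[name] = weight + moved * Fraction(weight, others_total)
        schedule.append(step)
    return schedule


if __name__ == "__main__":
    # Hypothetical starting ratios; only "ord" appears in the report.
    for step in drain_schedule({"ord": 50, "dfw": 30, "iad": 20}, "ord", 5):
        print({name: float(weight) for name, weight in step.items()})
```

Exact fractions are used so that the ratios at every step add up to the original total, which makes the schedule easy to check before applying anything like it to real traffic.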

The plan

To prevent customer impact from the ORD cloud instance reboots, we prepared to switch traffic away from this environment and made all the necessary changes to do so. On Saturday, September 27, at 6:50 p.m. UTC, we noticed that all database replicas, as well as internal traffic flows between environments in all regions, were broken and had started reporting timeouts.

We were unable to determine the root cause of the timeouts before the cloud instance reboots in ORD began and consequently could not switch the traffic in time to avoid customer impact.

What actually happened

Cloud instance reboots in ORD began at 1:10 a.m. UTC on Monday, September 29, and resulted in the following problems between 1:10 a.m. UTC and 11:29 a.m. UTC:

  • Intermittent connection failures

  • Lost events between 2:30 a.m. UTC and 7:00 a.m. UTC

  • Approximately 100,000 duplicate emails sent

All messages reported as accepted during this outage have been delivered.

A portion of Mailgun customers continued to experience intermittent connection loss and failure rates through the afternoon on Tuesday, September 30.

We worked with the Rackspace enterprise networking team on the investigation and were finally able to identify the source of the problem.

The cloud maintenance triggered an error condition on Mailgun’s F5 load balancers, which started reporting a path MTU of 296 for the IPs on Mailgun networks. This value was then cached by the F5s of all Rackspace dedicated customers using Mailgun.

The F5s of the Rackspace dedicated customers cached the MTU of 296, but the servers behind those F5s could not send packets with an MTU below 512. Because Mailgun’s F5s enforced the smaller value, those packets were lost.
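The mismatch described above can be sketched as a toy model. This is a deliberate simplification and not a description of how F5 devices work internally: it ignores fragmentation and ICMP, and simply treats a packet as lost when it is larger than the MTU enforced on the path. The function names and the healthy value of 1500 (the usual Ethernet MTU) are assumptions for illustration; 296 and 512 come from the report.

```python
ASSUMED_HEALTHY_MTU = 1500  # assumption: typical Ethernet MTU, not from the report


def effective_sender_mtu(cached_path_mtu, sender_min_mtu):
    """Packet size the sender ends up using.

    The sender tries to honor the cached path MTU but cannot go below its
    own minimum, so it uses whichever of the two is larger.
    """
    return max(cached_path_mtu, sender_min_mtu)


def is_delivered(packet_size, enforced_mtu):
    """Toy rule: a packet gets through only if it fits the enforced MTU."""
    return packet_size <= enforced_mtu


def simulate(cached_path_mtu, enforced_mtu, sender_min_mtu=512):
    """Return (packet size used by the sender, whether it is delivered)."""
    size = effective_sender_mtu(cached_path_mtu, sender_min_mtu)
    return size, is_delivered(size, enforced_mtu)


if __name__ == "__main__":
    # Incident: 296 is cached remotely and enforced by Mailgun's F5s.
    print("incident:", simulate(cached_path_mtu=296, enforced_mtu=296))
    # After the caches are cleared (assumed healthy value on both sides).
    print("after fix:", simulate(ASSUMED_HEALTHY_MTU, ASSUMED_HEALTHY_MTU))
```

In this model the incident case yields a 512-byte packet against an enforced limit of 296, so it is dropped, while the post-fix case passes. That is consistent with why clearing the cached value resolved the packet loss.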

The Rackspace networking team cleared the caches on Mailgun’s F5s and on some Rackspace dedicated customers’ F5s, and set up a new virtual IP for Mailgun services. Mailgun then changed its DNS settings, which forced the caches on the remaining remote F5s to clear. This workaround solved the remaining issues for the Rackspace dedicated customers using Mailgun.

We continue to work with Rackspace to investigate the root cause of the problem. At this point, we can say that it is related to an unusually low path MTU value that was originally cached and enforced by Mailgun’s F5 load balancers.

Going forward

We take uptime seriously and apologize to all Mailgun customers who were affected by this issue. Once we have completed our root cause analysis of the F5 issue, we will take steps to ensure that maintenance on our cloud infrastructure has no impact on Mailgun customers.
