Mailgun post mortem May 2016

A review of the incidents that impacted the availability of Mailgun's service in 2016.

This was originally reported on May 31, 2016.

Mailgun recently experienced several separate incidents that impacted the availability of our services. We’d like to take this opportunity to give our customers visibility into the root cause of each incident, along with details on what we’ve done to address them and prevent them from recurring.

Recent issues

Distributed Denial of Service (DDoS) Attacks

Mailgun is frequently the target of large and varied distributed denial of service (DDoS) attacks. While many attacks are blocked with minimal disruption, we’ve experienced several incidents with a prolonged impact on our services. In particular, an attack on May 23rd targeted portions of our primary data center. The attack method was unusual enough that our hosting provider needed nearly an hour to identify it and deploy the mitigations that allowed us to fully restore service.

API/SMTP timeouts or “SSL handshake” errors

Beginning earlier in the month, we observed a steady increase in customers experiencing abnormal timeouts when using the Mailgun API/SMTP service. As the number of reports regarding this issue increased, it became clear that there was a systemic issue with our service.

We maintain several systems for monitoring throughput and latency in our service, but our own data did not match what customers were experiencing. After completing the investigation of our application, we escalated the issue to our hosting provider so they could inspect our network infrastructure and determine whether the problem lay in our managed load balancer or another network device.

The initial troubleshooting sessions were inconclusive as the issue itself was intermittent and difficult to reproduce. After several attempts, we were finally able to verify that the requests that were timing out were not reaching our edge networking device, leading us to investigate upstream devices in the network.

We started by inspecting the device responsible for protecting our infrastructure from DDoS attacks. When this device was disabled, we could no longer reproduce the timeouts, so we immediately began investigating the possible causes. After analysis with our hosting provider’s DDoS team, we discovered that our DDoS mitigation system was impacting legitimate traffic. After adjusting our countermeasures, we were able to eliminate these errors.
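An intermittent failure like this is easier to localize when the TCP connect and the TLS handshake are timed separately from the client side. The following is a minimal sketch of such a probe, not the tooling we used: `probe_tls`, `ProbeResult`, and the stage names are hypothetical and chosen only for illustration.

```python
import socket
import ssl
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    ok: bool
    failed_stage: Optional[str]        # "tcp_connect", "tls_handshake", or None
    connect_seconds: Optional[float]
    handshake_seconds: Optional[float]


def probe_tls(host: str, port: int, timeout: float = 5.0,
              context: Optional[ssl.SSLContext] = None) -> ProbeResult:
    """Time the TCP connect and the TLS handshake as two separate stages."""
    context = context or ssl.create_default_context()
    start = time.monotonic()
    try:
        # Stage 1: plain TCP connection.
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return ProbeResult(False, "tcp_connect", None, None)
    connected = time.monotonic()
    try:
        with raw:
            # Stage 2: the handshake runs inside wrap_socket on a connected socket.
            with context.wrap_socket(raw, server_hostname=host):
                done = time.monotonic()
    except OSError:  # covers ssl.SSLError and handshake timeouts
        return ProbeResult(False, "tls_handshake", connected - start, None)
    return ProbeResult(True, None, connected - start, done - connected)
```

Run periodically from outside the data center, a probe of this shape records which stage failed for each attempt, which can then be compared against what the edge devices logged.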

Intermittent 421 errors

Mailgun returns a 421 error when we are unable to successfully queue a message. This error notifies the user that the message was not received by Mailgun and should be re-attempted later. This is a normal part of SMTP and signals the sender to retry the message after a delay.
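On the sending side, this retry behavior can be sketched as a small wrapper with exponential backoff. This is an illustrative example under stated assumptions, not a Mailgun client: `send_with_retry`, its parameters, and the delay schedule are hypothetical, and `send` stands for any callable that performs one SMTP delivery attempt with Python's standard `smtplib`.

```python
import smtplib
import time
from typing import Callable


def send_with_retry(send: Callable[[], None],
                    max_attempts: int = 5,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send(); on an SMTP 421 reply, wait and try again.

    Returns the number of attempts used. Any other SMTP error, or a 421 on
    the final attempt, is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except smtplib.SMTPResponseException as exc:
            # Only 421 means "not accepted, try again later".
            if exc.smtp_code != 421 or attempt == max_attempts:
                raise
            # Assumed schedule: 1s, 2s, 4s, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Note that `smtplib` reports recipient-level rejections through `SMTPRecipientsRefused`, which carries per-recipient codes rather than a single `smtp_code`, so a production wrapper would need to inspect that case separately.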

Last week, we started to see elevated levels of 421s being returned. These errors were caused by performance degradation in our Cassandra clusters, which is where we persist messages for storage.

The Cassandra performance issues were caused by a compaction bug in the version of Cassandra we were running. The bug caused compactions to stall and produced disk I/O spikes, which reduced overall Cassandra throughput. While the cluster was in this condition, we were intermittently unable to store messages, resulting in the 421 errors.

Corrective actions

  1. Alerting – While in many cases our DDoS mitigation system does not cause disruption, we’ve learned that it’s important for our Mailgun engineering team to know when the system is activated. Having this data allows us to determine more effectively whether issues are being caused by these protections. We’ve already worked with our hosting provider to deploy an alerting system that notifies our engineers when these protections activate.

  2. Mitigation Profiles – We are tuning our DDoS mitigation profiles to improve our defensive posture while minimizing the impact to legitimate traffic. We’re working with our hosting provider to develop these profiles and expect this work to be completed this week.

  3. Cassandra Upgrade – We’ve started rolling upgrades of our Cassandra clusters to a version that is not affected by the compaction bug, along with configuration adjustments that are better suited to this type of workload.

  4. Infrastructure Design – We’re in the process of redesigning the underlying Mailgun infrastructure. This effort will give us a more robust network and deployment structure that will reduce the impact of similar types of attacks. This effort is underway and we will be sharing more details in the future.
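To illustrate the correlation described in the Alerting item, here is a minimal sketch that measures what share of client-facing errors occurred while DDoS protections were active. The function name, the data shapes (timestamps and start/end windows), and the `grace` parameter are assumptions made for the example and do not describe our actual alerting system.

```python
from datetime import datetime, timedelta
from typing import Iterable, Sequence, Tuple

# A mitigation window is a (start, end) pair of timestamps.
Window = Tuple[datetime, datetime]


def share_during_mitigation(error_times: Iterable[datetime],
                            windows: Sequence[Window],
                            grace: timedelta = timedelta(0)) -> float:
    """Return the fraction of errors that fall inside any mitigation window.

    `grace` extends each window's end, to count errors that surface shortly
    after the protections deactivate.
    """
    errors = list(error_times)
    if not errors:
        return 0.0
    hits = sum(
        1 for ts in errors
        if any(start <= ts <= end + grace for start, end in windows)
    )
    return hits / len(errors)
```

A share well above the fraction of time that protections were active would suggest the mitigations are affecting legitimate traffic and warrant a closer look.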

Finally, while we know these types of incidents are challenging, the Mailgun team is committed to focusing on the plan above and any other steps necessary to ensure that you can continue relying on Mailgun for your email delivery.
