Mailgun Post Mortem May 2016

July 16, 2019

This was originally reported on May 31, 2016.


Mailgun recently experienced several separate incidents that impacted the availability of our services. We’d like to take this opportunity to give our customers visibility into the root cause of each incident, along with details on what we’ve done to address them and to keep them from recurring.

Recent issues

Distributed Denial of Service (DDoS) Attacks

Mailgun is frequently the target of large and varied distributed denial of service (DDoS) attacks. While many attacks are blocked with minimal disruption, we’ve experienced several incidents with a prolonged impact on our services. In particular, an attack on May 23rd targeted portions of our primary data center. The method of attack was unusual enough that our hosting provider needed nearly an hour to identify it and deploy the mitigations that allowed us to restore services to 100%.

API/SMTP timeouts or “SSL handshake” errors

Beginning earlier in the month, we observed a steady increase in customers experiencing abnormal timeouts when using the Mailgun API/SMTP service. As the number of reports regarding this issue increased, it became clear that there was a systemic issue with our service.

We maintain several different systems for monitoring throughput and latency in our service, but our own data did not match what customers were experiencing. After we completed the investigation of our application, we escalated the issue to our hosting provider so they could inspect our network infrastructure and determine whether the problem lay in our managed load balancer or other network devices.

The initial troubleshooting sessions were inconclusive as the issue itself was intermittent and difficult to reproduce. After several attempts, we were finally able to verify that the requests that were timing out were not reaching our edge networking device, leading us to investigate upstream devices in the network.

We then inspected the device responsible for protecting our infrastructure from DDoS attacks. When this device was disabled, we could no longer reproduce the timeouts, so we immediately began investigating the possible causes. After analysis with our hosting provider’s DDoS team, we discovered that our DDoS mitigation system was impacting legitimate traffic. After adjusting our countermeasures, we were able to eliminate these errors.
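For readers who want to narrow down this kind of failure from the client side, the following is a minimal sketch, not the tooling we used. It times the TCP connect and the TLS handshake separately, which helps show whether a request stalls before or after a connection is established. The function name `probe_tls` and its return format are hypothetical and chosen only for illustration.

```python
import socket
import ssl
import time


def probe_tls(host, port=443, timeout=5.0):
    """Time the TCP connect and the TLS handshake separately.

    Hypothetical helper: returns a dict naming the stage that failed
    (or None) and the duration of each stage in seconds.
    """
    result = {"failed_stage": None, "connect_s": None, "handshake_s": None}

    # Stage 1: plain TCP connection.
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        result["failed_stage"] = "tcp_connect"
        result["connect_s"] = time.monotonic() - start
        return result
    result["connect_s"] = time.monotonic() - start

    # Stage 2: TLS handshake on top of the established connection.
    context = ssl.create_default_context()
    start = time.monotonic()
    try:
        # wrap_socket performs the handshake before returning.
        with context.wrap_socket(sock, server_hostname=host):
            result["handshake_s"] = time.monotonic() - start
    except OSError:  # includes ssl.SSLError and timeouts
        result["failed_stage"] = "tls_handshake"
        result["handshake_s"] = time.monotonic() - start
        sock.close()
    return result
```

Running a probe like this repeatedly and logging the results gives a client-side record that can be compared against server-side metrics when the two disagree.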

Intermittent 421 errors

Mailgun returns a 421 error when we are unable to successfully queue a message. This error is designed to notify the user that the message was not received by Mailgun and should be re-attempted later. This is a normal part of SMTP and signals the sender to retry the message after a delay.
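As an illustration of the retry behavior a sender is expected to implement, here is a minimal sketch that retries on a 421 reply with exponential backoff. The `send` callable, the function name, and the default delays are assumptions made for this example and do not describe any Mailgun client library.

```python
import time

# SMTP reply codes treated as "try again later" in this sketch.
TRANSIENT_CODES = {421}


def send_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-transient SMTP reply code.

    send: hypothetical callable that attempts delivery and returns the
          integer SMTP reply code.
    Waits base_delay * 2**n seconds between attempts (1s, 2s, 4s, ...).
    Returns a (last_code, attempts_made) tuple.
    """
    for attempt in range(1, max_attempts + 1):
        code = send()
        if code not in TRANSIENT_CODES:
            return code, attempt
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    return code, max_attempts
```

Most mail transfer agents already queue and retry on 4xx replies; a loop like this is mainly relevant for applications that speak SMTP directly.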

Last week, we started to see elevated levels of 421s being returned. These errors were caused by performance degradation in our Cassandra clusters, which is where we persist messages.

The Cassandra performance issues were caused by a compaction bug in the version of Cassandra we were running. The bug caused compactions to stall and disk I/O to spike, reducing overall Cassandra throughput. While the cluster was in this condition, we were intermittently unable to store messages, resulting in the 421 errors.

Corrective actions

Alerting – While in many cases our DDoS mitigation system does not cause disruption, we’ve learned that it’s important for our Mailgun engineering team to know when the system is activated. Having this data allows us to determine more effectively whether issues are being caused by these protections. We’ve already worked with our hosting provider to deploy an alerting system that notifies our engineers when these protections activate.
Mitigation Profiles – We are tuning our DDoS mitigation profiles to improve our defensive posture while minimizing the impact to legitimate traffic. We’re working with our hosting provider to develop these profiles and expect this work to be completed this week.
Cassandra Upgrade – We’ve started rolling upgrades of our Cassandra clusters to a version that is not affected by the compaction bug, along with configuration adjustments that are better suited to our type of workload.
Infrastructure Design – We’re in the process of redesigning the underlying Mailgun infrastructure. This effort will give us a more robust network and deployment structure that will reduce the impact of similar types of attacks. This effort is underway and we will be sharing more details in the future.
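The correlation described under Alerting can be sketched as follows. This is a minimal illustration that assumes mitigation activations are available as (start, end) time windows and errors as timestamps; the function name and data shapes are hypothetical and do not reflect our actual alerting system.

```python
from datetime import datetime, timedelta


def errors_during_mitigation(error_times, mitigation_windows,
                             slack=timedelta(0)):
    """Split error timestamps by whether they fall in a mitigation window.

    error_times: iterable of datetime objects, one per observed error.
    mitigation_windows: list of (start, end) datetime pairs during which
        the DDoS mitigation was active (assumed input format).
    slack: optional tolerance added to both ends of each window.
    Returns (inside, outside) lists of timestamps.
    """
    inside, outside = [], []
    for t in error_times:
        if any(start - slack <= t <= end + slack
               for start, end in mitigation_windows):
            inside.append(t)
        else:
            outside.append(t)
    return inside, outside
```

If most errors land inside the windows, the protections are a likely contributor. If they are spread evenly, the cause is more likely elsewhere.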

Finally, while we know these types of incidents are challenging, the Mailgun team is committed to focusing on the plan above and any other steps necessary to ensure that you can continue relying on Mailgun for your email delivery.

Author: Josh Odom

Josh Odom is President of the CPaaS unit at Sinch. Previously, he acted as Chief Technology Officer of Mailgun, and the Senior Product and Engineering leader at Rackspace. He covers a variety of topics for the blog from incidents and data security to product growth.

