What Happened Yesterday And What We Are Doing About It

July 16, 2019

Reported on September 19, 2013.

What happened: message delays and duplicate messages.

What we are doing about it: a more resilient architecture and better communication of issues.

Yesterday was a bad day for Mailgun customers. We experienced significant delays and duplication of emails from approximately 20:40 UTC to 04:40 UTC. While we don’t have any evidence of lost messages, the delays in sending emails were significant. This is unacceptable, so we want to provide you with an outline of what happened and what we are doing to protect against this happening again.

What happened?

At 20:40 UTC, we received a spike in messages that triggered a series of cascading failures in our system. Mailgun usually handles these spikes without problems, but in this case the spike triggered a bug in Mailgun that slowed down our Riak clusters by overloading them with unnecessary requests and consuming excessive storage. The added load on the clusters triggered garbage collection on the nodes, which made the situation worse. Riak survived (thanks to the Basho team for writing robust software), but the slowdown resulted in significant message delays. To recover from the immediate issue, we restarted several processes, which, in some cases, caused duplicate messages to be sent.

Message delays

As the overall performance of the system went down, messages were queued, but not delivered at normal speed. We identified the bug and rolled out the fix, but it took us a while to get clusters back to their normal state and clear out the backlogged queue of messages.
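To illustrate why clearing a backlog takes time even after a fix is in place, here is a minimal sketch. The function name and all figures are hypothetical and are not Mailgun's actual numbers. The point is only that a queue drains at the difference between the delivery rate and the rate at which new messages keep arriving.

```python
def drain_time_seconds(backlog, arrival_rate, delivery_rate):
    """Seconds needed to empty a backlog while new messages keep arriving.

    All rates are in messages per second. Returns None when the queue
    cannot drain because delivery does not outpace arrivals.
    """
    surplus = delivery_rate - arrival_rate
    if surplus <= 0:
        return None
    return backlog / surplus


# Hypothetical figures chosen for illustration only.
print(drain_time_seconds(backlog=360_000, arrival_rate=900, delivery_rate=1_000))  # 3600.0
```

With only a small surplus of delivery capacity over incoming traffic, even a moderate backlog can take a long time to clear.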

Duplicate messages

As our delivery nodes slowed down, our monit scripts started killing and restarting them as part of our emergency recovery procedure. This ungraceful shutdown and restart caused duplicate messages to be delivered for a small number of customers: some messages had been sent but not yet marked as delivered in our system, so they were retried after the process restarted.
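The mechanism can be shown with a small simulation. This is a simplified sketch and not Mailgun's actual delivery code: the `deliver` function, the `Crash` exception and the `crash_after_send` hook are all hypothetical names introduced for illustration. It shows how a process killed between sending a message and recording the delivery sends that message again on retry.

```python
class Crash(Exception):
    """Simulates a delivery process being killed mid-flight."""


def deliver(queue, delivered, outbox, crash_after_send=None):
    """Send every queued message not yet marked as delivered.

    `crash_after_send` is a hypothetical hook: the id of a message after
    whose send (but before whose bookkeeping) the process dies.
    """
    for msg_id in queue:
        if msg_id in delivered:
            continue                  # already recorded, skipped on retry
        outbox.append(msg_id)         # step 1: the message leaves the system
        if msg_id == crash_after_send:
            raise Crash(msg_id)       # killed before step 2 can run
        delivered.add(msg_id)         # step 2: record the delivery


queue = ["a", "b", "c"]
delivered, outbox = set(), []

try:
    deliver(queue, delivered, outbox, crash_after_send="b")
except Crash:
    pass                              # the monitor restarts the process

deliver(queue, delivered, outbox)     # retry after the restart
print(outbox)                         # ['a', 'b', 'b', 'c']
```

Message "b" was sent before the crash but never marked as delivered, so the restarted process sends it a second time. No message is lost, which matches the at-least-once behavior described above.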

What we are doing about it?

This level of performance degradation for Mailgun is unacceptable. Our customers trust us to deliver their mission-critical emails, and we let them down. As a result of this outage, we are going to implement some changes.

More resilient architecture

First of all, we have identified the bug in our system that caused the slowdown and rolled out a fix. In addition, we are re-architecting our core storage and routing processes so that they are more fault-tolerant and perform better in these situations.

Better communication of issues

This event has made it clear that Mailgun’s status page is not always an accurate reflection of Mailgun’s status. Though our API and SMTP services were technically “available” yesterday, significant email delays are, in practice, a service-impacting event, and we should be transparent about that. As a result, we’ve already moved our status page from Pingdom to Statuspage.io so that we can provide a single place for incident alerts. You will be able to subscribe to alerts via SMS, webhook, Twitter or email so you know the moment Mailgun is experiencing issues. Longer term, we will be adding information about Mailgun’s queue size and other metrics that describe performance more fully. In addition, we will be adding more details about each customer’s own queue size and performance to that customer’s Mailgun control panel.

Making it right

We believe that Mailgun should always be available and performant. Significant email delays do not meet that standard. We do offer an SLA, and while this technically did not qualify as an outage, if it affected your business, we’d like to make it right. You can send an email to sla@mailgun.net and we can discuss an appropriate credit as compensation for this issue.

All in all, yesterday was a tough day for our customers and for us. We are very sorry for this issue and we are determined to do everything in our power to make sure a situation like this does not happen again.

The Mailgunners

Author: The Sinch Mailgun team
