Mailgun authentication service: Post mortem July 2018

A review of the July 2018 Authentication Service downtime.

PUBLISHED ON July 16, 2019

This was originally posted on July 18, 2018.

Table of contents

01 This is what happened

02 Why did this happen? What did you do about it?

03 Lessons learned

This is what happened

As part of ongoing work by our engineering teams, several of our internal and external services were updated to delegate authentication to a centralized authentication service. One of those updated services was deployed just after 10:00 UTC.

At 11:00 UTC on Friday, July 13, Mailgun engineering began receiving alerts of problems with several services. Our initial investigation suggested that the problem was related to this software change released earlier in the day, and we initiated immediate efforts to roll back that release.

Continued investigation revealed that, despite the rollback, our authentication services were still not responding in a timely manner. We restarted the authentication (and related) services, and systems began to resume normal operations. By 12:44 UTC, all services were fully functional again.

Why did this happen? What did you do about it?

Before this release, we had deployed an unrelated set of changes to the authentication service. These changes introduced additional latency to the authentication flow and reduced the rate at which requests could be serviced. Combined with the additional load generated by our updated services, authentication requests arrived faster than they could be serviced, so the request queue kept growing. Additionally, failed requests were being retried, which further compounded the load problem.
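To make the dynamics above concrete, here is a minimal sketch of a toy queue model. It is not Mailgun's code, and every number and name in it (`simulate_backlog`, the rates, the retry fraction) is a hypothetical chosen for illustration. It only shows the general effect: once requests arrive faster than they can be serviced, the backlog grows every tick, and retries of waiting requests make it grow faster.

```python
def simulate_backlog(arrival_rate, service_rate, retry_fraction, ticks):
    """Return the backlog size after each tick of a toy queue model.

    All parameters are hypothetical illustration values:
      arrival_rate   -- new requests arriving per tick
      service_rate   -- requests the service can complete per tick
      retry_fraction -- share of waiting requests that are re-sent each tick
      ticks          -- number of time steps to simulate
    """
    backlog = 0.0
    history = []
    for _ in range(ticks):
        # Retried requests are extra work added on top of new arrivals.
        retries = backlog * retry_fraction
        # The queue can never hold fewer than zero requests.
        backlog = max(0.0, backlog + arrival_rate + retries - service_rate)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # Capacity above demand: the queue stays empty.
    print(simulate_backlog(80, 100, 0.1, 5))
    # Demand above capacity, no retries: the backlog grows linearly.
    print(simulate_backlog(120, 100, 0.0, 5))
    # Demand above capacity with retries: the backlog grows faster still.
    print(simulate_backlog(120, 100, 0.1, 5))
```

In this model, added latency corresponds to a lower `service_rate` and the newly migrated services correspond to a higher `arrival_rate`. Either change alone might be absorbed, but together they can push demand above capacity.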

To reduce the impact and restore services, we took several immediate measures:

reducing authentication load by reverting the most recently updated service

removing the circular dependency to reduce latency

restarting authentication services to clear request backlog

Lessons learned

Mailgun engineering has performed a comprehensive root cause analysis of this incident, and we have identified several actions we’ll be taking to reduce the likelihood of future incidents.

In addition to code and configuration changes made to remove unnecessary response latency, we are also in the process of formalizing SLOs. This will help increase our visibility into service latency and introduce more comprehensive data collection, monitoring, and alerting to aid in SLO enforcement.
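As a rough illustration of what a latency SLO check can look like, the sketch below computes the share of requests that finished within a threshold and compares it to a target. This is an assumption-laden example, not a description of Mailgun's monitoring: the function names, the 200 ms threshold, and the 99% target are all hypothetical.

```python
def slo_compliance(latencies_ms, threshold_ms):
    """Return the fraction of requests that completed within threshold_ms.

    An empty sample is treated as compliant (1.0), since there is nothing
    to violate; that choice is an assumption of this sketch.
    """
    if not latencies_ms:
        return 1.0
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms)


def slo_violated(latencies_ms, threshold_ms, target):
    """Return True when compliance falls below the target fraction."""
    return slo_compliance(latencies_ms, threshold_ms) < target


if __name__ == "__main__":
    # Hypothetical sample: 98 fast requests and 2 slow ones.
    samples = [50] * 98 + [500] * 2
    print(slo_compliance(samples, threshold_ms=200))
    print(slo_violated(samples, threshold_ms=200, target=0.99))
```

A check like this, run continuously over recent request latencies, is one way an alert could fire when latency starts to drift, before a backlog has time to build.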

We are also developing tooling to identify potential problem areas earlier in the development and release cycle in order to keep incidents like this from impacting our customers.

We really appreciate the understanding from our customers while we worked to resolve the issue quickly. We’d be happy to answer any questions or address concerns for impacted accounts – just open a support ticket, and our team will get back to you.

Josh Odom

President of the CPaaS unit at Sinch
