Until recently, Mailgun ran in a managed colo environment hosted by our parent company, Rackspace, using bare metal servers. That decision came down to many of the same philosophical reasons that led Rackspace to launch OnMetal servers in July. We weren’t comfortable deploying Mailgun on the public cloud given the reliability and performance problems that you typically need to engineer around. This changed when Rackspace rolled out Performance Cloud Servers in November 2013.
In this post, we’ll provide some background on the rationale behind the change and on how the performance of our infrastructure changed as a result.
We were pretty happy with our managed colo deployment, and we aren’t the only ones. There are many arguments for using dedicated hardware over cloud, but the biggest advantage is having complete control of your hardware and networks, which cannot be achieved with cloud deployments (yet). Other arguments for dedicated over cloud include reliability and performance: cloud is perceived as less robust and less performant when it comes to I/O-intensive loads.
A typical Mailgun deployment would look like this:
In this diagram, the majority of traffic hits a highly available (HA) pair of F5 5000s load balancers, which route all traffic to our dedicated API servers.
These F5 load balancers are pretty performant pieces of hardware.
They are capable of handling:
L7 requests per second: 750K
L4 connections per second: 350K
L4 HTTP requests per second: 3.5M
Maximum L4 concurrent connections: 24M
Throughput: 30 Gbps (L4) / 15 Gbps (L7)
In addition to that, these beasts deal with SSL termination and are capable of mitigating some DDoS attacks.
Dell R720s are the Mailgun workhorses, used as both database and processing servers. They are equipped with 64GB RAM, one or more 15K RPM SAS drives, depending on configuration, and one or two 10Gb/s NICs.
Cloud servers were used only for auxiliary tasks, such as logging, creating backups and running various jobs; 90% of the environment was located on dedicated hardware.
Overall, the Mailgun team was pleased with the existing state of things. So, why did we migrate?
There are different clouds available, but we’ve been most excited about one particular cloud – Rackspace Performance Cloud. Here are the benchmarks comparing the performance of the previous generation servers to the new Performance Cloud Servers that show a huge improvement:
These benchmarks indicate that we can start using Cloud Servers to host our Cassandra clusters, because the SSDs provide a huge boost in speed:
In addition to that, the new cloud boxes are more performant than the standard R720s that we’ve been setting up in our managed colo environments. For example, the UnixBench score for our R720s is 1219, compared to an impressive 4876 on the new Performance Cloud Servers.
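To put those two scores side by side, here is a minimal sketch that computes the ratio between them. The function and variable names are made up for illustration; only the two UnixBench scores come from the text above.

```python
def speedup(baseline_score: float, new_score: float) -> float:
    """Return how many times higher new_score is than baseline_score."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return new_score / baseline_score


# UnixBench scores quoted in the post; the names are illustrative only.
r720_score = 1219
performance_cloud_score = 4876

print(f"{speedup(r720_score, performance_cloud_score):.1f}x")  # prints "4.0x"
```

In other words, the Performance Cloud Servers score four times higher than the R720s on this benchmark. A single UnixBench index is a coarse measure, so treat the ratio as a rough indicator rather than a prediction for any specific workload.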
Rackspace Cloud Servers can be deployed rapidly and operated through an API, which really helps with automated provisioning, auto scaling, and all the usual perks of operating in the cloud. (Note: Rapid API-based provisioning is also available for bare metal servers through Rackspace’s OnMetal offering. We look forward to augmenting our infrastructure with those beefy servers in the future.)
Still, two major concerns held us back from migrating to the cloud.
Cloud networking can be a bottleneck, especially when it comes to high-frequency packet exchange. We saw huge performance degradation with Redis once we hit 5-6K packets per second.
In addition to that, throughput was also a major concern: 1Gb/s was not enough.
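Packet rate and throughput are separate limits, and a rough calculation shows why. The sketch below converts a packet rate into bandwidth. It assumes full-size 1500-byte packets (a common Ethernet MTU, not a figure from the post), which overstates typical Redis traffic, so the result is an upper bound.

```python
def bandwidth_mbps(packets_per_second: int, bytes_per_packet: int) -> float:
    """Convert a packet rate and packet size into megabits per second."""
    return packets_per_second * bytes_per_packet * 8 / 1_000_000


# Assumption: every packet is a full 1500-byte frame (upper bound).
MTU_BYTES = 1500
LINK_MBPS = 1000  # the 1Gb/s link mentioned above

upper_bound = bandwidth_mbps(6000, MTU_BYTES)
print(f"{upper_bound:.0f} Mb/s of a {LINK_MBPS} Mb/s link")
```

Even under that generous assumption, 6K packets per second comes to about 72 Mb/s, a small fraction of a 1Gb/s link. That suggests the Redis slowdown came from per-packet overhead in the virtualized network rather than from raw bandwidth, which is why the two concerns are listed separately.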
This was a major roadblock for us until Rackspace released dual, bonded 10Gbps non-virtualized networking for Performance Cloud Servers, plus separate NICs for CBS. Our benchmarks showed that the new networking is robust and does not suffer from the degradation that the previous software networking did.
The internet is full of stories of cloud instability bringing entire businesses down, so it’s kinda scary to rely on this. However, the Cloud Servers team promised robust reliability relative to typical cloud providers, so we decided to test it out.
Our new environment uses the same type of load balancers, but in this case they are connected via Rackspace’s RackConnect and route all traffic directly to the cloud.
We spent several weeks load testing this link, hitting our SMTP and HTTP API servers residing in the cloud, and didn’t notice any performance degradation.
A typical Mailgun Performance Cloud Servers deployment uses around a hundred servers per region. Over the last three months, two servers went down due to problems with the host server, which is roughly comparable to what we experienced on our dedicated servers, where around 1 box out of 100 went down every month.
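The comparison is easier to see when both figures are normalized to failures per server per month. The sketch below does that, treating "around a hundred servers" as exactly 100 for both fleets, which is an assumption made here for the arithmetic.

```python
def monthly_failure_rate(failures: int, servers: int, months: float) -> float:
    """Return failures per server per month, as a fraction."""
    if servers <= 0 or months <= 0:
        raise ValueError("servers and months must be positive")
    return failures / (servers * months)


# Assumption: "around a hundred servers" is treated as exactly 100.
cloud = monthly_failure_rate(failures=2, servers=100, months=3)
dedicated = monthly_failure_rate(failures=1, servers=100, months=1)

print(f"cloud:     {cloud:.2%} per month")      # cloud:     0.67% per month
print(f"dedicated: {dedicated:.2%} per month")  # dedicated: 1.00% per month
```

With so few failures in the sample, the difference between 0.67% and 1% is well within noise, so "roughly equal" is the fair reading.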
We should note, though, that Mailgun uses the large and extra large (64GB and 128GB) flavors of Performance Cloud Servers, so you may observe different results with smaller flavors.
One drawback that we’ve seen so far is Rackspace’s policy for DC-wide maintenance. Such maintenance can lead to multi-second downtime for a significant portion of an environment, so your application may need to be multi-DC from the start if you can’t tolerate downtime. We experienced one such maintenance in April, but thankfully, we have multiple environments to fall back on.
Overall, we have been quite pleased with our migration. We can now use the Rackspace Cloud API to provision our servers, with the additional benefit of a more performant fleet compared to our managed colo environment.
Last updated on August 27, 2019