Software bugs and how to fix them faster
The true cost of software isn't in the creation of new software, but in its maintenance. If this is true, then the strategies we employ to help reduce overall costs lie in day-to-day code maintenance. Don’t believe us? Keep reading.
The cost of debugging isn’t the same for everyone. Cost doesn’t just depend on operation and service fees, but on how much technical debt you have.
When we talk about technical debt, we’re talking about the cost incurred when businesses do not fix problems that will affect them in the future. This can get expensive, but what if there are patterns or strategies that developers can leverage to reduce the time it takes to identify and resolve bugs?
In this article, we’re putting our money where our mouth is and sharing one strategy that we’ve found to be most effective in reducing the time to resolve bugs.
What is a software bug?
A software bug is just a coding error in a program. Bugs are common enough in a single program or application but if we zoom out to look at an entire platform of integrated services and interconnected code, the prospect of bug hunting starts to resemble that old saying about needles and haystacks.
Fixing a bug takes longer than writing a line of code. So, it makes sense to invest in your software development from the beginning. Right?
The cost of success: Software development processes
Setting yourself up for success comes at a high up-front cost in both time and resources. Custom software development can have an average price tag of up to $250k depending on your infrastructure and needs, so it's cost-critical to plan your maintenance strategy rather than incur the cost of constantly creating new code.
Modeling your production environment
You have as much chance of reproducing a bug in a local or controlled environment as a rat does of surviving at a cat farm – unless you have modeled your production environment as closely as possible.
This solution can take many forms, including the common practice of creating a staging or development environment. But these are expensive, and long-lived environments tend to drift over time. In Derrick’s 20 years of experience as a developer, nothing comes close to the ability to reproduce a problem on your local box. This means you need the ability to model the production environment locally in the most realistic and reproducible manner possible.
The most successful implementations of this we have seen are in architectures that use micro-services or, as we like to call them, “domain scoped services”. In these environments, the goal is to stand up each dependent service so that the service you are working on can't tell the difference between running on your local box and running in production.
How can you make this happen? The easiest way is to require service owners to provide an easy-to-use fake or mock implementation of their service's public interface. This is a perfect example of setting yourself up for success. Creating these mocks takes time and resources. If you have a developer culture of providing these tools, the goal of modeling your production environment in code becomes much easier, as most of the hard work of implementing a fake or mock service is done for you. If not, you have to pay the price of creating it yourself.
If your organization is using Golang, you can import mock services directly from the dependent service's code, which makes keeping up to date with changes simple.
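As a minimal sketch of what an owner-provided fake might look like, the Go snippet below defines a small service interface and an in-memory fake that satisfies it. Every name here (DomainStore, FakeDomainStore, CanSend) is hypothetical and invented for illustration; it is not Mailgun's actual code.

```go
package main

import (
	"errors"
	"sync"
)

// Domain and DomainStore are hypothetical: they stand in for the public
// interface that a service owner would publish alongside the real client.
type Domain struct {
	Name     string
	Verified bool
}

type DomainStore interface {
	Lookup(name string) (Domain, error)
}

// ErrNotFound is returned when a domain is unknown to the store.
var ErrNotFound = errors.New("domain not found")

// FakeDomainStore is an in-memory implementation of DomainStore that the
// owning team would maintain, so consumers can import it instead of writing
// their own.
type FakeDomainStore struct {
	mu      sync.RWMutex
	domains map[string]Domain
}

// NewFakeDomainStore seeds the fake with the given domains.
func NewFakeDomainStore(seed ...Domain) *FakeDomainStore {
	f := &FakeDomainStore{domains: make(map[string]Domain)}
	for _, d := range seed {
		f.domains[d.Name] = d
	}
	return f
}

// Lookup returns the seeded domain or ErrNotFound.
func (f *FakeDomainStore) Lookup(name string) (Domain, error) {
	f.mu.RLock()
	defer f.mu.RUnlock()
	d, ok := f.domains[name]
	if !ok {
		return Domain{}, ErrNotFound
	}
	return d, nil
}

// CanSend is consumer code. It depends only on the interface, so it cannot
// tell whether it is talking to the real service or to the fake.
func CanSend(store DomainStore, name string) (bool, error) {
	d, err := store.Lookup(name)
	if errors.Is(err, ErrNotFound) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return d.Verified, nil
}
```

A consumer would call NewFakeDomainStore(Domain{Name: "example.com", Verified: true}) in its tests and pass the result to CanSend. Because the fake lives next to the real implementation, a change to the interface breaks the fake at compile time rather than silently drifting.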
When mocking your services, you can't just recreate external interfaces; you also have to include the transport protocols. You might be asking yourself, why? Wouldn't it be better to avoid making costly remote calls altogether? Would it be cheaper? Yes, but it wouldn't be effective. The goal is to model the production environment as closely as possible, and this includes the transport your app uses to communicate with other services. Whether it's HTTP, gRPC, or plain ol' TCP, minor variations in the transport stack or marshaling libraries can have very subtle impacts on how your code operates in production.
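To illustrate a fake that keeps the transport in the loop, here is a rough sketch using Go's standard net/http/httptest package. The fake listens on a real loopback port, so the client code goes through actual HTTP and JSON encoding and decoding. The Recipient type, the /recipients path, and both function names are assumptions made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// Recipient is a hypothetical payload exchanged between two services.
type Recipient struct {
	Address string `json:"address"`
	Name    string `json:"name"`
}

// NewFakeRecipientServer starts a real HTTP server on a loopback port that
// serves the given records as JSON. Callers must Close the returned server.
func NewFakeRecipientServer(data map[string]Recipient) *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/recipients", func(w http.ResponseWriter, r *http.Request) {
		rec, ok := data[r.URL.Query().Get("address")]
		if !ok {
			http.Error(w, "not found", http.StatusNotFound)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		// The response is marshaled for real, so encoding bugs can surface.
		json.NewEncoder(w).Encode(rec)
	})
	return httptest.NewServer(mux)
}

// FetchRecipient is the client code under test. It performs a real HTTP
// request and unmarshals the JSON body, exactly as it would in production.
func FetchRecipient(baseURL, address string) (Recipient, error) {
	var rec Recipient
	resp, err := http.Get(baseURL + "/recipients?address=" + url.QueryEscape(address))
	if err != nil {
		return rec, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return rec, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	err = json.NewDecoder(resp.Body).Decode(&rec)
	return rec, err
}
```

Seeding the fake with a record whose name contains non-ASCII characters and fetching it through FetchRecipient(srv.URL, ...) exercises URL escaping, the HTTP stack, and both directions of JSON handling, which a fake that skips the network would not.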
Story time: How we learned this at Mailgun
We had a service that switched JSON marshaling libraries in an effort to increase performance, but this unintentionally introduced a Unicode parsing issue. Mocking the interface instead of actually calling the transport and marshal functions would have hidden this issue from us.
Another unexpected issue was one that involved a change to the Golang DNS library, which would have completely broken our production environment if not for our functional suite – and yes, we run a DNS mock implementation for our tests.
Creating functional tests to find edge cases and diagnose issues
Okay, so once you've modeled your environment, the first thing you want to do is create a functional testing suite. A functional testing suite is a collection of tests, together with the tooling that executes them and reports their results. You can add test cases and plans to your suites to cover a variety of scenarios and edge cases. The better your suite, the easier it will be to diagnose an issue.
If the issue you are trying to solve exists as a functional test, you can easily (and quickly) replace the in-test values with the exact values found in production. Just reproducing the issue locally by re-creating the exact data from production can lead to the solution. If you have made it easy to import data from production or simulate data from production, this can be an invaluable tool in your belt.
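One way to make swapping in production values cheap is to keep each functional test as a row of data. The sketch below is only an illustration under stated assumptions: messagesHandler is a made-up stand-in for the service under test, and Scenario and RunScenarios are hypothetical helpers, not part of any real library.

```go
package main

import (
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

// Scenario is one functional test case: form values in, status code out.
type Scenario struct {
	Name       string
	Form       map[string]string
	WantStatus int
}

// messagesHandler is a hypothetical stand-in for the service under test.
// It accepts a message only if the "to" field looks like an address.
func messagesHandler(w http.ResponseWriter, r *http.Request) {
	to := strings.TrimSpace(r.PostFormValue("to"))
	if to == "" || !strings.Contains(to, "@") {
		http.Error(w, "invalid recipient", http.StatusBadRequest)
		return
	}
	io.WriteString(w, "queued")
}

// RunScenarios posts each scenario to the handler over a real loopback HTTP
// server and returns the names of the scenarios that did not behave as
// expected. An empty result means every scenario passed.
func RunScenarios(scenarios []Scenario) []string {
	srv := httptest.NewServer(http.HandlerFunc(messagesHandler))
	defer srv.Close()

	var failed []string
	for _, s := range scenarios {
		form := url.Values{}
		for k, v := range s.Form {
			form.Set(k, v)
		}
		resp, err := http.PostForm(srv.URL, form)
		if err != nil {
			failed = append(failed, s.Name)
			continue
		}
		resp.Body.Close()
		if resp.StatusCode != s.WantStatus {
			failed = append(failed, s.Name)
		}
	}
	return failed
}
```

With this shape, reproducing a report means adding one more Scenario whose Form holds the values seen in production (for example, a recipient with stray whitespace and a trailing "\r\n") and running the suite again. The handler and code around it stay untouched.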
If you don't have an existing functional test covering the scenario (which is common, since users often find annoying err... unique and unanticipated ways of using your system), you will have to create it.
Part of setting ourselves up for success involves conducting functional tests or having the ability to quickly create a functional test. Unit tests (testing the smallest piece of code that can be logically isolated in a system) can be useful once you have narrowed down the issue. A functional test allows you to draw broad strokes across your service and understand from the customer's perspective how the service operates under a particular scenario. When debugging, testing the product is always more important than testing the code.
Create automated functional tests within your suites that will execute test cases automatically. Use this as part of your QA program to uncover and fix bugs before your application goes to production.
Why you should avoid manual testing and what to do instead
The opposite of automated functional testing is manual testing. Manual testing is a type of software testing in which test cases are executed by a tester without using any automated tools. You should always favor functional testing over manual testing, since a manual test is error-prone and not reproducible.
Many programmers make the mistake of running a service locally and manually testing endpoints to diagnose issues. Having done this for many years, we can tell you first-hand that we've lost count of how many times we've reproduced an issue without knowing exactly how we did it, which has led to much flipping of tables while attempting to retrace our steps.
Debugging sessions often go from hours to days, and with manual testing all the scenarios we try go undocumented and blur together. However, if all these attempted scenarios are written as functional tests, each scenario can become a part of the history of the debugging session.
If you write functional tests for new scenarios, they can become part of the regression suite of tests run during CI.
There is a secret to functional testing. Writing functional tests to diagnose issues is a key strategy for minimizing time to bug resolution. However, this method of diagnosis doesn't work if writing the functional test is difficult. The solution? Simplify things from step one.
Create a suite of helper functions and simple asserts. The goal is to make actions like retry loops and importing suites of data VERY easy to do. These helper functions should be among the first tools created in the initial write of the software to support new functional tests down the line.
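As a rough idea of how small such helpers can be, here is a sketch of a retry loop and a simple assert in Go. Retry and AssertEqual are hypothetical names chosen for this example, and the signatures are one possible design rather than an established API.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// Retry calls fn until it returns nil or the attempts run out, sleeping for
// delay between tries. It returns nil on the first success, otherwise the
// last error wrapped with the number of attempts made.
func Retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

// AssertEqual returns a descriptive error when got and want differ, so it
// can be used directly as the body of a Retry callback.
func AssertEqual(got, want interface{}) error {
	if !reflect.DeepEqual(got, want) {
		return fmt.Errorf("got %v, want %v", got, want)
	}
	return nil
}
```

In a functional test, a call such as Retry(10, 100*time.Millisecond, func() error { return AssertEqual(queueLen(), 1) }) waits for an asynchronous side effect without hand-written polling code; queueLen is a placeholder for whatever the test needs to observe.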
An example: How we did this at Mailgun
The primary /messages API at Mailgun is handled by a service we call influx. This service ingests both HTTP and SMTP messages in our system and talks to just about every other service in our message sending suite of services.
As a result, we have invested heavily in our functional testing suite and have modeled our production environment in code for this very public part of our service catalog. We have created over 700 tests, most of which are functional, and the suite completes in around six minutes on a local box.
Many of these tests were added during debugging or diagnostic sessions and were then added to our suite of tests. If a new scenario is needed, it is added to the functional suite and forms a protection against future regressions.
The true cost of software development is in the maintenance and diagnosis of code. That’s why setting yourself up for success by creating a suite of tools you can use to decrease the time to resolution is worth the extra time and money.
That early investment will continue to pay dividends the longer the project continues.