What Is Outbound Spam & How To Prevent It

June 14, 2022

Spam sucks. That’s the long and short of it.

We’ve talked about how to identify and prevent spam, but what if you’re part of the problem?

We’re talking about outbound spam exiting your email server. In this article, we’ll go over what outbound spam is and how to fight outbound spam. Then, we’ll dive into our machine learning tutorial on how to detect and prevent outbound spam.

What is outbound spam?

How can I fight outbound spam?

How can I prevent outbound spam with machine learning?
1. Build a dataset
2. Fit different models to our data
3. Validate our model

Wrapping up

What is outbound spam?

Outbound spam is exactly what it sounds like: spam that exits your network and lands in the inbox of unsuspecting customers or subscribers. This happens when spammers infiltrate your network and exploit it to send spam.

Email service providers (ESPs) like Mailgun and businesses like yours need to watch out for outbound messages that are considered spam, because they can destroy your IP reputation and lead to IP blocklisting. Blocklisting means that a recipient’s internet service provider (ISP) has detected spammy behavior and has decided to block all messages, legitimate or not, that originate from your IP address.

This is terrible news for your email programs or marketing campaigns. After all, you need emails to land in your subscribers’ inboxes so they can open your messages and engage with your brand.

How can I fight outbound spam?

Fighting outbound spam takes constant vigilance. For instance, you can use outbound spam filters. These filters, installed in your own network, can be configured to identify individual senders based on authentication protocols. This way, you can track who is sending emails on your network and identify spam-like behavior.
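
As a rough illustration of tracking senders, the R sketch below counts messages per authenticated sender inside a time window and flags anyone above a limit. The log layout, the flagsenders function, the one-hour window, and the limit of 3 are all hypothetical choices made for this example, not settings of any particular filter.

```r
## Hypothetical send log: one row per outbound message, keyed by the
## authenticated sender. Names and timestamps are made up.
sendlog <- data.frame(
  sender = c("alice", "bob", "bob", "bob", "bob", "carol", "bob"),
  sent   = as.POSIXct("2022-06-14 09:00:00", tz="UTC") +
           c(0, 10, 20, 30, 40, 1800, 4000)
)

## Return the senders with more than `limit` messages in the window of
## `windowsecs` seconds that ends at `now`.
flagsenders <- function(log, now, windowsecs, limit) {
  recent <- log[log$sent > now - windowsecs & log$sent <= now, ]
  counts <- table(recent$sender)
  as.character(names(counts)[as.vector(counts) > limit])
}

now <- as.POSIXct("2022-06-14 10:00:00", tz="UTC")
print(flagsenders(sendlog, now, windowsecs=3600, limit=3))
```

A fixed threshold like this is only a starting point. The machine learning walkthrough later in this article learns the boundary from data instead.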

How can I prevent outbound spam with machine learning?

Spam identification and prevention techniques usually rely on vigilant email users or third-party software. However, spam sometimes slips through the cracks because it’s disguised. Spam doesn’t always look like spam.

Sometimes scammers pretend to be legitimate online retailers by using a fake website or a fake ad on a real shopping site. In these cases, both the email messages and embedded links will look legitimate even though they belong in your junk folder.

How can we make spam detection better? At Mailgun, we believe in using machine learning to power everyday tasks, such as parsing HTML quotations.

Below, we’ll walk through how you can apply machine learning to build a model to detect spam. We’ll start by building a dataset. Then, we’ll try to fit different models to the data. Lastly, we’ll validate our model. We used R for our analysis, but you can use another tool like scikit-learn, Weka, or MOA.

1. Build a dataset

First, let’s take a look at how spammers behave. We’ll use their behavioral patterns to build a model that identifies when our outgoing mail looks like spam.

To differentiate between spam that looks legitimate and actually legitimate messages, we focused on the speed with which spammers send a particular amount of spam.

We classified over 1000 email accounts and collected the following data:

timepassed: The time elapsed before an account starts sending messages
time2send: The time it takes an account to send a particular number of messages
class: Whether an account is legitimate or spam

Here is a sample from our dataset:

    timepassed,time2send,class
    252202,961501,legitimate
    391006,11291,spam
    ...

It’s just a CSV file with one row per account.
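
To make the two features concrete, here is a minimal R sketch of how they might be derived from raw timestamps. It assumes that timepassed is measured from account creation, that both values are in seconds, and that the “particular number of messages” is a parameter n. The article does not state any of these, and the accountfeatures function and its arguments are hypothetical.

```r
## Derive timepassed and time2send for one account.
##   created: account creation time (assumed reference point)
##   sent:    timestamps of the account's outbound messages
##   n:       number of messages used to measure time2send (assumption)
accountfeatures <- function(created, sent, n) {
  sent <- sort(sent)
  if (length(sent) < n) {
    ## not enough messages yet to measure time2send
    return(c(timepassed=NA_real_, time2send=NA_real_))
  }
  c(timepassed = as.numeric(difftime(sent[1], created, units="secs")),
    time2send  = as.numeric(difftime(sent[n], sent[1], units="secs")))
}

## made-up example: first message 2 hours after signup,
## third message 5 minutes after the first
created <- as.POSIXct("2022-06-01 00:00:00", tz="UTC")
sent <- created + c(7200, 7260, 7500, 9000)
print(accountfeatures(created, sent, n=3))
```

For this made-up account, timepassed is 7200 seconds and time2send is 300 seconds.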

We reserved two-thirds of the dataset for training and one-third for validation:

    ## read the dataset; keep class as a factor so models treat it as a label
    x <- read.csv(file="firstXmessages.csv", sep=",", header=TRUE, stringsAsFactors=TRUE)

    ## pick rows classified as spam
    spam <- x[x$class == "spam", ]

    ## shuffle the data points
    totalspam <- nrow(spam)
    spam <- spam[sample(totalspam), ]

    ## keep 1 / 3 for validation and 2 / 3 for training
    ## (floor() keeps the row count a whole number)
    validatespamrows <- floor(totalspam / 3)
    validatespam <- spam[seq_len(validatespamrows), ]
    trainspam <- spam[validatespamrows + seq_len(totalspam - validatespamrows), ]

    ## repeat for legitimate accounts
    legit <- x[x$class == "legitimate", ]
    totallegit <- nrow(legit)
    legit <- legit[sample(totallegit), ]
    validatelegitrows <- floor(totallegit / 3)
    validatelegit <- legit[seq_len(validatelegitrows), ]
    trainlegit <- legit[validatelegitrows + seq_len(totallegit - validatelegitrows), ]

    ## merge legitimate and spam data points together and shuffle
    train <- rbind(trainspam, trainlegit)
    train <- train[sample(nrow(train)), ]

    validate <- rbind(validatespam, validatelegit)
    validate <- validate[sample(nrow(validate)), ]

    ## validation features without the class label
    validatex <- subset(validate, select=-class)
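
The same split can also be written once as a reusable helper instead of being repeated for each class. The sketch below is one way to do it. The stratifiedsplit function, its parameters, and the toy data are hypothetical and not part of the original analysis.

```r
## Split a data frame into training and validation parts, sending one
## k-th of every class to validation (a stratified split).
stratifiedsplit <- function(x, classcol="class", k=3) {
  rowsbyclass <- split(seq_len(nrow(x)), x[[classcol]])
  validateidx <- unlist(lapply(rowsbyclass, function(rows) {
    rows <- rows[sample(length(rows))]   # shuffle within the class
    rows[seq_len(length(rows) %/% k)]    # first 1 / k of the shuffled rows
  }), use.names=FALSE)
  trainidx <- setdiff(seq_len(nrow(x)), validateidx)
  ## shuffle again so the classes are interleaved
  list(train    = x[trainidx[sample(length(trainidx))], , drop=FALSE],
       validate = x[validateidx[sample(length(validateidx))], , drop=FALSE])
}

## toy data: 6 spam rows and 3 legitimate rows (values are made up)
toy <- data.frame(timepassed=1:9, time2send=9:1,
                  class=rep(c("spam", "legitimate"), times=c(6, 3)))
parts <- stratifiedsplit(toy)
print(table(parts$validate$class))
```

With the toy data, validation receives 2 spam rows and 1 legitimate row, and the remaining 6 rows go to training.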

Now, we’re ready to model our data.

2. Fit different models to our data

After building our dataset, we tried out different models for classification. Classification assigns each data point to one of a fixed set of categories. In this case, we’re trying to label each account as spam or legitimate.

First, we tried a Support Vector Machine (SVM) with a linear kernel, which tries to separate the two classes with a straight line. The plot below gives you a good understanding of how the data points are distributed:

Support Vector Machines graph — *Support Vector Machines model with a linear kernel showing the distribution of spam and legitimate emails.*

The X-axis is “time to send” and the Y-axis is “time elapsed before sending”. Red marks indicate spam data points, and black marks indicate legitimate data points. From this chart, it seems that spammers start sending sooner and send faster.
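
The article does not show the code behind the SVM or name the library used. As a sketch only, a linear-kernel SVM could be fitted and plotted with the e1071 package, which is an assumption on our part. The toy data below is randomly generated as a stand-in for the real training set.

```r
library(e1071)   # assumed package; the article does not name the SVM library

## toy stand-in for the training set (values are randomly generated)
set.seed(1)
toytrain <- data.frame(
  timepassed = c(runif(20, 0, 100), runif(20, 200, 300)),
  time2send  = c(runif(20, 0, 100), runif(20, 200, 300)),
  class      = factor(rep(c("spam", "legitimate"), each=20))
)

## fit a linear-kernel SVM on both features
svmfit <- svm(class ~ ., data=toytrain, kernel="linear")

## plot the decision regions: time2send on X, timepassed on Y
plot(svmfit, toytrain, timepassed ~ time2send)

## compare predicted and actual classes on the toy data
svmpred <- predict(svmfit, toytrain)
print(table(actual=toytrain$class, predicted=svmpred))
```

On real data, the two classes overlap far more than in this toy set, which is why validation on held-out accounts matters.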

However, this model didn’t pass validation. We’ll talk more about validation in the next section. For now, let’s talk about the second model we tried: Classification And Regression Tree (CART).

    library(rpart)

    ## grow tree; use the column name class, not train$class,
    ## so that "." excludes class from the predictors
    fit <- rpart(class ~ ., method="class", data=train)

    ## plot tree
    plot(fit, uniform=TRUE,
         main="Classification Tree for how fast spammers send")
    text(fit, use.n=TRUE, all=TRUE, cex=.8)

Classification and regression tree model — *Classification And Regression Tree model for spam and legitimate emails.*

This looks promising. At each node, we move to the left branch when the stated condition is true. As you can see, as time2send increases, so does the fraction of legitimate accounts. The opposite is true for spam accounts, similar to what we saw in the SVM plot above.

Let’s validate our models in the next section.

3. Validate our model

We want to assess our model using Precision and Recall performance metrics.

Precision and Recall explainer (source)

“Precision” refers to the fraction of relevant instances among the retrieved instances. For us, Spam Precision is the fraction of actual spam accounts among all the accounts our model classified as spam.

“Recall” refers to the fraction of relevant instances that were retrieved. For us, Spam Recall is the fraction of all actual spam accounts that our model classified as spam.

To create a good spam filter, we can allow the occasional spam message, but we should not prevent any legitimate emails from being delivered. We want to focus on the following metrics for our model to ensure we can accurately distinguish between spam and legitimate emails:

High Spam Precision means that most accounts classified as spam are indeed spam.
High Legitimate Recall means that we don’t misclassify legitimate accounts as spam.
High Spam Recall means that we catch the majority of spam accounts.
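
The validation code in this section calls an Evaluate helper whose definition isn’t included in the article. As a minimal sketch of what such a helper might compute, the function below derives per-class Precision and Recall from a confusion matrix, assuming actual classes are in rows and predicted classes are in columns. The example counts are back-calculated from the linear SVM metrics reported in this section. They are an assumption, not the original validation output.

```r
## Per-class precision and recall from a square confusion matrix with
## actual classes in rows and predicted classes in columns, in the same
## order, as produced by table(actual, predicted).
Evaluate <- function(cm) {
  truepositives <- diag(cm)
  data.frame(
    Class     = rownames(cm),
    Precision = as.vector(truepositives / colSums(cm)),  # correct / predicted as class
    Recall    = as.vector(truepositives / rowSums(cm))   # correct / actually in class
  )
}

## counts back-calculated from the reported SVM metrics (assumption)
cm <- as.table(matrix(c(45, 19, 30, 223), nrow=2,
  dimnames=list(actual=c("legitimate", "spam"),
                predicted=c("legitimate", "spam"))))
print(Evaluate(cm=cm))
```

With these counts, 45 of 75 legitimate accounts are classified correctly, which gives a Legitimate Recall of 0.6, and 45 of the 64 accounts predicted as legitimate really are, which gives a Legitimate Precision of about 0.703.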

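Below are the validation code and the Precision and Recall metrics for our first model, the linear SVM:

    ## predict classes for the validation set; fit is the trained model under test
    y <- predict(fit, validatex, type="class")
    ## rows are the actual classes, columns are the predicted classes
    confusionmatrix <- table(validate$class, y)
    ## Evaluate is the author's helper that reports per-class Precision and Recall
    print(Evaluate(cm=confusionmatrix))

    Class       Precision    Recall
    legitimate  0.7031250 0.6000000
    spam        0.8814229 0.9214876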

As you can see, we only have 60% Legitimate Recall, which means 40% of legitimate accounts would be misclassified as spam. This is far too low.

Let’s see the validation for our second model, CART:

    Class       Precision    Recall
    legitimate  0.8333333 0.6666667
    spam        0.9027237 0.9586777

The metrics look better, but Legitimate Recall still isn’t very high.

As a final step, we decided to see how our CART model fits into an SVM visualization chart, as shown below:

CART model on SVM chart — *SVM chart showing the CART model capturing most of the spam in the quadrants in the bottom left corner of the chart.*

The data points in the quadrants that emerged are almost all spam. This is what we needed: a simple model that makes sense and catches only spammers.

This model is a rough first step to using machine learning to catch spam. The next step would be to add more features to our dataset. We encourage you to incorporate machine learning into your everyday tasks, like in this walkthrough.

Wrapping up

We’ve given you some tips on how to stop your outbound emails from becoming outbound spam. We’ve also included a machine learning walkthrough for modeling spam detection, as a bonus for advanced users.

Are you on the other side of the spam equation, finding that your Mailgun email campaigns get lost in your subscribers’ spam folders? Check out our blog post about staying on the good side of email providers.

Author: Sergey Obukhov Software Developer at Sinch Mailgun
