Machine Learning for Everyday Tasks

July 16, 2019

Machine learning is often thought to be too complicated for everyday development tasks. We often associate it with things like big data, data mining, data science, and artificial intelligence. Sometimes it feels something like this:

Table of contents

Machine Learning is hard
Real life example
Parsing HTML from public internet
Cluster analysis to the rescue
Collect a dataset
Determine the number of clusters
Clustering
Classification
Validation
Lessons Learned
Useful links

Machine Learning is hard

I have always felt like we can benefit from using machine learning for simple tasks that we do regularly.

Real life example

At Mailgun, we work with email, and as part of our offering we parse HTML quotations. This allows a user to grab the latest reply instead of the entire conversation, which is returned as part of our webhook response. You can read more about how we handle inbound message processing in our documentation.

For those of you who don’t know, here’s what parsing HTML from the public Internet looks like:

Parsing HTML from public internet

It’s messy and sometimes processes get stuck.

Changing the parsing library can help, but it won’t solve the issue completely because every library has its limitations. You have to restrict the parsing to something reasonable.

Cluster analysis to the rescue

But what should the criteria and threshold be? Should we limit by HTML length or tag count? Maybe both? Maybe by something else? The objective obviously is to process as many messages as possible without shooting yourself in the foot, but the path isn’t super obvious.

That’s where cluster analysis and statistical classification come in handy. For this research I used R, but depending on your task and personal preferences you might use something else: scikit-learn, Weka, MOA, etc.

Collect a dataset

First, we logged the HTML length, tag count, and processing time of each message and put them into a CSV file:

                            

    htmllen,tagscount,took
    2893762,85527,34.300139904
    31378,518,0.0368919372559
    19105,413,0.0545339584351
    ...
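Rows like these could be produced by a small measuring wrapper around the parser. The sketch below is only an illustration in Python: `measure`, `write_rows`, and the `parse` callable are hypothetical names, and counting `<` characters as the tag count is an assumption borrowed from the production check shown later in this post, not a description of how the original logging was done.

```python
import csv
import io
import time


def measure(html, parse):
    """Return (htmllen, tagscount, took) for one message.

    `parse` is a hypothetical stand-in for the real quotation parser.
    """
    started = time.perf_counter()
    parse(html)
    took = time.perf_counter() - started
    # Assumption: counting '<' is a cheap approximation of the tag count.
    return len(html), html.count("<"), took


def write_rows(rows, out):
    """Write measurements as CSV using the column names from this post."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["htmllen", "tagscount", "took"])
    writer.writerows(rows)


if __name__ == "__main__":
    samples = ["<p>hi</p>", "<div><b>reply</b></div>"]
    buffer = io.StringIO()
    # len() stands in for the parser here, purely so the example runs.
    write_rows([measure(s, parse=len) for s in samples], buffer)
    print(buffer.getvalue())
```

In a real service you would write to a log or file rather than an in-memory buffer, and you would probably sample messages instead of measuring every one.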

The vast majority of messages take a fraction of a second to process, so when collecting the dataset we had to make sure it contained enough “slow” messages.

We ended up with two CSV files, collected on different days. One had 13831 lines and was reserved for analysis and model training (the train dataset). The other had 12149 lines and was reserved for model validation (the validation dataset).

Generally, you want at least two datasets: one for training and one for validation. Otherwise you risk overfitting, where your model is well adjusted to the training data but fails in the real world.

Determine the number of clusters

To visualize the data and look for patterns, we first tried k-means clustering:

                            

    messages <- read.csv(file="messages080816.csv", sep=",", head=TRUE)
    mydata <- matrix(messages$took, ncol=1)

    ## Determine the number of clusters
    wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
    for (i in 2:15) wss[i] <- sum(kmeans(mydata,
        centers=i)$withinss)
    plot(1:15, wss, type="b", xlab="Number of Clusters",
        ylab="Within groups sum of squares")

As you can see, the within-groups sum of squares drops significantly up to 4 clusters. After that there is no real improvement.
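If you prefer to stay in Python, the same "elbow" check can be sketched with the standard library alone. This is not the R code above: `kmeans_1d` and `elbow_curve` are hypothetical helpers, the algorithm is plain Lloyd's k-means on one-dimensional data with a deterministic start (R's `kmeans` uses a different default algorithm and random starts), and the sample times are made up.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on 1-D data; returns (centers, within-groups SS)."""
    data = sorted(values)
    n = len(data)
    # Deterministic start: spread the initial centers over the sorted data.
    centers = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[nearest].append(x)
        # Keep the old center if a group ends up empty.
        updated = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    wss = sum((x - centers[i]) ** 2 for i, g in enumerate(groups) for x in g)
    return centers, wss


def elbow_curve(values, max_k):
    """Within-groups sum of squares for k = 1..max_k."""
    return [kmeans_1d(values, k)[1] for k in range(1, max_k + 1)]


if __name__ == "__main__":
    # Made-up processing times (seconds) with three obvious groups.
    took = [0.03, 0.05, 0.04, 0.06, 5.1, 5.3, 4.9, 30.2, 31.0, 29.5]
    for k, wss in enumerate(elbow_curve(took, 5), start=1):
        print(k, round(wss, 3))
```

The number of clusters to pick is the point where the printed values stop dropping sharply.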

Clustering

The next step was to figure out how the data points get distributed between the clusters:

                            

    ## K-Means Clustering with 4 clusters
    fit <- kmeans(mydata, 4)

    ## Cluster Plot against 1st 2 principal components
    ## vary parameters for most readable graph
    library(cluster)
    clusplot(mydata, fit$cluster, color=TRUE,
        shade=TRUE, labels=2, lines=0)

You can already anticipate the problem: the clusters form by splitting off the datapoints that are far away, while the interval we’re interested in (1-20 sec) sits right in the middle. The issue persists as the number of clusters increases.

Moreover, there is a significant overlap between the clusters in that interval:

                            

    ## get clusters mean, min, max
    mean <- aggregate(mydata,by=list(fit$cluster),FUN=mean)
    min <- aggregate(mydata,by=list(fit$cluster),FUN=min)
    max <- aggregate(mydata,by=list(fit$cluster),FUN=max)

Compare clusters 1 and 4.

At this point, we decided to try a different approach and look at the percentiles for message processing time:

                            

    percentiles <- quantile(messages$took, seq(0.5, 0.99, 0.01))
    plot(seq(0.5, 0.99, 0.01), percentiles)

As you can see, after the 78th percentile the processing time climbs quickly. That gave us our first threshold: the 78th percentile, which corresponded to 6.5 seconds.

All datapoints that took less than 6.5 sec were marked as “fast” and others as “slow”.
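The labelling step itself is not shown in the R snippets, so here is a minimal Python sketch of how it might look. The helper names are hypothetical. `statistics.quantiles` with `method="inclusive"` interpolates linearly between data points, which is the same rule as the default of R's `quantile`. The resulting labels play the role of the `cls` column used in the classification code.

```python
import statistics


def percentile_threshold(times, pct=78):
    """Return the pct-th percentile of `times` (linear interpolation)."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(times, n=100, method="inclusive")
    return cuts[pct - 1]


def label(times, threshold):
    """Mark each processing time as 'fast' or 'slow'."""
    return ["fast" if t < threshold else "slow" for t in times]


if __name__ == "__main__":
    # Made-up processing times; in the post the threshold came out at 6.5 s.
    times = [0.04, 0.05, 0.03, 0.2, 0.1, 1.5, 7.0, 12.0, 34.3, 0.06]
    threshold = percentile_threshold(times)
    print(threshold, label(times, threshold))
```

Choosing the percentile is still a judgment call: the plot tells you where the curve bends, and the code only turns that choice into labels.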

Classification

For classification, we tried SVM (Support Vector Machines), Random Forests, and CART (Classification And Regression Tree).

CART showed slightly better results for this task, but its main advantage is that it gives you a decision tree that is easy to understand, explain, and implement. SVM and Random Forests, by contrast, work like a black box and require heavy ML libraries in production.

Here’s how you classify using CART:

                            

    x <- subset(messages, select=-took)
    library(rpart)

    # grow tree
    # (use the bare column name cls so that "." does not also
    # pick up the label itself as a predictor)
    fit <- rpart(cls ~ ., method="class", data=x)

    ## display the results
    printcp(fit)

    # detailed summary of splits
    summary(fit)

    # plot tree
    plot(fit, uniform=TRUE,
        main="Classification Tree for message processing time")
    text(fit, use.n=TRUE, all=TRUE, cex=.8)

Here’s the decision tree:

Classification tree for message processing time
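To get a feel for what a single split in such a tree does, here is a rough Python sketch of a one-level tree (a "decision stump"). It tries every possible threshold on one feature and keeps the one with the lowest weighted Gini impurity, which is the criterion CART-style trees commonly use for classification. The function names and the sample tag counts are made up for illustration; a real library does considerably more (multiple features, pruning, cross-validation).

```python
def gini(labels):
    """Gini impurity of a list of binary labels ('fast' / 'slow')."""
    if not labels:
        return 0.0
    p_slow = labels.count("slow") / len(labels)
    return 2 * p_slow * (1 - p_slow)


def best_split(values, labels):
    """Find the threshold on one feature that minimizes weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            # Place the threshold halfway between neighbouring values.
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold, best_score


if __name__ == "__main__":
    # Made-up tag counts and labels, not the real dataset.
    tags = [100, 400, 500, 900, 60000, 85000]
    cls = ["fast", "fast", "fast", "fast", "slow", "slow"]
    print(best_split(tags, cls))
```

A score of 0 means the threshold separates the two classes perfectly on this toy data; real data will rarely be that clean.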

And here’s what the implementation looks like:

                            

    def html_too_big(s):
        return s.count('<') > _MAX_TAGS_COUNT

Isn’t it beautiful? All the research complexity in a single line!

Validation

For evaluation we took the validation dataset and tried to predict the classification using our model:

                            

    validate <- read.csv(file="messages080916.csv", sep=",", head=TRUE)
    validate <- subset(validate, select=-took)
    xv <- subset(validate, select=-cls)
    y <- predict(fit, xv, type="class")
    table(validate$cls, y)

Here’s the confusion matrix and some common classification metrics:
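If you want to compute such metrics yourself, the arithmetic is short. The Python sketch below uses hypothetical helper names and made-up labels (not our validation results) and treats “slow” as the class of interest.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs, similar to R's table(actual, predicted)."""
    return Counter(zip(actual, predicted))


def metrics(matrix, positive="slow"):
    """Accuracy, precision and recall, treating `positive` as the class of interest."""
    tp = sum(n for (a, p), n in matrix.items() if a == positive and p == positive)
    fp = sum(n for (a, p), n in matrix.items() if a != positive and p == positive)
    fn = sum(n for (a, p), n in matrix.items() if a == positive and p != positive)
    total = sum(matrix.values())
    tn = total - tp - fp - fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


if __name__ == "__main__":
    # Made-up labels for illustration only.
    actual = ["fast", "fast", "fast", "slow", "slow", "fast"]
    predicted = ["fast", "fast", "slow", "slow", "fast", "fast"]
    print(metrics(confusion_matrix(actual, predicted)))
```

Precision tells you how many of the messages flagged as slow really were slow; recall tells you how many of the truly slow messages were caught. For a safety limit like this one, recall is usually the number to watch.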

Lessons Learned

Machine learning is not just for data scientists. Even simple decisions powered by ML can benefit you. Know your data. Do not rely blindly on scientific algorithms and models.

Useful links

Great explanation of a train-validate-test workflow
Quick guide on CART and Random Forests in R
Quick guide on Cluster Analysis in R
How to plot in R
Some common classification metrics
Precision and Recall explained

Happy machine learning and data mining!

Author: Sergey Obukhov Software Developer at Sinch Mailgun

