Open sourcing our email signature parsing library

July 16, 2019

Back in 2011, we had several customers ask us for a high level message parsing API that they could use to strip signatures and quotes from an email like you see below:

The problem

While simple for humans, this is actually quite a challenging task for machines. One of the main reasons is that there is no standard format for an email message. Different email clients compose replies in different ways, and even within the same email client, the sender can change the format to whatever they choose. For example, users can place their reply after quoting the original message (bottom-posting):

    At 10.01am Wednesday, Danny wrote:
    > By the way, which systems will be updated? I had some network
    > problems after last week's update. Will I have to reboot?

    No, you won't have to reboot.

before the quoted original message (top-posting):

    No, you won't have to reboot.

    -------- Original Message --------
    From: Danny
    Sent: Tuesday, October 16, 2007 10:01 AM
    To: Jim
    Subject: RE: Job

    By the way, which systems will be updated? I had some network
    problems after last week's update. Will I have to reboot?

or even interleave their reply:

    > Can you present your report an hour later?

    Yes I can. The summary will be sent no later than 5pm.

    Jim

    At 10.01am Wednesday, Danny wrote:
    >> 2.00pm: Present report
    > Jim, I have a meeting at that time. Can you present your report an hour later?

In fact, there are so many different ways to reply that there is even a Wikipedia article about it! All of this makes parsing the body of an email a challenging task.
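To make the difficulty concrete, here is a minimal sketch of a naive quote stripper. It assumes quoted lines start with ">" and attribution lines end with "wrote:". The function name `strip_quotes_naive` and both rules are illustrative assumptions, not how talon works. The sketch handles the bottom-posted reply but leaves a top-posted "Original Message" block untouched.

```python
import re

# Hypothetical rules: an attribution line ("... wrote:") or a ">"-prefixed line.
ATTRIBUTION = re.compile(r"^.*\bwrote:\s*$")
QUOTED = re.compile(r"^\s*>")


def strip_quotes_naive(body):
    """Drop attribution lines and '>'-quoted lines; keep everything else."""
    kept = [
        line
        for line in body.splitlines()
        if not ATTRIBUTION.match(line) and not QUOTED.match(line)
    ]
    return "\n".join(kept).strip()


bottom_posted = (
    "At 10.01am Wednesday, Danny wrote:\n"
    "> By the way, which systems will be updated? I had some network\n"
    "> problems after last week's update. Will I have to reboot?\n"
    "\n"
    "No, you won't have to reboot.\n"
)

top_posted = (
    "No, you won't have to reboot.\n"
    "\n"
    "-------- Original Message --------\n"
    "From: Danny\n"
    "Subject: RE: Job\n"
    "\n"
    "Will I have to reboot?\n"
)

# The bottom-posted reply is reduced to the new text only.
print(strip_quotes_naive(bottom_posted))
# The top-posted message keeps its quoted block, because nothing in it
# starts with ">" or ends with "wrote:".
print("Original Message" in strip_quotes_naive(top_posted))
```

Each reply style needs its own rules, and every new rule risks misfiring on ordinary text, which is why simple line filters do not scale.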

Even with machine learning, we had to adjust things constantly: email formatting keeps changing, mobile email clients introduce new signatures like “Sent from your XXX phone”, new edge cases are discovered, and so on.

Here is a simple example. Back in the day, all email signatures were separated with dashes:

[Image: mockup of an email signature separated by dashes]

So the first thing that comes to mind is to write a regex to detect dashes as a signature splitter and extract lines after it as a signature:

>>> import re
>>> signature = re.search(r"^[\s]*--*[\s]*[a-z .]*$.*", message, re.I | re.M | re.S)

But the next thing you know you get an email like this:

[Image: dashes used in a list included in an email]

And your parser strips off the most important part of the email. It’s a very simple example and you could easily work around it. But in real life, things get much more complicated and tricky.
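The failure is easy to reproduce. The sketch below applies a dash-separator pattern in the spirit of the regex above to two messages: one with a regular signature and one that uses dashes for a list. The exact pattern, the helper name `split_signature_naive`, and the sample messages are assumptions made for illustration, not talon's implementation.

```python
import re

# Naive rule: a line that starts with dashes, optionally followed by
# letters, spaces, or dots, marks the start of the signature.
DASH_SEPARATOR = re.compile(r"^[ \t]*-+[ \t]*[a-z .]*$", re.IGNORECASE | re.MULTILINE)


def split_signature_naive(message):
    """Return (body, signature), splitting at the first dash-led line."""
    match = DASH_SEPARATOR.search(message)
    if match is None:
        return message, None
    body = message[: match.start()].rstrip()
    signature = message[match.start():].strip()
    return body, signature


signed = "Thanks for the update!\n--\nJohn Doe\nAcme Inc."
with_list = (
    "Before the launch we still need to:\n"
    "- update the DNS records\n"
    "- rotate the API keys\n"
    "Can you take the second one?"
)

# Works as intended: the signature block is split off.
print(split_signature_naive(signed))
# Fails: the list items and the question after them are treated as a signature.
print(split_signature_naive(with_list))
```

In the second message, the body shrinks to its first line and the actual request ends up in the "signature".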

Our solution

We did a lot of research, looked at all the variations of email that pass through Mailgun, and came up with a solution based on some machine learning techniques. The solution has been in production for several years now, undergoing bug fixes and enhancements. Overall, we have received positive feedback from customers, though naturally, developers tend to point out where you could improve.

So now you all have the chance to help improve the solution. Because of the constantly changing and distributed landscape of email, we’ve decided to tackle this problem with a distributed solution: we’re open sourcing our library so we can hack on this together!

We’re calling our new library talon after a multipurpose robot designed to perform missions ranging from reconnaissance to combat and operate in a number of hostile environments.

In case you want to start testing it right away, we’ve prepared a simple Demo app and a QuickStart Guide for you. Otherwise read on for a more general overview, approaches we took, and assessment results.

Here’s what the most common workflows look like:

Currently, we use machine learning only to classify signature lines. The rest of the library consists of various heuristics and sanity checks we came up with while working on support tickets and analyzing message formatting patterns and trends.
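As a rough illustration of what per-line classification involves, the sketch below turns a single line of text into a few binary features that a classifier could consume. The feature set, the patterns, and the name `line_features` are hypothetical and chosen only for illustration; they are not taken from talon.

```python
import re

# Hypothetical feature patterns (assumptions, not talon's actual features).
PATTERNS = {
    "has_email": re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),
    "has_url": re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE),
    "has_phone": re.compile(r"\+?\d[\d\s().-]{6,}\d"),
    "is_separator": re.compile(r"^\s*[-_=]{2,}\s*$"),
    "is_sent_from": re.compile(r"^\s*sent from (my|your)\b", re.IGNORECASE),
}


def line_features(line):
    """Map one line of an email body to a dict of binary features."""
    return {name: int(bool(p.search(line))) for name, p in PATTERNS.items()}


for sample in (
    "Can you present your report an hour later?",
    "--",
    "John Doe | +1 (555) 010-9999 | john@example.com",
    "Sent from my iPhone",
):
    print(sample, line_features(sample))
```

A vector like this for each of the last few lines of a message is the kind of input a signature-line classifier would be trained on.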

The machine learning part of the library is inspired by the following research papers:

To classify signature lines, we used an SVM with a linear kernel. To assess our classifiers, we used 5-fold cross-validation:

The dataset consisted of 2912 email lines. Out of 1030 signature lines, 954 were classified correctly. Out of 1882 non-signature lines, 147 were mistaken for signature lines. Overall, this gives us a 92% success rate and an area under the ROC curve of 78%, which could be regarded as excellent and fair, respectively.
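The 92% figure can be checked directly from the counts above. The short calculation below derives the confusion-matrix cells and the overall accuracy. Precision and recall are additional metrics that the post does not report; they are included only for context. The area under the ROC curve cannot be recomputed from these counts alone.

```python
# Counts taken from the text above.
signature_total = 1030
signature_correct = 954          # signature lines classified as signature
non_signature_total = 1882
false_alarms = 147               # non-signature lines classified as signature

true_pos = signature_correct
false_neg = signature_total - signature_correct   # missed signature lines
false_pos = false_alarms
true_neg = non_signature_total - false_alarms     # correctly kept body lines

total = signature_total + non_signature_total
accuracy = (true_pos + true_neg) / total
recall = true_pos / (true_pos + false_neg)
precision = true_pos / (true_pos + false_pos)

print(f"total lines: {total}")
print(f"accuracy:  {accuracy:.3f}")
print(f"recall:    {recall:.3f}")
print(f"precision: {precision:.3f}")
```

The accuracy comes out at roughly 0.92, matching the success rate quoted above.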

When we prepared the library for open sourcing, we tried to provide a sturdy skeleton while making it easy to add more meat. From experience, the parts that could use the most attention are the regexps for quotations and signature separators, and the extraction of HTML quotations by HTML tags. However, you are welcome to contribute to any part of the library you like.

We hope that you’ll find the library useful and that it makes your life easier.

Happy Sending!

Author: Sergey Obukhov, Software Developer at Sinch Mailgun
