Product

Open sourcing our email signature parsing library

Back in 2011 we had several customers ask us for a high level message parsing API that they could use to strip signatures and quotes from an email like you see below:

PUBLISHED ON

PUBLISHED ON

Back in 2011, we had several customers ask us for a high level message parsing API that they could use to strip signatures and quotes from an email like you see below:

Table of content

Mockup of an email conversation

The problem

While simple for humans, this is actually quite a challenging task for machines. One of the main reasons is because there is no standard format for an email message. Different email clients compose replies in different manners and even within the same email client, the sender can change the format to whatever they choose. For example, users can place their reply after quoting the original message (bottom-posting):

At 10.01am Wednesday, Danny wrote: > By the way, which systems will be updated? I had some network > problems after last week's update. Will I have to reboot? No, you won't have to reboot.

before the quoted original message (top-posting):

No, you won't have to reboot. -------- Original Message -------- From: Danny <danny@example.com> Sent: Tuesday, October 16, 2007 10:01 AM To: Jim <jim@example.com> Subject: RE: Job By the way, which systems will be updated? I had some network problems after last week's update. Will I have to reboot?

or even interleave their reply:

> Can you present your report an hour later? Yes I can. The summary will be sent no later than 5pm. Jim At 10.01am Wednesday, Danny wrote: >> 2.00pm: Present report > Jim, I have a meeting at that time. Can you present your report an hour later?

In fact, there are so many different ways to reply, there is even a Wikipedia article about it! All of this makes parsing the body of an email a challenging task.

Even with machine learning, we had to constantly adjust things. Email formatting is constantly changing, phone email clients are introducing new signatures like “Sent from your XXX phone”, new edge cases are discovered, etc.

Here is a simple example. Back in the day, all email signatures were separated with dashes:

Mockup of email signature separated by dashes

So the first thing that comes to mind is to write a regex to detect dashes as a signature splitter and extract lines after it as a signature:

>>> signature = regex.match("^[s]*--*[s]*[a-z .]*$).*", message)

But the next thing you know you get an email like this:

Dashes used in a list included in an email

And your parser strips off the most important part of the email. It’s a very simple example and you could easily work around it. But in real life, things get much more complicated and tricky.

Our Solution

We did a lot of research, looked at all the variations of email that passes through Mailgun and came up with a solution based on some machine learning techniques. The solution has been in production for several years now, undergoing bug fixes and enhancements. Overall we have received positive feedback from customers, though naturally, developers tend to point out where you could improve.

So now you all have the chance to help improve the solution. Because of the constantly changing and distributed landscape of email, we’ve decided to tackle this problem with a distributed solution: we’re open sourcing our library so we can hack on this together!

We’re calling our new library talon after a multipurpose robot designed to perform missions ranging from reconnaissance to combat and operate in a number of hostile environments.

In case you want to start testing it right away, we’ve prepared a simple Demo app and a QuickStart Guide for you. Otherwise read on for a more general overview, approaches we took, and assessment results.

Here’s how most common workflows look like:

A workflow chart for common workflows

Currently, we use machine learning only to classify signature lines. The rest of the library are various heuristics and sanity checks we came up with while working on support tickets and analyzing message formatting patterns/trends.

The machine learning part of the library is inspired by the following research papers:

To classify signature lines we used SVM with Linear Kernel. To assess our classifiers we used 5-fold cross-validation:

Code for cross validation

The dataset consisted of 2912 email lines. Out of 1030 signature lines 954 were classified correctly. Out of 1882 non-signature lines, 147 were mistaken for signature. Overall it gives us 92% success rate and 78% area under the ROC curve. Which could be regarded as excellent and fair correspondingly.

When we modified the library for outsourcing, we tried to provide a sturdy skeleton while making it easy to add more meat. From experience, the parts that could use the most focus are the regexps for quotations / signature separators and HTML quotations extraction by HTML tags. However, you are certainly welcome to contribute to any part of the library you like.

We hope that you’ll find the library useful and it makes your life easier.

Happy Sending!

Sign Up

It's easy to get started. And it's free.

See what you can accomplish with the world’s best email delivery platform.

Related readings

Agape charity finds easy way to forward email for volunteers worldwide

I volunteer part time at the Agape Community Care Association (www.agape.org), an NGO that helps orphans and other...

Read more

How Kanban2go fulfills Zawinski’s Law

I don’t know if that is true, but kanban2go took a step in that direction recently with the...

Read more

Python tutorial: How ImgPage lets you upload 25 MB photos to S3 and Cloud files with just an email

This post is written by Paul Finn. Paul is a Portsmouth, NH-based Python developer, business owner (http://route3software.com/) and...

Read more

Popular posts

Email inbox.

Build Laravel 10 email authentication with Mailgun and Digital Ocean

When it was first released, Laravel version 5.7 added a new capability to verify user’s emails. If you’ve ever run php artisan make:auth within a Laravel app you’ll know the...

Read more

Mailgun statistics.

Sending email using the Mailgun PHP API

It’s been a while since the Mailgun PHP SDK came around, and we’ve seen lots of changes: new functionalities, new integrations built on top, new API endpoints…yet the core of PHP...

Read more

Statistics on deliverability.

Here’s everything you need to know about DNS blocklists

The word “blocklist” can almost seem like something out of a movie – a little dramatic, silly, and a little unreal. Unfortunately, in the real world, blocklists are definitely something you...

Read more

See what you can accomplish with the world's best email delivery platform. It's easy to get started.Let's get sending
CTA icon