At Mailgun, we have numerous systems generating a ton of events every hour of the day. The volume is so large that no team of people could sort through the ElasticSearch results by hand and expect consistent answers, or keep their sanity while trying.
Like many companies, we still need to find a way to deal with all the data. This makes for a substantial task: create a system that allows our internal customers to react to events in real-time.
Traditionally, options for solving such a problem include:
We pride ourselves on doing a lot with very little, so we set out to see how we could make real-time filters work for us. We focused on building a system that allows essentially anyone in the engineering and support teams to write rules that react and perform actions on their behalf.
When designing this system, we tried to build something that would give everyone the tools to rapidly iterate and respond to threats to our system, and it had to conform to the following rules:
We use Kibana internally, so something that could be tested against it to fulfill point number 3 was ideal. This leads us to a straightforward conclusion: we should write a Lucene-inspired DSL parser!
The first iteration of the system relied on regular expressions and lots of character-by-character parsing. It got the job done, but subsequent iterations rapidly grew in complexity. If you’ve ever done any large string parsing project, you know what we mean.
Moving forward, we knew we had to build something that could be iterated upon quickly, and ideally, conformed to a known standard.
Those of us who have taken college courses in Automata Theory and the like probably remember grammars in the context of programming languages. Some readers will have far more expertise on the subject than your humble author, so for any truly in-depth explanation, please consult your friendly neighborhood search engine.
It’s still useful to talk briefly about what makes a grammar and why they’re useful. To quote Wikipedia, a Formal Grammar is “…a set of production rules for strings in a formal language. The rules describe how to form strings from the language’s alphabet that are valid according to the language’s syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form.”
At the risk of oversimplifying, it lets us take a string like “You’ve got mail!” and break it down into tokens that we can semantically analyze. In other words, if our grammar is well defined, we can then write additional code to determine the meaning of that string.
The history and types of grammars can fill tons of books and are thus well beyond the scope of a blog post. Instead, we’ll zero in on context-free grammars and more specifically, parsing expression grammars.
Quoting Wikipedia again, a Context-Free Grammar is “a certain type of formal grammar: a set of production rules that describe all possible strings in a given formal language. Production rules are simple replacements.”
To put it another way, a context-free grammar defines a graph which in turn defines how to parse a language. As mentioned above, it does not tell us how to understand the language. It's also worth mentioning, as we'll see below, that the graph may contain cycles, but parsing eventually terminates.
Context-free grammars have a cousin of sorts in the Parsing Expression Grammar, or PEG. They look a lot like a context-free grammar, or CFG, but come with a very useful distinction: when confronted with an ambiguous choice, the PEG will always choose the first production rule defined. The CFG instead leaves the choice ambiguous, and why bother with ambiguity when we don’t have to?
What this really means is that it’s easier to write tools to define PEGs. Fortunately, many people have already done that for us.
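The grammar we build below doesn't need alternation at all, but a tiny made-up rule (not part of our grammar) shows the difference in behavior:

Keyword <- "foo" / "foobar"

In a PEG, / is an ordered choice: "foo" is tried first, and once it matches, the parser commits to it. Any input that "foobar" could match also starts with "foo", so the "foobar" alternative can never succeed. A CFG with the same two alternatives would happily accept both strings; the PEG trades that ambiguity for predictability, which is exactly what makes it easy to generate parsers from.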
After some research, we settled on Pigeon as the basis of our PEG. Pigeon follows the same paradigm as a lot of Go tools by generating code that compiles alongside your program.
Using it is easy enough: You define both the grammar and the Go code to handle each of the rules in the same package, and simply call Parse() on the generated parser. The challenge is to write handlers for each of those rules. Like grammars, the topic of compilation could fill a library with books. Fortunately for us, writing an interpreter is less complex and the resulting code is sufficiently fast for our needs.
Furthermore, the grammar syntax aligns closely with Go's own, which makes it easier to write your PEG alongside the Go program that uses it.
Say you have an event generated by sending an email, encoded in JSON, that looks like the following:
{
    "event": "sent",
    "subject": "A special offer just for you!",
    "account": "12345"
}
There are a couple of ways to look at this event: you could decide this is an innocuous email, or you could say “This might be spam, but there’s not enough information to decide”. This is a very common problem in our system, and always requires lots of aggregated data to come to a conclusion.
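If you want to poke at events like this yourself, here is a minimal, self-contained sketch of loading one into the flat string map we match rules against later in the post. The Event name mirrors the type we define further down; everything else is purely illustrative.

package main

import (
    "encoding/json"
    "fmt"
)

// Event is a flat map of string fields, the same shape we match rules against later.
type Event map[string]string

func main() {
    raw := []byte(`{
        "event": "sent",
        "subject": "A special offer just for you!",
        "account": "12345"
    }`)

    var evt Event
    if err := json.Unmarshal(raw, &evt); err != nil {
        fmt.Println("decode failed:", err)
        return
    }
    fmt.Println(evt["event"], evt["subject"])
}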
So, let’s write a rule that will match the above event:
event:"sent" AND subject:"A special offer just for you!"
This is fairly easy to parse, and better yet, it can also be checked in Kibana!
Now let's walk through constructing a very simple PEG that would match this rule. We'll implement each production rule one at a time and explain what it does. Rule names typically follow PascalCase and may also contain numbers and underscores. Each rule takes the form:
RuleName <- <rule definition>
Note that there are actually several legal characters for the rule definition operator, but we like <- as it's easy to type and looks most like grammars as they're defined academically.
The first rule in the grammar is treated as the entrypoint:
Input <- Term !.
Our Input rule says "match the entire input string", with !. meaning "match the end of the file". This passes the entire input string to a rule named Term, which looks like the following:
Term <- Variable AndVar*
This passes control-flow to the Variable production rule:
Variable <- FieldChars+ _ ":" _ Value
Now we're getting a little more complex. First, we greedily match characters for as long as the FieldChars rule is satisfied. Note the + at the end: PEG syntax shares a lot in common with regular expressions, so this says "match the rule FieldChars one or more times".
FieldChars <- [a-z]
Again, if you're familiar with regex syntax this should be straightforward: simply match any single character between "a" and "z".
Now back to Variable. The next part, _, is a rule saying "match any whitespace characters". Note that _ is a completely valid identifier for a production rule, and less typing is usually a good thing.
_ "whitespace" <- [ \n\t\r]*
There are a couple of things to note here: the quoted "whitespace" after the rule name is just a human-readable label that shows up in parse error messages, and the trailing * means "zero or more", so the whitespace is entirely optional.
The next part of the Variable rule matches the string literal ":". It doesn't get much easier than that.
Next, we have another whitespace rule, and then finally the Value rule.
Value <- '"' ValueChars* '"'
This rule matches the string literal '"', then zero or more ValueChars rules, and lastly another quote.
ValueChars looks like the following:
ValueChars <- [a-zA-Z0-9 !]
This ValueChars rule will match the lowercase and uppercase alphabet, any number, spaces, and the exclamation point.
Why did we define Value this way? Because the rule can strip the double quotes off for us so we don't have to, plus we're lazy. It's merely a convenience, though; we could have omitted it and inlined the body of Value directly into the Variable rule definition.
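For the curious, the inlined form might look roughly like this (a sketch, not part of the final grammar); the action for Variable would then have to strip the quotes itself:

Variable <- FieldChars+ _ ":" _ '"' ValueChars* '"'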
Finally, AndVar should look fairly straightforward at this point. Note that it refers back to Term and implements the aforementioned cycle.
AndVar <- _ "AND" _ Term
The complete definition looks like the following.
Input <- Term !.
Term <- Variable AndVar*
AndVar <- _ "AND" _ Term
Variable <- FieldChars+ _ ":" _ Value
_ "whitespace" <- [ \n\t\r]*
Value <- '"' ValueChars* '"'
FieldChars <- [a-z]
ValueChars <- [a-zA-Z0-9 !]
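To make that concrete, here are a few inputs of our own and how this grammar treats them:

event:"sent" (accepted: a single Variable with no AND)
event:"sent" AND account:"12345" (accepted: a Variable followed by one AndVar)
event:sent (rejected: Value must be wrapped in double quotes)
Event:"sent" (rejected: FieldChars only allows lowercase a-z)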
The real value of defining your own grammar comes when you provide implementations for the actions of the rules. Let's look at the real PEG definition:
{
package main

import (
    "fmt"
    "strings"
)

// Node is the interface implemented by every AST node our rule actions produce.
type Node interface {
    Evaluate(input Event) (bool, error)
}

type Term struct {
    node Node
}

func (t *Term) Evaluate(input Event) (bool, error) {
    return t.node.Evaluate(input)
}

// AndNode evaluates to true only if every child node evaluates to true.
type AndNode struct {
    Nodes []Node
}

func (n *AndNode) Evaluate(input Event) (bool, error) {
    for _, node := range n.Nodes {
        matched, err := node.Evaluate(input)
        if err != nil {
            return false, err
        }
        if !matched {
            return false, nil
        }
    }
    return true, nil
}

// Variable compares a single field in the event against a literal value.
type Variable struct {
    Field string
    Value string
}

func (v *Variable) Evaluate(input Event) (bool, error) {
    fieldVal, ok := input[v.Field]
    if !ok {
        return false, fmt.Errorf("Field '%s' not present in the event", v.Field)
    }
    return fieldVal == v.Value, nil
}

// toString flattens pigeon's nested []interface{} / []byte output into a plain string.
func toString(label interface{}) string {
    var sb strings.Builder
    value := label.([]interface{})
    for _, i := range value {
        if i == nil {
            continue
        }
        switch b := i.(type) {
        case []byte:
            sb.WriteByte(b[0])
        case []interface{}:
            s := toString(i)
            sb.WriteString(s)
        default:
            fmt.Printf("She's dead, Jim %T %+v\n", i, i)
        }
    }
    return sb.String()
}
}
Input <- Term !.
Term <- variable:Variable rest:AndVar* {
    andVars := rest.([]interface{})
    variables := make([]Node, 0, len(andVars))
    variables = append(variables, variable.(Node))
    for _, r := range andVars {
        variables = append(variables, r.(Node))
    }
    return &AndNode{Nodes: variables}, nil
}

AndVar <- _ "AND" _ rightSide:Term {
    return rightSide, nil
}

Variable <- field:FieldChars+ _ ":" _ value:Value {
    return &Variable{Field: toString(field), Value: toString(value)}, nil
}

Value <- '"' value:ValueChars* '"' {
    return value, nil
}
_ "whitespace" <- [ \n\t\r]*
FieldChars <- [a-z]
ValueChars <- [a-zA-Z0-9 !]
There’s a lot to unpack here so let’s step through it.
At the top, we have a section of code surrounded by curly braces. This is a Pigeon convention: the enclosed code is injected verbatim into the generated output. Note that you can define types relevant to your grammar here for convenience, or you can put them in another file, since Pigeon will use the package name you declare here. Everything after that is the definition of the grammar itself.
At this point, you've probably noticed that the rules differ a bit from how we defined them above. Let's take a look at Term.
Term <- variable:Variable rest:AndVar*
This is nearly our Term definition from before, but now we've made it useful. Prefixing a rule in the definition with name: assigns the return value of that rule to name, which is then passed to the function defined by your rule action when the code is generated. All of the code between the curly braces becomes a function associated with that rule and is called when the rule is matched:
func (c *current) onTerm1(variable, rest interface{}) (interface{}, error) {
    andVars := rest.([]interface{})
    variables := make([]Node, 0, len(andVars))
    variables = append(variables, variable.(Node))
    for _, r := range andVars {
        variables = append(variables, r.(Node))
    }
    return &AndNode{Nodes: variables}, nil
}
Here we can see that Pigeon deals entirely in empty interface types. It’s great for flexibility but less great for writing the implementation as we have to account for them.
andVars := rest.([]interface{})
variables := make([]Node, 0, len(andVars))
variables = append(variables, variable.(Node))
for _, r := range andVars {
    variables = append(variables, r.(Node))
}
return &AndNode{Nodes: variables}, nil
As mentioned previously, we can have cycles in rules. Given this, we know that it's possible to receive an arbitrary number of Terms mapped to the rest variable. Pigeon deals with this by handing us a slice of empty interfaces, so we've created a Node interface that everything has to be type-asserted to before we can use it.
After walking the entire list of AND'd variables, we create an AndNode containing each of the Terms. The definition of AndNode can be found in the code literal section at the top of the PEG. As an implementer of the Node interface, it defines an Evaluate method which says the expression evaluates to true only if all of the individual terms also evaluate to true.
At this point the rest of the definition should hopefully make sense, so let’s skip to running the thing. The following code listing shows parsing our rule, using it, and a couple of extra examples to show failures and error handling:
package main

import (
    "fmt"
)

type Event map[string]string

func Match(rule string, event Event) (bool, error) {
    // Parse is generated by pigeon in this package
    tree, err := Parse("parsing", []byte(rule))
    if err != nil {
        return false, err
    }

    // Yes, this is a bit ugly. It's a consequence of pigeon dealing in interface slices
    iface, ok := tree.([]interface{})
    if !ok {
        return false, fmt.Errorf("Internal Error")
    }
    ast, ok := iface[0].(Node)
    if !ok {
        return false, fmt.Errorf("Internal Error")
    }

    // ast is the tree generated by our supplementary Go defined by the production rules
    return ast.Evaluate(event)
}

func main() {
    rules := []string{
        "event:\"sent\" AND subject:\"A special offer just for you!\"",
        "event:\"found\" AND subject:\"A special offer just for you!\"",
        "event:\"sent\" AND badfield:\"A special offer just for you!\"",
    }

    event := map[string]string{
        "event":   "sent",
        "subject": "A special offer just for you!",
        "account": "12345",
    }

    for _, rule := range rules {
        matched, err := Match(rule, Event(event))
        if err != nil {
            fmt.Printf("Rule '%s' failed with error '%s'\n", rule, err.Error())
        } else {
            fmt.Printf("Rule '%s' matched: %t\n", rule, matched)
        }
    }
}
All you have to do is run:
~> go get github.com/mna/pigeon
~> pigeon rules.peg > rules.go
~> go run .
Rule 'event:"sent" AND subject:"A special offer just for you!"' matched: true
Rule 'event:"found" AND subject:"A special offer just for you!"' matched: false
Rule 'event:"sent" AND badfield:"A special offer just for you!"' failed with error 'Field 'badfield' not present in the event'
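If you plan to grow the grammar beyond this toy, it can also be worth pinning the behavior down with a small table-driven test in the same package. This is our own sketch rather than part of the original setup; it relies only on the Match and Event definitions above and would live in something like a rules_test.go file:

package main

import "testing"

func TestMatch(t *testing.T) {
    event := Event{
        "event":   "sent",
        "subject": "A special offer just for you!",
        "account": "12345",
    }

    cases := []struct {
        rule    string
        want    bool
        wantErr bool
    }{
        {`event:"sent" AND subject:"A special offer just for you!"`, true, false},
        {`event:"found" AND subject:"A special offer just for you!"`, false, false},
        {`event:"sent" AND badfield:"A special offer just for you!"`, false, true},
    }

    for _, c := range cases {
        matched, err := Match(c.rule, event)
        if (err != nil) != c.wantErr {
            t.Fatalf("Match(%q) error = %v, wantErr %v", c.rule, err, c.wantErr)
        }
        if matched != c.want {
            t.Errorf("Match(%q) = %t, want %t", c.rule, matched, c.want)
        }
    }
}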
Congratulations! You’ve now written your own language of extremely limited utility!
At this point you may be thinking:
To which we would respond:
For one, a real rule needs some way to rate-limit on a field, such as account as defined in the original event. Mailgun solves this problem by including business logic in the codebase to support rate-limiting based on a defined field. In other words, we can say "If this rule matches, save the account and current time and don't perform this action against this account for the next N minutes." Obviously, if we didn't have this handling and the sender in question sends one million emails, support would be notified one million times. That'd be a great way to drown your support team in noise and also get jumped in the parking lot after work.
Lastly, you likely noticed we didn't define what the rule would do here when it does match. One possible solution is to have the rule post to a Slack channel when it's triggered. This is a common solution here at Mailgun, and looks like the following:
slack:#channel-name
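We won't reproduce our action parser here, but splitting an action like this into its kind and target is a small job on its own. The following sketch is purely illustrative (parseAction is our name for it, not something from the codebase):

package main

import (
    "fmt"
    "strings"
)

// parseAction splits an action such as "slack:#channel-name" into its kind
// ("slack") and target ("#channel-name").
func parseAction(action string) (kind, target string, err error) {
    parts := strings.SplitN(action, ":", 2)
    if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
        return "", "", fmt.Errorf("malformed action: %q", action)
    }
    return parts[0], parts[1], nil
}

func main() {
    kind, target, err := parseAction("slack:#channel-name")
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("post to %s via %s\n", target, kind)
}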
Hopefully, at this point, your mind is spinning with possibilities. Here are some thoughts we’ve had:
There are loads more you can do, especially when it comes to fleshing out the action parser. You could call other services, notify customers, turn the lights on and off, and more. When it comes to inventing your own tools based on this, the sky’s the limit.
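As one concrete sketch on the rule-grammar side (not something the grammar above implements), OR support is usually layered in so that AND binds tighter than OR:

Expr    <- OrTerm
OrTerm  <- AndTerm ( _ "OR" _ AndTerm )*
AndTerm <- Variable ( _ "AND" _ Variable )*

The actions would then build an OrNode alongside AndNode, with an Evaluate method that returns true as soon as any child matches.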
So at this point, you might be asking yourself: “why didn’t they use $SOME_OTHER_TOOL?”
The answer is because we don’t yet need the power of any of those tools nor the complexity that frequently comes with operating them. Our current stream processing tool is a single, simple binary deployed in a container that Just Works™. There’s very little to manage, and it’s readily keeping up with our very high volume event feed.
We may eventually decide that $SOME_OTHER_TOOL is more suitable for our needs, but for the foreseeable future, the power our Mailgun DSL brings to the table is more than sufficient.
At this point, we hope we’ve demonstrated how writing your own DSL could be useful. Our implementation processes an enormous number of events per day while making the lives of many of our team members much easier (at least that’s what we tell them).
As for the features it supports, we’re just getting started. If this has inspired you to look into implementing your own stream processing grammar, please let us know. We’d love to hear about it!