How we built a Lucene-inspired parser in Go

Written by Matt Dietz

Categories: For Devs

11 minute read time

At Mailgun we have numerous systems generating a ton of events every hour of the day. It’s a number so large that it’s impossible for a team of people to sort through ElasticSearch results and expect consistent results or for the team to maintain their sanity.

Engineering before the advent of our streaming tool

Like many companies, we still need to find a way to deal with all the data. This makes for a substantial task: create a system that allows our internal customers to react to events in real-time.

Traditionally, options for solving such a problem include:

  • Feeding the data into a database and then running database queries, often on a fixed time interval
  • Map-reduce solutions like Hadoop
  • Mechanical Turk  
  • Real-time filters.

We pride ourselves on doing a lot with very little, so we set out to see how we could make real-time filters work for us. We focused on building a system that allows essentially anyone in the engineering and support teams to write rules that react and perform actions on their behalf.

Brainstorming

When designing this system, we tried to build something that would give everyone the tools to rapidly iterate and respond to threats to our system and had to conform to the following rules:

  1. The users should be able to add their own rules at will
  2. The domain-specific-language henceforth referred to as a DSL, should be as simple as possible and ideally look like something they already know.
  3. People should be able to test their rules on the historical data before applying them.

We use Kibana internally, so something that could be tested against that to fulfill point number 3 was ideal. This leads us to a straight-forward conclusion: we should write a Lucene inspired DSL parser! 

The first iteration of the system relied on regular expressions and lots of character-by-character parsing. It got the job done, but subsequent iterations rapidly grew in complexity. If you’ve ever done any large string parsing project, you know what we mean.

Moving forward, we knew we had to build something that could be iterated upon quickly, and ideally, conformed to a known standard.

Getting Grammatical

Those of us that have taken Automata Theory college courses and the like probably remember grammars in the context of programming languages. Furthermore, it’s given that some readers will have far more expertise than your humble author when it comes to the subject, so any in-depth explanations are to be referred to your friendly neighborhood search engine.

It’s still useful to talk briefly about what makes a grammar and why they’re useful. To quote Wikipedia, a Formal Grammar is “…a set of production rules for strings in a formal language. The rules describe how to form strings from the language’s alphabet that are valid according to the language’s syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form.”

At the risk of oversimplifying, it lets us take a string like “You’ve got mail!” and break it down into tokens that we can semantically analyze. In other words, if our grammar is well defined, we can then write additional code to determine the meaning of that string.

The history and types of grammars can fill tons of books and are thus well beyond the scope of a blog post. Instead, we’ll zero in on context-free grammars and more specifically, parsing expression grammars.

Context-free grammars and parsing expression grammars

Quoting Wikipedia again, a Context-Free Grammar is “a certain type of formal grammar: a set of production rules that describe all possible strings in a given formal language. Production rules are simple replacements.”

To put it another way, a context-free grammar defines a graph which in turn defines how to parse a language. As mentioned above, it does not tell us how to understand the language. It’s also worth mentioning as we’ll see below that the graph may have cycles, but it eventually terminates.

Context-free grammars have a cousin of sorts in the Parsing Expression Grammar, or PEG. They look a lot like a context-free grammar, or CFG, but come with a very useful distinction: when confronted with an ambiguous choice, the PEG will always choose the first production rule defined. The CFG instead leaves the choice ambiguous, and why bother with ambiguity when we don’t have to?

What this really means is that it’s easier to write tools to define PEGs. Fortunately, many people have already done that for us.

Taking Flight

After some research, we settled on Pigeon as the basis of our PEG. Pigeon follows the same paradigm as a lot of Go tools by generating code that compiles alongside your program.

Using it is easy enough: You define both the grammar and the Go code to handle each of the rules in the same package, and simply call Parse() on the generated parser. The challenge is to write handlers for each of those rules. Like grammars, the topic of compilation could fill a library with books. Fortunately for us, writing an interpreter is less complex and the resulting code is sufficiently fast for our needs.

Furthermore, the syntax tries to align itself closely to Go syntax itself, which makes it easier when writing your PEG alongside your Go program utilizing the grammar.

Like a leaf on the wind

Say you have an event generated by sending an email, encoded in JSON, that looks like the following:

{
 "event": "sent",
 "subject": "A special offer just for you!",
 "account": "12345",
}

There are a couple of ways to look at this event: you could decide this is an innocuous email, or you could say “This might be spam, but there’s not enough information to decide”. This is a very common problem in our system, and always requires lots of aggregated data to come to a conclusion.

So, let’s write a rule that will match the above event:

event:"sent" AND subject:"A special offer just for you!"

This is fairly easy to parse, and better yet, it can also be checked in Kibana!

Now let’s walk through constructing a very simple PEG that would match this rule. We’ll implement each production rule one at a time and explain what it does. Rule names typically follow PascalCase as well as allowing numbers and underscores. Each rule takes the form:

RuleName <- <rule definition>

Note that are actually several legal characters for the rule definition operator but we like <- as it’s easy to type and looks most like grammars as they’re defined academically.

The first rule in the grammar is treated as the entrypoint:

Input <- Term !.

Our Input rule says “Match the entire input string” with !. meaning “Match the end of the file”. This will pass whatever the entire input string to a rule named Term which looks like the following:

Term <- Variable AndVar*

This passes control-flow to the Variable production rule:

Variable <-  FieldChars+ _ ":" _ Value

Now we’re getting a little more complex. First, we greedily match all characters until the FieldChars rule is satisfied. Note the + there at the end. PEG syntax shares a lot in common with regular expressions, so this is saying “Match the rule FieldChars one or more times”

FieldChars <- [a-z]

Again, if you’re familiar with regex syntax this should be straight-forward: Simply match any single character between “a” and “z”

Now back to Variable, the next part _ is a rule saying “match any whitespace character”. Note that _ is a completely valid identifier for a production rule, and less typing is usually a good thing.

_ "whitespace" <- [ \n\t\r]*

There are a couple of things to note here:

  1. The string “whitespace” is the “friendly name” and exists for documentation purposes and your future sanity.
  2. As with regular expression syntax, this will match any form of white space, newlines, tabs or carriage-returns.

The next part of the Variable rule matches the string literal “AND”. It doesn’t get much easier than that.

Next, we have another whitespace rule, and then finally the Value rule.

Value <- '"' ValueChars* '"'

This rule matches the string literal '"' , zero or more ValueChars rules and lastly another quote.

ValueChars looks like the following:

ValueChars <- [a-zA-Z0-9 !]

These ValueChars will match the lowercase and uppercase alphabet, any number, spaces, and the exclamation point.

Why did we define Value this way? Because the rule can strip the double-quotes off for us so we don’t have to, plus we’re lazy. However, it’s merely a convenience we could have omitted in favor of including the rule form of Value with the Variable rule definition.

Finally, AndVar should look fairly straightforward at this point. Note that it refers back to Term and implements the aforementioned cycle.

AndVar <- _ "AND" _ Term

The complete definition looks like the following.

Input <- Term !.
Term <- Variable AndVar*
AndVar <- _ "AND" _ Term
Variable <- FieldChars+ _ ":" _ Value
_ "whitespace" <- [ \n\t\r]*
Value <- '"' ValueChars '"'
FieldChars <- [a-z]
ValueChars <- [a-zA-Z0-9 !]

Cool, but what now?

The real-value of defining your own grammar comes when you provide implementations for the actions of the rules. Let’s look at the real PEG definition:

{
 package main
​
 import (
   "strings"
 )
​
 type Node interface {
   Evaluate(input Event) (bool, error)
 }
​
 type Term struct {
   node Node
 }
​
 func (t *Term) Evaluate(input Event) (bool, error) {
   return t.node.Evaluate(input)
 }
​
 type AndNode struct {
   Nodes []Node
 }
​
 func (n *AndNode) Evaluate(input Event) (bool, error) {
   for _, node := range n.Nodes {
     matched, err := node.Evaluate(input)
     if err != nil {
       return false, err
     }
     if !matched {
       return false, nil
     }
   }
   return true, nil
 }
​
 type Variable struct {
   Field string
   Value string
 }
​
 func (v *Variable) Evaluate(input Event) (bool, error) {
   fieldVal, ok := input[v.Field]
   if !ok {
     return false, fmt.Errorf("Field '%s' not present in the event", v.Field)
   }
​
   return fieldVal == v.Value, nil
 }
​
 func toString(label interface{}) string {
   var sb strings.Builder
   value := label.([]interface{})
   for _, i := range(value) {
     if i == nil {
       continue
     }
     switch b := i.(type) {
     case []byte:
       sb.WriteByte(b[0])
     case []interface{}:
       s := toString(i)
       sb.WriteString(s)
     default:
       fmt.Printf("She's dead, Jim %T %+v\n", i, i)
     }
   }
   return sb.String()
 }
}
​
Input <- Term !.
​
Term <- variable:Variable rest:AndVar* {
 andVars := rest.([]interface{})
 variables := make([]Node, 0, len(andVars))
 variables = append(variables, variable.(Node))
 for _, r := range(andVars) {
      variables = append(variables, r.(Node))
   }
   return &AndNode{Nodes: variables}, nil
}
​
AndVar <- _ "AND" _ rightSide:Term {
   return rightSide, nil
}
​
Variable <- field:FieldChars+ _ ":" _ value:Value {
   return &Variable{Field: toString(field), Value: toString(value)}, nil
}
​
Value <- '"' value:ValueChars* '"' {
   return value, nil
}
​
_ "whitespace" <- [ \n\t\r]*
FieldChars <- [a-z]
ValueChars <- [a-zA-Z0-9 !]

There’s a lot to unpack here so let’s step through it.

At the top, we have a section of code surrounded by curly braces. This is a Pigeon convention that takes the enclosed-code verbatim and injects it into the final generated code output. Note that you can define types relevant to your grammar here for convenience, or you can put them in another file as Pigeon will use the package name you declare here. Everything after that is the definition of the grammar itself.

At this point, you’ve probably noticed that the rules differ a bit from how we defined them above. Let’s take a look at Term.

Term <- variable:Variable rest:AndVar*

This is nearly our Term definition from before, but now we’ve made it useful. Prefixing a rule in the definition with name: assigns the return value of that rule to name and then passes name to the function defined by your rule action when the code is generated. All of the code between the curly braces becomes a function associated with that rule and is called when the rule is matched:

func (c *current) onTerm1(variable, rest interface{}) (interface{}, error) {
 andVars := rest.([]interface{})
 variables := make([]Node, 0, len(andVars))
 variables = append(variables, variable.(Node))
 for _, r := range andVars {
   variables = append(variables, r.(Node))
 }
 return &AndNode{Nodes: variables}, nil
}

Here we can see that Pigeon deals entirely in empty interface types. It’s great for flexibility but less great for writing the implementation as we have to account for them.

andVars := rest.([]interface{})
variables := make([]Node, 0, len(andVars))
variables = append(variables, variable.(Node))
for _, r := range andVars {
 variables = append(variables, r.(Node))
}
return &AndNode{Nodes: variables}, nil

As mentioned previously, we can have cycles in rules. Given this, we know that it’s possible to receive an infinite number of Terms mapped to the rest variable. Pigeon deals with this by handing us a slice of empty interfaces. We’ve created a Node interface that everything has to be type-inferred into before we can use them.

After walking the entire list of AND’d variables, we create an AndNode containing each of the Terms. The definition of the AndNode can be found in the code literal section defined at the top of the PEG. As an implementer of the Node interface, It defines an Evaluate method which says the expression evaluates to true only if all of the individual terms also evaluate to true.

At this point the rest of the definition should hopefully make sense, so let’s skip to running the thing. The following code listing shows parsing our rule, using it, and a couple of extra examples to show failures and error handling:

package main
​
import (
 "fmt"
)
​
type Event map[string]string
​
func Match(rule string, event Event) (bool, error) {
 // Parse is generated by pigeon in this package
 tree, err := Parse("parsing", []byte(rule))
 if err != nil {
   return false, err
 }
​
 // Yes, this is a bit ugly. It's a consequence of pigeon dealing in interface slices
 iface, ok := tree.([]interface{})
 if !ok {
   return false, fmt.Errorf("Internal Error")
 }
​
 ast, ok := iface[0].(Node)
 if !ok {
   return false, fmt.Errorf("Internal Error")
 }
​
 // ast is the tree generated by our supplementary Go defined by the production rules
 return ast.Evaluate(event)
}
​

func main() {
  rules := []string{
   "event:\"sent\" AND subject:\"A special offer just for you!\"",
   "event:\"found\" AND subject:\"A special offer just for you!\"",
   "event:\"sent\" AND badfield:\"A special offer just for you!\"",
  }


  event := map[string]string{
    "event":   "sent",
    "subject": "A special offer just for you!",
    "account": "12345",
  }
​
  for _, rule := range rules {
    matched, err := Match(rule, Event(event))
    if err != nil {
      fmt.Printf("Rule '%s' failed with error '%s'\n", rule, err.Error())
    } else {
      fmt.Printf("Rule '%s' matched: %t\n", rule, matched)
    }
  }
}

All you have to do is run:

~> go get github.com/mna/pigeon
 
~> pigeon rules.peg > rules.go
 
~> go run .

Rule 'event:"sent" AND subject:"A special offer just for you!"' matched: true
Rule 'event:"found" AND subject:"A special offer just for you!"' matched: false
Rule 'event:"sent" AND badfield:"A special offer just for you!"' failed with error 'Field 'badfield' not present in the event'

Congratulations! You’ve now written your own language of extremely limited utility!

Considerations

At this point you may be thinking:

  1. Don’t a lot of senders use subjects like that?
  2. Is a rules engine that only matches strings literals all that useful?
  3. Couldn’t this post have been half as long?

To which we would respond:

  1. We would want to implement some kind of handling for the account as defined in the original event. Mailgun solves this problem by including business logic in the codebase to support rate-limiting based on a defined field. In other words, we can say “If this rule matches, save the account and current time and don’t perform this action against this account for the next N minutes.” Obviously, if we didn’t have this handling and the sender in question sends one million emails, support would be notified one million times. That’d be a great way to drown your support team in noise and also get jumped in the parking lot after work.
  2. One could implement rules to parse regular expressions and substrings that would be far more useful than an explicit string match. In fact, we’ve done exactly that.
  3. Sorry.

Lastly, you likely noticed we didn’t define what the rule would do here when it does match. One possible solution is to have the rule post to a Slack channel when it’s triggered. This is a common solution here at Mailgun, and looks like the following:

slack:#channel-name

Next Steps

Hopefully, at this point, your mind is spinning with possibilities. Here are some thoughts we’ve had:

  1. There’s much more to emulate in lucene syntax like NOT and OR statements, and parentheses.
  2. Provide a separate grammar for actions.
  3. Implement templatization to inject values from the event into the action.
  4. Layer on syntactic sugar to make writing some filters easier.
  5. Implement a system to count and react to a specific number of event occurrences over a time period.
  6. Implementing a cache because parsing the rules is somewhat expensive.

There are loads more you can do, especially when it comes to fleshing out the action parser. You could call other services, notify customers, turn the lights on and off, and more. When it comes to inventing your own tools based on this, the sky’s the limit.

KNIFE-WRENCH!

Pros and Cons

So at this point, you might be asking yourself: “why didn’t they use $SOME_OTHER_TOOL?” 

The answer is because we don’t yet need the power of any of those tools nor the complexity that frequently comes with operating them. Our current stream processing tool is a single, simple binary deployed in a container that Just Works™. There’s very little to manage, and it’s readily keeping up with our very high volume event feed.

Pros:
  1. It’s easy to develop
  2. It’s still using off-the-shelf, FOSS tools
  3. It can be tailored to exactly our needs
  4. The syntax for our DSL is tiny and easy to articulate and reason about, and it forces everyone to write rules roughly the same way. It also makes it harder to create unintended side-effects
Cons:
  1. Language features beyond what we’ve already implemented represent a substantial leap in complexity.
  2. Updating the source PEG will generate new code, and it really mucks up your code diffs.
  3. Lucene’s deliberate simplicity can make it difficult to write more complex rules.

We may eventually decide that $SOME_OTHER_TOOL is more suitable for our needs, but for the foreseeable future, the power our Mailgun DSL brings to the table is more than sufficient.

The End, Finally

At this point, we hope we’ve demonstrated how writing your own DSL could be useful. Our implementation processes an enormous number of events per day while making the lives of many of our team members much easier (at least that’s what we tell them). 

As for the features it supports, we’re just getting started. If this has inspired you to look into implementing your own stream processing grammar, please let us know. We’d love to hear about it!

Tags: | | |

Modified on: September 13, 2019

Stay up-to-date with our blog & new email resources

We'll let you know when we add new email resources and blog posts. We promise not to spam you.