In building Exospecies I ended up with a lot of different types of files that needed to be parsed: HTML files, settings files, Prolog files, and more. I researched the best tools for parsing (and there are many!), but they all seemed like big black boxes that I was going to spend a lot of time debugging, understanding, and learning how to use. I wanted something really simple, lightweight, and debuggable. Plus, I didn’t think I’d really understand any of them if I didn’t understand how parsers worked in the first place. Thus, the Inductor Parser Engine was born! It is the bottom layer of the Exospecies HTN Architecture.
The code for the parser is available on GitHub. This blog post will describe how to use it.
Creating a Grammar
IndParser is a PEG (Parsing Expression Grammar) parser, a type of Recursive Descent Parser that is simple to understand and implement. I chose this approach because the way you express your parser rules is very readable, and the engine itself is straightforward to build.
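The core recursive descent idea can be sketched in a few lines of plain C++ (this is an illustrative sketch, not IndParser's code): each grammar rule becomes a function that tries to match at a position, advancing on success and restoring the position ("backtracking") on failure.

```cpp
#include <cassert>
#include <string>

// Each rule is a function: try to match at `pos`, advance on success,
// restore `pos` on failure so another rule can try from the same spot.
bool literal(const std::string& text, size_t& pos, const std::string& what) {
    if (text.compare(pos, what.size(), what) == 0) {
        pos += what.size();
        return true;
    }
    return false;
}

// greeting ::= "hello" " " "world"   (an ordered sequence of pieces)
bool greeting(const std::string& text, size_t& pos) {
    size_t saved = pos;                       // remember where we started
    if (literal(text, pos, "hello") &&
        literal(text, pos, " ") &&
        literal(text, pos, "world"))
        return true;
    pos = saved;                              // backtrack on failure
    return false;
}
```

Because each rule is just a function that calls other rules, the call stack mirrors the grammar, which is what makes this style so easy to follow.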
It is written in C++ so it can be ported across many platforms (PC, Mac, Xbox, etc.), and uses C++ Templates because I found them to be a very natural way to express grammars.
To create a grammar using IndParser, you write a set of rules that describe what you expect to find in the string you want to parse. For example, suppose you want to parse simple name/value pairs, like the settings in a configuration file.
You would declare what you are looking for, in the order you expect it: a name, a separator, and a value.
Nice and easy! You write the grammar using the built-in IndParse C++ template classes.
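To show the flavor of the template style, here is a sketch in the same spirit (the class names here are invented for illustration; see the GitHub repo for the real IndParse templates). Each rule is a type with a static parse method that advances on success and backtracks on failure:

```cpp
#include <cassert>
#include <cctype>
#include <string>

struct State { const std::string& text; size_t pos; };

// Match one specific character.
template <char C> struct Char {
    static bool parse(State& s) {
        if (s.pos < s.text.size() && s.text[s.pos] == C) { s.pos++; return true; }
        return false;
    }
};

// Match one letter or one digit.
struct Letter {
    static bool parse(State& s) {
        if (s.pos < s.text.size() && isalpha((unsigned char)s.text[s.pos])) { s.pos++; return true; }
        return false;
    }
};
struct Digit {
    static bool parse(State& s) {
        if (s.pos < s.text.size() && isdigit((unsigned char)s.text[s.pos])) { s.pos++; return true; }
        return false;
    }
};

// Match the rule as many times as possible, at least once.
template <class Rule> struct OneOrMore {
    static bool parse(State& s) {
        if (!Rule::parse(s)) return false;
        while (Rule::parse(s)) {}
        return true;
    }
};

// Ordered sequence: every piece must match, in order.
template <class First, class... Rest> struct And {
    static bool parse(State& s) {
        size_t saved = s.pos;
        if (First::parse(s) && And<Rest...>::parse(s)) return true;
        s.pos = saved;   // backtrack so another rule can try from here
        return false;
    }
};
template <class Last> struct And<Last> {
    static bool parse(State& s) { return Last::parse(s); }
};

// setting ::= name '=' integer
using Name    = OneOrMore<Letter>;
using Integer = OneOrMore<Digit>;
using Setting = And<Name, Char<'='>, Integer>;
```

The nice property is that the type declaration at the bottom reads almost exactly like the grammar rule it implements.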
You declare a class that will be your rule and derive it from the built-in template classes that describe what the rule should match.
Because this is using C++ Templates, there is some extra goo you have to put in. For example, all of the built-in rules normally take a first argument that is “the type of thing you are looking for,” and the character and sequence rules all follow this pattern.
Parsing Using the Grammar
To actually run the grammar on a string, you call the static parsing method on your top-level rule, and it returns a tree of symbols representing everything that was matched.
Tuning a Grammar
After you end up walking these trees a bunch, two things become obvious: they are much deeper than they need to be, and they are full of intermediate nodes and characters you don’t actually care about.
The first issue is solved by a concept called “Flattening.” Most rules have a template argument that controls what happens to their node when the tree is flattened.
To take advantage of Flattening, you have one more step after parsing: a call that walks the tree and collapses nodes according to each rule’s flattening setting.
Turns out the default flattening behavior removes most of the intermediate nodes for you. So, after running the flatten step, the tree collapses down to just the nodes that carry real information. Much simpler! However, the tree still contains characters from the string (like the = separator) that we don’t care about, so we can tune the rule to discard them.
Here’s the updated rule: we mark the separator so that it gets thrown away during flattening. Now if we parse and flatten, we get a tree containing just the name and the value.
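Flattening itself is a simple tree transformation. Here is a sketch of the idea (the names are invented for illustration, not the actual IndParse template arguments): each node carries a policy saying whether to Keep it, Flatten it (replace it with its children), or Delete it entirely.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Policy { Keep, Flatten, Delete };

struct Node {
    std::string name;
    Policy policy;
    std::vector<Node> children;
};

// Returns what `n` becomes in the flattened tree: nothing (Delete), its
// flattened children (Flatten), or itself with flattened children (Keep).
std::vector<Node> flatten(const Node& n) {
    std::vector<Node> flatKids;
    for (const Node& child : n.children)
        for (Node& f : flatten(child))
            flatKids.push_back(std::move(f));
    switch (n.policy) {
        case Policy::Delete:  return {};
        case Policy::Flatten: return flatKids;
        default:              return { Node{ n.name, Policy::Keep, std::move(flatKids) } };
    }
}
```

With the setting rule marked Flatten and the = marked Delete, a deep parse tree collapses to a document node whose children are just the name and the integer.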
Full Example Code
Putting it all together: you create the rule, parse the string, flatten the tree, and walk it to get those two values out.
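As a self-contained sketch of the whole flow (not the actual IndParse code, which is on GitHub), here is a function that parses a hypothetical Name=Integer setting and pulls both values out:

```cpp
#include <cassert>
#include <cctype>
#include <string>

struct ParsedSetting { std::string name; int value; };

bool parseSetting(const std::string& text, ParsedSetting& out) {
    size_t pos = 0;

    // name: one or more letters
    size_t nameStart = pos;
    while (pos < text.size() && isalpha((unsigned char)text[pos])) pos++;
    if (pos == nameStart) return false;
    out.name = text.substr(nameStart, pos - nameStart);

    // the '=' separator (discarded, like a flattened-away node)
    if (pos >= text.size() || text[pos] != '=') return false;
    pos++;

    // value: one or more digits
    size_t numStart = pos;
    while (pos < text.size() && isdigit((unsigned char)text[pos])) pos++;
    if (pos == numStart) return false;
    out.value = std::stoi(text.substr(numStart, pos - numStart));

    // the grammar must consume the entire string
    return pos == text.size();
}
```

Note the final check: the parse only succeeds if the whole string was consumed, which comes up again below.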
Creating Multiple Rules
But wait! What if you want settings to contain more than just integers? You can define a rule for each kind of value you support, and then a rule that tries each alternative in order.
That is how you build up a more complex grammar: defining rules and using them within other rules. Whichever alternative matches, the parse tree contains a node for the rule that matched, so code walking the tree can tell which kind of value it got.
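The alternative-picking rule is PEG's "ordered choice," which can be sketched like this (again, invented names for illustration, not the actual IndParse templates): try each alternative at the same position; the first one that matches wins.

```cpp
#include <cassert>
#include <cctype>
#include <string>

struct State { const std::string& text; size_t pos; };

struct Digit {
    static bool parse(State& s) {
        if (s.pos < s.text.size() && isdigit((unsigned char)s.text[s.pos])) { s.pos++; return true; }
        return false;
    }
};
template <char C> struct Char {
    static bool parse(State& s) {
        if (s.pos < s.text.size() && s.text[s.pos] == C) { s.pos++; return true; }
        return false;
    }
};
template <class Rule> struct OneOrMore {
    static bool parse(State& s) {
        if (!Rule::parse(s)) return false;
        while (Rule::parse(s)) {}
        return true;
    }
};

// Ordered choice: try each alternative in order; first success wins.
template <class First, class... Rest> struct Or {
    static bool parse(State& s) {
        size_t saved = s.pos;
        if (First::parse(s)) return true;
        s.pos = saved;                    // rewind before trying the next one
        return Or<Rest...>::parse(s);
    }
};
template <class Last> struct Or<Last> {
    static bool parse(State& s) { return Last::parse(s); }
};

using Integer = OneOrMore<Digit>;
// A toy boolean literal: 't' or 'f' (stand-ins for "true"/"false").
using Bool    = Or<Char<'t'>, Char<'f'>>;
// A value is an integer or a boolean.
using Value   = Or<Integer, Bool>;
```

Because choice is ordered rather than ambiguous, the grammar never has to guess: the first alternative that matches is the answer, which is part of what makes PEG parsers so predictable.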
It turns out that giving a good parse error is hard since the code can’t tell what you meant to type. However, the IndParser uses a heuristic that seems to work out pretty well: Assume that the error is in the deepest (meaning farthest down the tree) rule that was tried and failed. The idea being that we made the most progress there and so that’s probably where the error occurred.
To make it easy to return meaningful error messages, most of the built-in templates have an argument that lets you supply a custom error string to use when that rule fails.
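The "deepest failure" heuristic is cheap to implement. Here is a sketch (invented names, not IndParse's internals): every time any rule fails, remember the farthest position reached; if the whole parse fails, that position is the best guess for where the error is.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

struct State {
    const std::string& text;
    size_t pos;
    size_t deepestFailure;   // farthest position at which any rule failed
};

bool fail(State& s) {
    s.deepestFailure = std::max(s.deepestFailure, s.pos);
    return false;
}

bool matchChar(State& s, char c) {
    if (s.pos < s.text.size() && s.text[s.pos] == c) { s.pos++; return true; }
    return fail(s);
}

// rule: 'a' 'b' 'c'
bool parseAbc(State& s) {
    size_t saved = s.pos;
    if (matchChar(s, 'a') && matchChar(s, 'b') && matchChar(s, 'c')) return true;
    s.pos = saved;           // backtrack, but deepestFailure keeps its value
    return false;
}
```

Parsing "abd" fails, and deepestFailure ends up at index 2, pointing right at the bad character even though the parser backtracked to position 0.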
Your grammar needs to consume the entire string or it will fail.
If you tried the above example on a string with trailing text that the rule doesn’t match, the whole parse would fail even though a prefix of the string matched.
I usually have a rule called “document” to lay out the whole structure of a document, ending with an end-of-file check so nothing is left over. In this case it is simple: the document is just the setting rule followed by the end of the string.
So far we’ve been talking about Parsing, which is taking a string of text and turning it into a set of well-structured symbols. Compiling is that plus walking through those symbols and turning them into whatever-it-is-that-you-are-trying-to-build: taking HTML text and turning it into rendering objects, for example. IndParse can’t do that work for you, but it can give you a handy class to build that work on.
“All” you need to do is create a grammar, and then write the code that is custom to your Compile operation: a pass that walks the parse tree and builds whatever output you need.
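As the simplest possible sketch of a compile pass (invented names, not IndParse's helper class): walk an already-parsed-and-flattened tree and turn it into the thing you actually want, here a map of setting names to integer values.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Node {
    std::string rule;                // which grammar rule produced this node
    std::string text;                // the matched text (for leaf nodes)
    std::vector<Node> children;
};

std::map<std::string, int> compileSettings(const Node& document) {
    std::map<std::string, int> settings;
    for (const Node& setting : document.children) {
        // after flattening, each setting has exactly two children:
        // the name and the integer value
        settings[setting.children[0].text] = std::stoi(setting.children[1].text);
    }
    return settings;
}
```

The point is that because the grammar already did all the structural validation, the compile pass can be a straightforward tree walk with no error handling of its own.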
It turns out you don’t always declare rules properly, and the rules you do declare might give unexpected results. If a rule fails, how do you figure out what went wrong?
While I wish using a normal debugger was more helpful, it really isn’t. The parser engine is running all kinds of rules in various parts of the tree, backtracking, etc. It turns out to be really hard in practice to set a breakpoint that helps you.
Instead there are the obvious things: break down the rule into smaller pieces, comment things out, etc.
If that doesn’t work, the parser has (very verbose!) tracing you can turn on that will output information about each and every path the parser takes, so you can see where things fail. To turn on tracing, you just set a flag before running the parse.
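Here is a sketch of how this kind of rule tracing can work (invented names, not IndParse's actual tracing flag): a global switch makes every rule print when it is tried and whether it matched, indented by how deep in the grammar it is.

```cpp
#include <cassert>
#include <iostream>
#include <string>

static bool traceOn = false;
static int  traceDepth = 0;

// RAII helper: prints "try" on entry, the result on exit, indented by depth.
struct TraceScope {
    const char* rule;
    explicit TraceScope(const char* r) : rule(r) {
        if (traceOn)
            std::cout << std::string(traceDepth * 2, ' ') << "try  " << rule << "\n";
        traceDepth++;
    }
    void result(bool matched) {
        if (traceOn)
            std::cout << std::string((traceDepth - 1) * 2, ' ')
                      << (matched ? "ok   " : "fail ") << rule << "\n";
    }
    ~TraceScope() { traceDepth--; }
};

bool matchChar(const std::string& text, size_t& pos, char c) {
    TraceScope trace("Char");
    bool matched = pos < text.size() && text[pos] == c;
    if (matched) pos++;
    trace.result(matched);
    return matched;
}
```

A rule that wraps other rules gets its entry and exit printed around theirs, so the indentation of the trace mirrors the shape of the grammar.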
Let’s modify the rule so that it doesn’t allow whitespace, turn tracing on, and then try to parse a string that contains whitespace.
The structure of the trace output can be a little counterintuitive. You’ll notice it starts out indented: the indentation shows how deep in the grammar each rule is, so the first lines you see come from rules several levels down.
Now for the “deep” analysis, starting from the top: you can see the Lexer reading characters one by one, and you can see each rule being tried and then failing when it hits the unexpected whitespace.
OK, this was an easy one, but you get the idea. Sometimes the only way to debug these is to run the rule and dig through the traces.
Go use the parser to do good in the world! The code for the parser is available on GitHub.