Monday, April 21, 2008

Grammar noise cancellation

Imagine a simple Irony grammar written in C# and capable of parsing the following string into a list of greetings:

mjallo dude howdy dudess hey "gal"

This could be written like so:

class GreetGrammar : Grammar
{
 public GreetGrammar()
 {
   var stringlit = new StringLiteral("stringlit");
   var id = new IdentifierTerminal("id");
   var homergreeting = new NonTerminal("homergreeting", "mjallo" + id);
   var cowboygreeting = new NonTerminal("cowboygreeting", "howdy" + id);
   var greeting = new NonTerminal("greeting",
                          homergreeting |
                          "hey" + stringlit |
                          cowboygreeting);
   var program = new NonTerminal("program",
                          greeting.Star());
   Root = program;
 }
}

There’s quite a few characters of noise in this, compared to a clean EBNF syntax. Let’s enumerate what’s distracting:

  1. 1. Every declaration has the name declared twice: As an identifier and again in a string.
  2. 2. "new NonTerminal" is repeated a lot and hinders readability.
  3. 3. "+" is used as a so called sequence operator, whereas BNF uses a space.
  4. 4. "var" is distracting.
  5. 5. The *, ? and + operators in EBNF are written as method calls: Star(), Q() and Plus() respectively.

If we switch to the Boo language, then with three little Boo macros, we can get rid of 1 and 2 to render this:

class BooGrammar(Grammar):
  def constructor():
     stringliteral stringlit
     identifier id
     rule cowboygreeting, "howdy" + id
     rule homergreeting, "mjallo" + id
     rule greeting, homergreeting | ("hey" + stringlit) | cowboygreeting
     rule program, greeting.Star()
     Root = program 

That’s a little bit nicer, innit? Note that the var keyword is implicit in Boo, so that’s a free ride.

The only new noise added is the parenthesis around "hey" + stringLit. I believe it has something to do with a difference (between C# and Boo) in operator precedence and operator overloading. If you omit the parenthesis, the grammar compiles, but it won’t parse the input string properly.

So how about noise 3 and 5? Can we get rid of those as well?

The problem with the sequence operator (3) is that space is not an operator in either C# or Boo, so there’s nothing to overload.

Let’s look at the three operators, *, ? and +. Well in general purpose languages like C# and Boo, * and + are binary operators. In EBNF they are unary. Therefore it’s not possible to steal them. And ? is a ternary operator in C#, and it’s not an operator at all in Boo.

Still, this is a small step towards noise free executable grammars in .NET.

No comments:

Post a Comment