Monday, September 28, 2009

Writing fast protocol-compliant programs quickly

I've used bison and flex for some text parsing. Another project of mine was done with ANTLR (see xssprotect): it's basically an HTML tag filter that allows well-known tags and attributes and removes all others. Flex and bison work together, but you need to stay on your toes to understand which does what. The combination isn't suitable for all kinds of parsing, and one of the reasons programming languages are unambiguous by nature is so that they compile into machine code or interpreter code exactly as the programmer intended, without unwanted side effects. Regular file parsing with flex and bison becomes more difficult once you don't have control over the format of the input file; that is, if you don't control what it should look like. This is because bison depends on the lexer, and the lexer has to output the correct tokens so that bison can apply them properly. You could say the lexer is more about syntax, while bison is more about semantics and ordering.

So flex feeds bison, and bison allows you to execute actions once things are recognized. The general tutorials all over the internet always show the same example: a calculator. Some code in flex looks like this:


%{
#include <stdio.h>
%}
%s BRACKET
%%
[A-Za-z0-9]+          { printf("Symbol: %s\n", yytext); }
"("                   { BEGIN BRACKET; }
")"                   { BEGIN INITIAL; /* switch back */ }
<BRACKET>[A-Za-z \t]+ { printf("BRACKET: %s\n", yytext); }
.|\n                  { /* ignore everything else */ }
%%
int yywrap(void) { return 1; }

int main(int argc, char *argv[])
{
    yylex();
    printf("Bye...Bye...\n");
    return 0;
}

This is input to flex. It just prints symbols outside any context, and when it encounters a bracket ( or ), it switches to a different state, so that certain tokens can be disregarded, or you can start regarding them. The code between curly brackets is basically standard C. More advanced lexers typically use it to return a token value (an integer) so that bison can use it; bison would then typically attach it to a certain context.
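In such a setup, the rules return token numbers instead of printing. A minimal sketch, assuming the bison-generated header is called calc.tab.h and declares a NUM token and yylval:

%{
#include <stdlib.h>     /* for atoi */
#include "calc.tab.h"   /* bison-generated token definitions (assumed name) */
%}
%%
[0-9]+    { yylval = atoi(yytext); return NUM; } /* hand the value to bison */
[-+*/()=] { return yytext[0]; }                  /* single-char tokens as-is */
[ \t]+    { /* skip whitespace */ }
\n        { return '\n'; }
%%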

Hence, it becomes clear why programming languages have so many magic tokens, like { ( [ ] ) } " ' 0x * & ^ % $ # @ \ | ; : and so on. Most of these tokens are delimiters that trigger some kind of action. The { } tokens are probably the most interesting, as many mature languages use them to control the scope of variables and instructions.

Bison scripts look like the following:


input:    /* empty */
        | input line
;
line:     '\n'
        | exp '\n'           { printf ("\t%.10g\n", $1); }
        | error '\n'         { yyerrok; }
;
exp:      NUM                { $$ = $1; }
        | VAR                { $$ = $1->value.var; }
        | VAR '=' exp        { $$ = $3; $1->value.var = $3; }
        | FNCT '(' exp ')'   { $$ = (*($1->value.fnctptr))($3); }
        | exp '+' exp        { $$ = $1 + $3; }
        | exp '-' exp        { $$ = $1 - $3; }
        | exp '*' exp        { $$ = $1 * $3; }
        | exp '/' exp        { $$ = $1 / $3; }
        | '-' exp %prec NEG  { $$ = -$2; }
        | exp '^' exp        { $$ = pow ($1, $3); }
        | '(' exp ')'        { $$ = $2; }
;
/* End of grammar */
%%

Notice how, here too, C instructions permeate the workings of the parser. The parser basically keeps a stack of tokens and expressions that were processed before (a "1" is both a token and an expression; "1+1" is an expression composed of two other expressions and an operator). The calculator grammar above shows how bison works iteratively through your sentence and builds a sort of tree structure. This tree structure can be influenced to take operator precedence into account. You could also choose to make it contextual here.
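Precedence is declared in the first section of the bison file. For the calculator above, which is the mfcalc example from the bison manual, those declarations look like this; operators declared later bind more tightly, and %left/%right set associativity:

%right '='
%left '-' '+'
%left '*' '/'
%left NEG     /* negation--unary minus */
%right '^'    /* exponentiation */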

The differences between flex and bison should become somewhat clear from these examples. The lexer doesn't know anything about the meaning of what it is processing, just that it complies with the lexing rules. An integer is a sequence of digits 0-9 without any dots or commas. A float is a sequence of digits 0-9 with a dot or comma somewhere in the middle (followed by more digits). A char token contains alphabetic characters. You could easily construct your own language by prefixing or postfixing them and then making sure the rules are processed in the right order.
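Expressed as flex rules, those definitions might look like this (a sketch; the token names are invented):

[0-9]+           { return INTEGER; }  /* digits only */
[0-9]+[.,][0-9]+ { return FLOAT; }    /* digits, a dot or comma, more digits */
[A-Za-z]+        { return WORD; }     /* alphabetic characters only */

Flex always prefers the longest match, so "1.5" becomes one FLOAT rather than an INTEGER followed by leftovers; between equally long matches, the rule listed first wins, which is why rule order matters.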

You could also make the parser interactive, so it becomes a command line application. Rather than verifying that some exact input was entered, you could now attempt to handle whatever was typed. So basically you could create some kind of adventure game with the above. In the adventure game, you'd probably end up recognizing verbs specifically, but use a more dynamic method for dealing with nouns (objects and rooms differ from game to game). That way, the engine is reusable for entirely different contexts, as sketched below.
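A grammar for such a game could start out as small as this (a hypothetical sketch: the VERB and NOUN tokens come from the lexer, and handle_action and lookup_object are made-up helpers that would resolve nouns against the current game's object table at runtime):

command:  VERB NOUN  { handle_action($1, lookup_object($2)); }
        | VERB       { handle_action($1, NULL); }
;

Keeping the noun lookup in C code rather than in the grammar is what makes the engine reusable: the verbs stay fixed, while the set of objects and rooms is just data.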

Now that you know about bison, you'll find that ragel and lemon are worth a look as well. These tools have been used to parse SQL (lemon is the parser generator behind SQLite) or geographic data (libgeom) for calculating paths and distances and so on. Without too much effort, you could script together a method for doing geometric calculations that way, such that you can invoke it from the command line.
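Lemon grammars read much like bison's, except that rules end with a period and values get named labels instead of $1/$3. A minimal sketch of an expression grammar (the token names are assumptions):

%token_type { int }
%left PLUS MINUS.
%left TIMES DIVIDE.
expr(A) ::= expr(B) PLUS expr(C).   { A = B + C; }
expr(A) ::= expr(B) MINUS expr(C).  { A = B - C; }
expr(A) ::= expr(B) TIMES expr(C).  { A = B * C; }
expr(A) ::= expr(B) DIVIDE expr(C). { A = B / C; }
expr(A) ::= NUM(B).                 { A = B; }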

Notice how the parser does force you to think in tree structures though. Check out this link for why state machines are quite interesting from a protocol design / implementation perspective. Often, when you write the parsing code yourself, you find yourself knee-deep in invalid states, states that should not be reachable from certain contexts, and so on. Why program that yourself? :)
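To see why, here's a sketch of what a hand-rolled recognizer for simple "key=value" lines looks like in plain C (an illustration, not any particular protocol); every state has to consider every input class, and each new rule means revisiting all of them:

#include <stdio.h>

/* hand-rolled state machine: forgetting one transition leaves a hole */
enum state { IN_KEY, IN_VALUE, ERROR };

void parse_line(const char *s)
{
    enum state st = IN_KEY;
    for (; *s && st != ERROR; s++) {
        switch (st) {
        case IN_KEY:
            if (*s == '=') st = IN_VALUE;     /* key finished, value starts */
            else if (*s == '\n') st = ERROR;  /* key without a value */
            break;
        case IN_VALUE:
            if (*s == '=') st = ERROR;        /* a second '=' is invalid */
            break;
        default:
            break;
        }
    }
    printf("%s\n", st == IN_VALUE ? "ok" : "invalid");
}

int main(void)
{
    parse_line("name=flex");  /* prints: ok */
    parse_line("no-equals");  /* prints: invalid */
    return 0;
}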

State machines are also often used in artificial intelligence, so in certain places it makes sense to mix these things with artificial reasoning. The problem with state machines is that it's possible to have missed some dead end; in that case, the machine may get stuck forever, unable to find a next state to move to. Ragel, though, has a nice way of showing its rules in a directed graph. That should help make things clearer and spot those difficult places.
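Ragel's -V option emits the machine as a Graphviz dot file, so (assuming Graphviz is installed, and with machine.rl as a hypothetical input file) rendering the state diagram is a two-liner:

ragel -V machine.rl > machine.dot
dot -Tpng machine.dot -o machine.png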
