A foundation for basic parsers that tokenizes input content and allows parsers to easily access and use those tokens. A TokenStream object represents the stream of Token objects, each of which represents a word, symbol, comment, or other lexically relevant piece of information. This simple framework makes it very easy to create a parser that walks through (or "consumes") the tokens in the order they appear and does something useful with that content (usually creating another representation of the content, such as a domain-specific Abstract Syntax Tree or object model).
The parts
This simple framework consists of a few pieces that fit together to do the whole job of parsing input content.
The Tokenizer is responsible for consuming the character-level input content and constructing Token objects for the different words, symbols, or other meaningful elements contained in the content. Each Token object is a simple object that records the character(s) that make up the token's value, but it does this in a very lightweight and efficient way by pointing to the original character stream. Each token can be assigned a parser-specific integral token type that makes it easier to quickly figure out later in the process what kind of information each token represents. The general idea is to keep the Tokenizer logic very simple; very often Tokenizers will merely look for the different kinds of characters (e.g., symbols, letters, digits, etc.) as well as things like quoted strings and comments. Note that Tokenizers are never called directly by the parser; instead they are always given to the TokenStream, which then calls the Tokenizer at the appropriate time.
The TokenStream is supplied the input content, a Tokenizer implementation, and a few options. Its job is to prepare the content for processing, call the Tokenizer implementation to create the series of Token objects, and then provide an interface for walking through and consuming the tokens. This interface makes it possible to discover the value and type of the current token, and to consume the current token and move to the next one. Plus, the interface has been designed to make the code that works with the tokens as readable as possible.
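In practice, that flow looks something like this minimal sketch (the MyTokenizer name is illustrative only; any Tokenizer implementation would do):

TokenStream tokens = new TokenStream(content, new MyTokenizer(), true); // case-sensitive
tokens.start(); // tokenize the content and position on the first token
while (tokens.hasNext()) {
    String value = tokens.consume(); // return the current token's value and advance
    // ... do something with the value ...
}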
The final component in this framework is the Parser. The parser is really any class that takes as input the content to be parsed and outputs some meaningful information. The parser does this by defining the Tokenizer, constructing a TokenStream object, and then using the TokenStream to walk through the sequence of Tokens and produce some meaningful representation of the content. Parsers can create instances of some object model, or they can create a domain-specific Abstract Syntax Tree representation.
The benefit of dividing the responsibilities along these lines is that the TokenStream implementation is able to encapsulate quite a bit of tedious but very useful functionality, while still allowing a lot of flexibility as to what makes up the different tokens. It also makes the parser very easy to write and read (and thus maintain), without placing many restrictions on how that logic is to be defined. Plus, because the TokenStream takes responsibility for tracking the position of every token (including line and column numbers), it can automatically produce meaningful errors.
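For example, a parser's caller can report those positions when a parse fails. The sketch below is hypothetical: the getPosition(), getLine(), and getColumn() accessors are assumptions inferred from the Position argument that the ParsingException constructor accepts later in this document, not confirmed API:

try {
    parser.parse(content); // some parser built on a TokenStream
} catch (ParsingException e) {
    Position pos = e.getPosition(); // assumed accessor
    System.err.println("Parse error at line " + pos.getLine() // assumed accessors
                       + ", column " + pos.getColumn() + ": " + e.getMessage());
}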
Consuming tokens
A parser works with the tokens on the TokenStream using a variety of methods:
- The #start() method must be called before any of the other methods. It performs initialization and tokenizing, and prepares the internal state by finding the first token and setting an internal current token reference.
- The #hasNext() method can be called repeatedly to determine if there is another token after the current token. This is often useful when an unknown number of tokens is to be processed, and it behaves very similarly to the Iterator#hasNext() method.
- The #consume() method returns the Token#value() of the current token and moves the current token pointer to the next available token.
- The #consume(String) and #consume(char) methods look at the current token and ensure the token's Token#value() matches the value supplied as a method parameter, or they throw a ParsingException if the values don't match. The #consume(int) method works similarly, except that it attempts to match the token's Token#type(). And the #consume(String,String...) method is a convenience method that is equivalent to calling #consume(String) for each of the arguments.
- The #canConsume(String) and #canConsume(char) methods look at the current token and check whether the token's Token#value() matches the value supplied as a method parameter. If there is a match, the method advances the current token reference and returns true. Otherwise, the current token does not match and the method returns false without advancing the current token reference or throwing a ParsingException. Similarly, the #canConsume(int) method checks the token's Token#type() rather than the value, consuming the token and returning true if there is a match, or just returning false if there is no match. The #canConsume(String,String...) method determines whether all of the supplied values can be consumed in the given order.
- The #matches(String) and #matches(char) methods look at the current token and check whether the token's Token#value() matches the value supplied as a method parameter. The method then returns whether there was a match, but does not advance the current token pointer. Similarly, the #matches(int) method checks the token's Token#type() rather than the value. The #matches(String,String...) method is a convenience method that is equivalent to calling #matches(String) for each of the arguments, and the #matches(int,int...) method is a convenience method that is equivalent to calling #matches(int) for each of the arguments.
- The #matchesAnyOf(String,String...) method looks at the current token and checks whether the token's Token#value() matches at least one of the values supplied as method parameters. The method then returns whether there was a match, but does not advance the current token pointer. Similarly, the #matchesAnyOf(int,int...) method checks the token's Token#type() rather than the value.
With these methods, it's very easy to create a parser that looks at the current token to decide what to do, consumes that token, and repeats the process.
Example parser
Here is an example of a very simple parser that handles limited SQL SELECT and DELETE statements, such as SELECT * FROM Customers or SELECT Name, StreetAddress AS Address, City, Zip FROM Customers or DELETE FROM Customers WHERE Zip=12345:
public class SampleSqlSelectParser {
    public List<Statement> parse( String ddl ) {
        TokenStream tokens = new TokenStream(ddl, new SqlTokenizer(), false);
        List<Statement> statements = new LinkedList<Statement>();
        tokens.start();
        while (tokens.hasNext()) {
            if (tokens.matches("SELECT")) {
                statements.add(parseSelect(tokens));
            } else {
                statements.add(parseDelete(tokens));
            }
        }
        return statements;
    }

    protected Select parseSelect( TokenStream tokens ) throws ParsingException {
        tokens.consume("SELECT");
        List<Column> columns = parseColumns(tokens);
        tokens.consume("FROM");
        String tableName = tokens.consume();
        return new Select(tableName, columns);
    }

    protected List<Column> parseColumns( TokenStream tokens ) throws ParsingException {
        List<Column> columns = new LinkedList<Column>();
        if (tokens.matches('*')) {
            tokens.consume(); // leave the columns empty to signal a wildcard
        } else {
            // Read column names as long as each is followed by a ',' ...
            do {
                String columnName = tokens.consume();
                if (tokens.canConsume("AS")) {
                    String columnAlias = tokens.consume();
                    columns.add(new Column(columnName, columnAlias));
                } else {
                    columns.add(new Column(columnName, null));
                }
            } while (tokens.canConsume(','));
        }
        return columns;
    }

    protected Delete parseDelete( TokenStream tokens ) throws ParsingException {
        tokens.consume("DELETE", "FROM");
        String tableName = tokens.consume();
        tokens.consume("WHERE");
        String lhs = tokens.consume();
        tokens.consume('=');
        String rhs = tokens.consume();
        return new Delete(tableName, new Criteria(lhs, rhs));
    }
}

public abstract class Statement { ... }
public class Select extends Statement { ... }
public class Delete extends Statement { ... }
public class Column { ... }
public class Criteria { ... }
This example shows an idiomatic way of writing a parser that is stateless and thread-safe. The parse(...) method takes the input as a parameter and returns the domain-specific representation that resulted from the parsing. All other methods are utility methods that simply encapsulate common logic or make the code more readable.
In the example, the parse(...) method first creates a TokenStream object (using a Tokenizer implementation that is not shown), and then loops as long as there are more tokens to read. As it loops, if the next token is "SELECT", the parser calls the parseSelect(...) method, which immediately consumes a "SELECT" token, the names of the columns separated by commas (or a '*' if all columns are to be selected), a "FROM" token, and the name of the table being queried. The parseSelect(...) method returns a Select object, which is then added to the list of statements in the parse(...) method. The parser handles "DELETE" statements in a similar manner.
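To round out the example, here is a minimal usage sketch; it assumes the Statement, Select, Delete, Column, and Criteria classes above are fully implemented:

SampleSqlSelectParser parser = new SampleSqlSelectParser();
List<Statement> statements = parser.parse("SELECT Name, City FROM Customers");
// 'statements' now holds a single Select for the "Customers" table
// with the "Name" and "City" columns (and no aliases)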
Case sensitivity
Very often grammars do not require the case of keywords to match. This can make parsing a challenge, because otherwise every combination of case would have to be checked explicitly. The TokenStream framework provides a very simple solution that requires no more effort than supplying a boolean parameter to the constructor.
When a false value is provided for the caseSensitive parameter, the TokenStream performs all matching operations as if each token's value were in uppercase only. This means that the arguments supplied to the matches(...), canConsume(...), and consume(...) methods should be upper-cased. Note that the actual value of each token remains in the case as it appears in the input.
Of course, when the TokenStream is created with a true value for the caseSensitive parameter, the matching is performed using the actual value as it appears in the input content.
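For instance, here is a minimal sketch of case-insensitive matching; it reuses the SqlTokenizer from the earlier example (whose implementation is not shown):

TokenStream tokens = new TokenStream("select * from Customers", new SqlTokenizer(), false);
tokens.start();
tokens.consume("SELECT"); // matches the lowercase "select" in the input
tokens.consume('*');
tokens.consume("FROM");
String tableName = tokens.consume(); // returns "Customers" in its original case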
Whitespace
Many grammars are independent of line breaks or whitespace, allowing a lot of flexibility when writing the content. The TokenStream framework makes it very easy to ignore line breaks and whitespace: the Tokenizer implementation must simply not include the line break character sequences and whitespace in the token ranges. Since none of the tokens contain whitespace, the parser never has to deal with it.
Of course, many parsers will require that some whitespace be included. For example, whitespace within a quoted string may be
needed by the parser. In this case, the Tokenizer should simply include the whitespace characters in the tokens.
Writing a Tokenizer
Each parser will likely have its own Tokenizer implementation that contains the parser-specific logic about how to break the content into token objects. Generally, the easiest way to do this is to simply iterate through the character sequence passed into the Tokenizer#tokenize(TokenStream.CharacterStream,TokenStream.Tokens) method, and use a switch statement to decide what to do.
Here is the code for a very basic Tokenizer implementation that ignores whitespace and line breaks, optionally captures Java-style (multi-line and end-of-line) comments, and constructs a single token for each quoted string.
public class BasicTokenizer implements Tokenizer {

    // Parser-specific token types assigned by this tokenizer. (The exact
    // constant values are illustrative; any distinct integers would do.)
    public static final int WORD = 1;
    public static final int SYMBOL = 2;
    public static final int DECIMAL = 4;
    public static final int SINGLE_QUOTED_STRING = 8;
    public static final int DOUBLE_QUOTED_STRING = 16;
    public static final int COMMENT = 32;

    // Whether comment tokens should be included in the token stream ...
    private final boolean useComments;

    public BasicTokenizer( boolean useComments ) {
        this.useComments = useComments;
    }

    public void tokenize( CharacterStream input,
                          Tokens tokens ) throws ParsingException {
        while (input.hasNext()) {
            char c = input.next();
            switch (c) {
                case ' ':
                case '\t':
                case '\n':
                case '\r':
                    // Just skip these whitespace characters ...
                    break;
                case '-':
                case '(':
                case ')':
                case '{':
                case '}':
                case '*':
                case ',':
                case ';':
                case '+':
                case '%':
                case '?':
                case '$':
                case '[':
                case ']':
                case '!':
                case '<':
                case '>':
                case '|':
                case '=':
                case ':':
                    tokens.addToken(input.index(), input.index() + 1, SYMBOL);
                    break;
                case '.':
                    tokens.addToken(input.index(), input.index() + 1, DECIMAL);
                    break;
                case '\"':
                    int startIndex = input.index();
                    Position startingPosition = input.position();
                    boolean foundClosingQuote = false;
                    while (input.hasNext()) {
                        c = input.next();
                        if (c == '\\' && input.isNext('"')) {
                            c = input.next(); // consume the " character since it is escaped
                        } else if (c == '"') {
                            foundClosingQuote = true;
                            break;
                        }
                    }
                    if (!foundClosingQuote) {
                        throw new ParsingException(startingPosition, "No matching closing double quote found");
                    }
                    int endIndex = input.index() + 1; // beyond last character read
                    tokens.addToken(startIndex, endIndex, DOUBLE_QUOTED_STRING);
                    break;
                case '\'':
                    startIndex = input.index();
                    startingPosition = input.position();
                    foundClosingQuote = false;
                    while (input.hasNext()) {
                        c = input.next();
                        if (c == '\\' && input.isNext('\'')) {
                            c = input.next(); // consume the ' character since it is escaped
                        } else if (c == '\'') {
                            foundClosingQuote = true;
                            break;
                        }
                    }
                    if (!foundClosingQuote) {
                        throw new ParsingException(startingPosition, "No matching closing single quote found");
                    }
                    endIndex = input.index() + 1; // beyond last character read
                    tokens.addToken(startIndex, endIndex, SINGLE_QUOTED_STRING);
                    break;
                case '/':
                    startIndex = input.index();
                    if (input.isNext('/')) {
                        // End-of-line comment ...
                        boolean foundLineTerminator = false;
                        while (input.hasNext()) {
                            c = input.next();
                            if (c == '\n' || c == '\r') {
                                foundLineTerminator = true;
                                break;
                            }
                        }
                        endIndex = input.index(); // the token won't include the '\n' or '\r' character(s)
                        if (!foundLineTerminator) ++endIndex; // must point beyond last char
                        if (c == '\r' && input.isNext('\n')) input.next();
                        if (useComments) {
                            tokens.addToken(startIndex, endIndex, COMMENT);
                        }
                    } else if (input.isNext('*')) {
                        // Multi-line comment ...
                        while (input.hasNext() && !input.isNext('*', '/')) {
                            c = input.next();
                        }
                        if (input.hasNext()) input.next(); // consume the '*'
                        if (input.hasNext()) input.next(); // consume the '/'
                        if (useComments) {
                            endIndex = input.index() + 1; // the token will include the '/' and '*' characters
                            tokens.addToken(startIndex, endIndex, COMMENT);
                        }
                    } else {
                        // just a regular slash ...
                        tokens.addToken(startIndex, startIndex + 1, SYMBOL);
                    }
                    break;
                default:
                    startIndex = input.index();
                    // Read until another whitespace/symbol/decimal/slash is found ...
                    while (input.hasNext() && !(input.isNextWhitespace() || input.isNextAnyOf("/.-(){}*,;+%?$[]!<>|=:"))) {
                        c = input.next();
                    }
                    endIndex = input.index() + 1; // beyond last character that was included
                    tokens.addToken(startIndex, endIndex, WORD);
            }
        }
    }
}
A Tokenizer with exactly this behavior can actually be obtained from the #basicTokenizer(boolean) method. So while this very basic implementation is not suited to every situation, it may be useful in many.
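For example, assuming #basicTokenizer(boolean) is a static factory method on TokenStream whose boolean controls whether comment tokens are included, usage might look like this sketch:

// Obtain a tokenizer equivalent to the one above, excluding comments ...
TokenStream tokens = new TokenStream(content, TokenStream.basicTokenizer(false), true);
tokens.start();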