Smacc: a Compiler-Compiler

SmaCC Directives

SmaCC has several directives that can change how the scanner and parser is generated. Each directive begins with a % character and the directive keyword. Depending on the directive, there may be a set of arguments. Finally, the directive is terminated with a semicolon character, ; as shown below:

%left "+" "-";
%left "*" "/";
%right "^";
%annotate_tokens;
%root Expression;
%prefix AST;
%suffix Node;
%ignore_variables leftParenToken rightParenToken;

Start Symbols

By default, the left-hand side of the first grammar rule is the start symbol. If you want to multiple start symbols, you can specify them by using the %start directive followed by the nonterminals that are additional start symbols. This is useful for creating two parsers with grammars that are similar but slightly different. For example, consider a Pharo parser. You can parse methods, and you can parse expressions. These are two different operations, but have similar grammars. Instead of creating two different parsers for parsing methods and expressions, we can specify one grammar that parses methods, and also specify an alternative start symbol for parsing expressions.

The StParser in the SmaCC Example Parsers package has an example of this. The method StParser class>>parseMethod: uses the startingStateForMethod position to parse methods and the method StParser class>>parseExpression: uses the startingStateForSequenceNode position to parse expressions.

For example if you add the following to an hypothetical grammar:

%start file expression statement declaration;

SmaCC will generate the following class methods on the parser: startingStateForfile, startingStateForexpression, startingStateForstatement and startingStateFordeclaration. Then you can parse a subpart as follows:

YourParser >> parseStatement: aString
	"Parse an statement."

	^ (self on: (ReadStream on: aString))
		setStartingState: self startingStateForstatement;
		parse

The ability to specify multiple start symbols is useful when you build your grammar incrementally. You might also want to create additionnal start symbols to test grammar features independently.

Id Methods

Internally, the various token types are represented as integers. However, there are times when you need to reference the token types. For example, in the CScanner and CParser classes, the TYPE_NAME token has a syntax identical in the IDENTIFIER token. To distinguish them, the IDENTIFIER matching method does a lookup in the type table: if it finds a type definition with the same name as the current IDENTIFIER, it returns the TYPE_NAME token type. To determine what integer this is, the parser includes an %id directive for <IDENTIFIER> and <TYPE_NAME>. This generates the IDENTIFIERId and TYPE_NAMEId methods on the scanner. These methods simply return the integer representing that token type. See the C sample scanner and parser for an example of how the %id directive is used.

Case Insensitive Scanning

You can specify that the scanner should ignore case differences by using the %ignorecase; directive. If you have a language that is case insensitive and has several keywords, this can be a handy feature. For example, if you have THEN as a keyword in a case insensitive language, you would need to specify the token for then as <then> : [tT] [hH] [eE] [nN] ;. This is a pain to enter correctly. When the ignorecase directive is used, SmaCC will automatically convert THEN into [tT][hH][eE][nN].

AST Directives

There are several directives that are used when creating AST's.

The %root directive is used to specify the root class in the AST hierarchy. The %root directive has a single argument that is the name that will be used to create the root class in the AST. This class will be created as a subclass of SmaCCParseNode.

The %prefix and %suffix directives tell SmaCC the prefix and suffix to add to create the node name for the AST node's class. This prefix and suffix are added to the name of every AST node, including the %root node. For example, the following will create a RBProgramNode class that is a subclass of SmaCCParseNode and is the root of all AST nodes defined by this parser.

%root Program;
%prefix RB;
%suffix Node;

By default all nodes created by SmaCC will be direct subclass of your %root class. However, you can specify the class hierarchy by using the %hierarchy directive. The syntax of the %hierarchy is %hierarchy SuperclassName "(" SubclassName + ");". If you have multiple subclasses, you can list all of them inside the parenthesis, separated by whitespace, as follows.

%hierarchy Program (Expression Statement);

Three AST directives deal with the generated classes' instance variables.

The %annotate_tokens tells SmaCC to generate instance variable names for any unnamed tokens in the grammar rules. Without this directive, an unnamed token will generate the warning:

Unnamed symbol in production. Without a variable name the value will be dropped from the parsed AST."

With the directive, a variable name will be auto-generated, using the name of the symbol follwed by Token. So the symbol <op> would be given the name <opToken>.

The %attributes directive allows you to add some extra instance variables to your classes. This enables you to later extend the generated classes to use those variables. The first argument to the %attributes directive is the node name (without the prefix and suffix); the second argument is a parenthesised list of variable names. For example, we could add an instance variable cachedValue to the Expression class with %attributes Expression (cachedValue);.

Note that if you do not use this directive, but simply add the instance variables to the classes by hand, SmaCC will remove them the next time that the classes are re-generated. Then your instance variables will become undeclared, and any code that uses them will start to behave unexpectedly. This can be the explanation for unexplained and inconsistent behaviour.

The final instance variable directive is %ignore_variables. When SmaCC creates the AST node classes, it automatically generates appropriate = and hash methods. By default, these methods use all instance variables when comparing for equality and computing the hash. The %ignore_variables directive allows you to specify that certain variables should be ignored. For example, you may wish to ignore parentheses when you compare expressions. If you named your ( token 'leftParen' and your ) token 'rightParen', then you can specify this with %ignore_variables leftParen rightParen;.

Dealing with Ambiguous Grammars

An ambiguity occurs in a grammar when for a single lookahead token, the parser can execute two or more different actions. Which one should the parser choose?

In traditionnal LR parsing, there are two types of conflicts between two actions that can occur: reduce/reduce and shift/reduce. A reduce/reduce conflict exists when the parser can either reduce using one rule in the grammar or reduce using another rule. A shift/reduce conflict exists when the parser can either shift the current lookahead token or reduce using a given rule.

LR, LALR and Ambiguous Grammars

When a LR or LALR parser is generated from an ambiguous grammar, conflicts will be displayed in the "Message" box. The resulting parser will choose an arbitrary action to execute between the ones available. Since it not a behaviour you usually want, you have several options:

The last two options will be detailed in the following sections.

Precedence Rules

When SmaCC encounters a shift/reduce conflict it will perform the shift action by default. However, you can control this action with the %left, %right, and %nonassoc directives. If a token has been declared in a %left directive, it means that the token is left-associative. Therefore, the parser will perform a reduce operation. However, if it has been declared as right-associative, it will perform a shift operation. A token defined as %nonassoc will produce an error if that is encountered during parsing. For example, you may want to specify that the equal operator, "=", is non-associative, so a = b = c is not parsed as a valid expression. All three directives are followed by a list of tokens.

Additionally, the %left, %right, and %nonassoc directives allow precedence to be specified. The order of the directives specifies the precedence of the tokens. The higher precedence tokens appear on the higher line numbers. For example, the following directive section gives the precedence for the simple calculator in our tutorial:

%left "+" "-";
%left "*" "/";
%right "^";

The symbols + and - appear on the first line, and hence have the lowest precedence. They are also left-associative, so 1 + 2 + 3 will be evaluated as (1 + 2) + 3. On the next line we see the symbols * and /; since they appear on a line with a higher line number, they have higher precedence than + and -. Finally, on line three we have the ^ symbol. It has the highest precedence, but is right associative. Combining all the rules allows us to parse 1 + 2 * 3 / 4 ^ 2 ^ 3 as 1 + ((2 * 3) / (4 ^ (2 ^ 3))).

GLR Parsing

SmaCC allows you to parse ambiguous grammars using a GLR parser. The %glr; directive changes the type of parser that SmaCC generates (you can also used %parser glr;). Instead of your generated parser being a subclass of SmaCCParser, when you use the %glr; directive, your parser will be a subclass of SmaCCGLRParser.

If you parse a string that has multiple representations, SmaCC will throw a SmaCCAmbiguousResultNotification exception, which can be handled by user code. This exception has the potential parses. The value with which it is resumed with will be selected as the definitive parse value. If the exception is not handled, then SmaCC will pick one as the definitive parse value.

To handle all the ambiguities of a program, the GLR parser performs multiple parses in parallel. No individual parse backtracks, but when switching between parses the scanner may backtrack. To support this, the scanner implements the methods currentState (which reifies the scanner's state into an object) and restoreState: (which expects as its parameter an object produced by currentState). If you have added instance variables to your scanner, then you will need to override these two methods to save and restore your instance variables. The same is true for the parser: you can save and restore its state using the duplicateState and restoreState: methods respectively. Be sure to override those methods if you have special instance variables in your parser.

If you have overlapping tokens, and have overridden the method tryAllTokens to return true, then in a GLR parser, SmaCC will try to perform a separate parse with each possible interpretation of the token. In this case, SmaCC may defer a parser action until it has decided which interpretation to pursue. Normally, this deferral will not be noticeable, but if the parser actions affect the scanner state, the scanner's behaviour will be changed. This is likely to happen if your scanner has multiple states.