Smacc: a Compiler-Compiler

SmaCC Scanner

Scanning takes an input stream of characters and converts that into a stream of tokens. The tokens are then passed on to the parsing phase.

The scanner is specified by a collection of token specifications. Each token is specified by:

TokenName    :    RegularExpression ;

TokenName is a valid variable name surrounded by <>. For example, <token> is a valid TokenName, but <token name> is not, as token name is not a valid variable name. The RegularExpression is a regular expression that matches a token. It should match one or more characters in the input stream. The colon character, :, is used to separate the TokenName and the RegularExpression, and the semicolon character, ;, is used to terminate the token specification.

Regular Expression Syntax

While the rules are specified as regular expressions, there are many different syntaxes for regular expressions. SmaCC uses a relatively simple syntax, which is specified below. If you wish to have a richer syntax, you can modify the scanner's parser: SmaCCDefinitionScanner and SmaCCDefinitionParser. These classes were created using SmaCC and can be studied.

Matches a special character. The character immediately following the backslash is matched exactly, unless it is a letter. Backslash-letter combinations have other meanings and are specified below.
Matches a control character. Control characters are the first 26 characters (e.g., \cA equals Character value: 0). The letter that follows the \c must be an uppercase letter.
Matches a digit, 0-9.
Matches anything that is not a digit.
Matches a form-feed character, Character value: 12.
Matches a newline character, Character value: 10.
Matches a carriage return character, Character value: 13.
Matches any whitespace character, [ \f\n\r\t\v].
Matches any non-whitespace character.
Matches a tab, Character value: 9.
Matches a vertical tab, Character value: 11.
Matches any letter, number or underscore, [A-Za-z0-9_].
Matches anything that is not a letter, number or underscore.
Matches a character specified by the hex number following the \x. The hex number must be at least one character long and no more than four characters for Unicode characters and two characters for non-Unicode characters. For example, \x20 matches the space character (Character value: 16r20), and \x1FFF matches Character value: 16r1FFF.
Copies the definition of <token> into the current regular expression. For example, if we have <hexdigit> : \d | [A-F] ;, we can use <hexdigit> in a later rule: <hexnumber> : <hexdigit> + ;. Note that you must define a token before you use it in another rule.
Copies the characters where Character>>isMethod returns true into the current regular expression. For example, instead of using \d, we could use <isDigit> since Character>>isDigit returns true for digits.
Matches one of the characters inside the [ ]. This is a shortcut for the | operator. In addition to single characters, you can also specify character ranges with the - character. For example, [a-z] matches any lower case letter.
Matches any character not listed in the characters block. [^a] matches anything except for a.
# comment
Creates a comment that is ignored by SmaCC. Everything from the # to the end of the line is ignored.
exp1 | exp2
Matches either exp1 or exp2.
exp1 exp2
Matches exp1 followed by exp2. \d \d matches two digits.
Matches exp zero or more times. 0* matches '' and 000.
Matches exp zero or one time. 0? matches only '' or 0.
Matches exp one or more times. 0+ matches 0 and 000, but not ''.
Matches exp at least min times but no more than max times. 0{1,2} matches only 0 or 00. It does not match '' or 000.
Groups exp for precedence. For example, (a b)* matches ababab. Without the parentheses, a b * would match abbbb but not ababab.

Since there are multiple ways to combine expressions, we need precedence rules for their combination. The or operator, |, has the lowest precedence and the *, ?, +, and {,} operators have the highest precedence. For example, a | b c * matches a or bcccc, but not accc or bcbcbc. If you wish to match a or b followed by any number of c's, you need to use (a | b) c *.

Whitespace is ignored in SmaCC regular expressions everywhere except within square brackets. This means that you can add spaces between terms to make your REs more readable. However, inside square brackets, spaces are significant, so don't add spaces there unless you mean to include space (or, with ^, to exclude space) from the set of allowable characters.

Overlapping Tokens

SmaCC can handle overlapping tokens without any problems. For example, the following is a legal SmaCC scanner definition:

<variable>	: [a-zA-Z] \w* ;
<any_character>	: . ;

This definition will match a variable or a single character. A variable can also be a single character [a-zA-Z], so the two tokens overlap. SmaCC handles overlapping tokens by preferring the longest matching token. If multiple token definitions match sequences of the same maximum length, first token specified by the grammar is chosen. For example, an a could be a <variable> or an <any_character> token, but since <variable> is specified first, SmaCC will prefer it. SmaCC associate automatically a numerical id with each token name; overlapping tokens are implemented as a list of ids, and the preferred id is the first one.

If you want the parser to attempt to parse will all the possible kinds of token, override the method SmaCCParser>>tryAllTokens in your parser to answer true instead of false. The effect of #tryAllTokens depends on the type of parser generated. If GLR, then the parser will fork on all the ids of the token. If non GLR (that is LR(1) or LALR(1)), the parser will try the other ids of the token if the first one triggers an error.

Token Action Methods

A Token Action Method is a hand-written method in your scanner whose name is the same as the name of a token, (for example, the method whitespace). For this reason, token action methods are sometimes also called "matching methods".

A token action method will be executed whenever a token with the corresponding name is recognized. We have already seen that the SmaCCScanner superclass has default implementations of methods whitespace and comment. These methods are executed whenever the tokens <whitespace> and <comment> are scanned. They ignore those tokens and record the comments ranges in the source text (which are made available inside SmaCC generated ASTs, see chapter ). If you want to store comments, then you should study the approach used to record comments in SmaCCScanner and SmaCCParser and eventually modify the SmaCCScanner>>comment method.

When implementing a Token Action Method, you can find the characters that comprise the token in the outputStream, an instance variable inherited from SmaCCScanner. Your method must answer a SmaCCToken. Here are two examples.

	"By default, eat the whitespace"

	self resetScanner.
	^ self scanForToken

This is the default action when spaces are scanned: the scanner is reset, and then used to scan for the token following the spaces. This following token is returned; as a consequence, the spaces are ignored.

	braceDepth := braceDepth + 1.
	^ self createTokenFor: '{'

This is the token action from a scanner that needs to keep track of the number of <leftBrace> tokens. After incrementing a counter, it returns the same token that would have been created if the there had been no token action.

Token Action Methods can also be used to handle overlapping token classes. For example, in the C grammar, a type definition is lexically identical to an identifier. The only way that they can be disambiguated is by looking up the name in the symbol table. In our example C scanner, we have an IDENTIFIER method that is used to determine whether the token is really an IDENTIFIER or whether it is a TYPE_NAME:

	| name |
	name := outputStream contents.
	matchActions := (typeNames includes: name)
		ifTrue: [ Array with: self TypeNameId ]
		ifFalse: [ Array with: self IDENTIFIERId ].
	outputStream reset.
	^ SmaCCToken value: name start: start ids: matchActions

In this example, #TypeNameId and #IDENTIFIERId are methods generated by SmaCC with the %id directive (see subsection).

Unreferenced Tokens

If a token is not referenced from a grammar specification, it will not be included in the generated scanner, unless the token's name is also a name of a method (see previous section). This, coupled with the ability to do substitutions, allows you to have the equivalent of macros within your scanner specification. However, be aware that if you are simply trying to generate a scanner, you will have to make sure that you create a dummy parser specification that references all of the tokens that you want in the final scanner.

Unicode Characters

SmaCC compiles the scanner into a bunch of conditional tests on characters. Normally, it assumes that characters have values between 0 and 255, and it can make some optimizations based on this fact. With the directive %unicode in the input, SmaCC will assume that characters have values between 0 and 65535. Unicode characters outside that range are not presently handled, and SmaCC is significantly slower with this option activated.