Lexical analysis

Description

The lexical analysis is the first phase in compiler design where the user input is scanned and converted into a sequence of tokens. The generated lexical tokens are then provided as input to the syntax analyzer.

I - State machine

A state machine is a graph composed of vertices connected by edges. The graph has a single starting vertex and one or more middle or final tokens. Landing on a final vertex means that a new token is created. Edges allow input characters to traverse the graph by assigning a letter label to each edge.

JSON structure

Global keys table

Key	Description	Value	Required
states	List of states/vertices of the state machine	Array of state objects	Yes
transitions	List of transitions/edges of the state machine	Array of transition objects	Yes

State object table

Key	Key description	Value	Value description	Required
type	Type of the state	`Initial`	Initial state. Exactly one state must be marked as initial	Yes
		`NORMAL`	Normal state.
		`FINAL`	Final state.
id	Unique id for each state	0, 1, 2, ...	Counting starts from 0 and should be consecutive	Yes
token	Name of the token created when landing on a final state	string (e.g. `T_SEMICOLON`)	Cannot be a reserved syntax keyword	Only for final states
backtrack	Backtrack one character in the user input. This is required when the only possible way to know that the token value has ended is by reading a character that is not part of the token value.	false	Do not backtrack	Only for final states
backtrack		true	Backtrack	Only for final states

Transition object table

Key	Key description	Value	Value description	Required
from	State id, from source	0, 1, 2, ...	State id must exist and cannot correspond to a final state.	Yes
to	State id, to destination	0, 1, 2, ...	State id must exist and cannot correspond to the initial state.	Yes
chars	The list of characters that will traverse this edge	Array of char values	Check table	Yes, cannot be empty.

Char values table

Value	Description
Character (e.g. `=`)	A single character
`EOF`	End of file
`NEW_LINE`	New line: `\n`
`RETURN`	Return: `\r`
`SPACE`	Single whitespace
`TAB`	Single tab: `\t`
`LOWER_CASE_LETTER`	A letter from a to z
`UPPER_CASE_LETTER`	A letter from A to Z
`POSITIVE`	A positive digit from 1 to 9
`OTHER`	Any other character. Every non-final state must have a transition on `OTHER`

GUI to JSON

Introduction

A state machine can get large and it will become hard to keep track of states and transitions. To reduce the complexity of building state machines, use the graph.html tool located under tools/gui/ in order to build the machine using a graphical user interface and then export it to JSON.

Note: The tool is written in TypeScript and needs to be compiled before using it.

Compile tool

cd tools/gui/ts/
tsc @compile

II - Lexical configuration

Lexical configuration allow manipulating and tagging generated tokens.

JSON structure

Global keys table

Key	Key description	Value	Value description	Required
newline	Line separator	CR	`\r`	Yes
		LF	`\n`
		CRLF	`\r\n`
ignore	List of lexical tokens to be ignored (e.g. code comments)	Ignore object	Check table	Yes
error	List of lexical tokens resulting in lexical error	Error object	Check table	Yes
reserved	Manipulate token name once created. This is useful when the user input has a reserved keyword and the state machine marked it as a more general name.	Reserved object	Check table	Yes

Ignore object table

Key	Description	Value
prefix	Token name prefix	string
suffix	Token name suffix	string
include	List of all the token names not covered by the `prefix` or `suffix`	Array of strings
exclude	List of all the token names covered by the `prefix` or `suffix` but should not be considered.	Array of strings

Error object table

Key	Description	Value
prefix	Token name prefix	string
suffix	Token name suffix	string
include	List of all the token names not covered by the `prefix` or `suffix`	Array of strings
exclude	List of all the token names covered by the `prefix` or `suffix` but should not be considered.	Array of strings

Reserved object table

Key	Description	Value
Token name (e.g. `T_IDENTIFIER`)	The original token name assigned to the lexical token	Reserved token object
Note: The above row can be applied several times in this object

Reserved token object table

Key	Key description	Value	Value description
string (e.g. `if`)	Input match	string (e.g. `T_IF`)	New token name
Note: The above row can be applied several times in this object

III - Lexical error messages

Error messages reported by error tokens can be customized to provide meaningful information for the user.

JSON structure

Global key table

Key	Key description	Value	Value description	Required
default_message	A default message, used if no specific message is defined for the error token.	string	General error message. The string can contain optional placeholders.	Yes
error_messages	Specific message for an error token	Object of error message	Check table	Yes

Error message object table

Key	Description	Value
Token name (e.g. `T_INVALID_CHAR`)	Print error message when the error token is generated.	Specific error message. The string can contain optional placeholders

Placeholders table

Placeholder	Description
`${value}`	Input value of the lexical token.
`${column}`	Horizontal position of the lexical token.
`${line}`	Line number where the lexical token is found.
`${filename}`	Input file name.