Lexical analysis
Description
The lexical analysis is the first phase in compiler design where the user input is scanned and converted into a sequence of tokens. The generated lexical tokens are then provided as input to the syntax analyzer.
I - State machine
A state machine is a graph composed of vertices connected by edges. The graph has a single starting vertex and one or more middle or final tokens. Landing on a final vertex means that a new token is created. Edges allow input characters to traverse the graph by assigning a letter label to each edge.
JSON structure
Global keys table
Key | Description | Value | Required |
---|---|---|---|
states | List of states/vertices of the state machine | Array of state objects | Yes |
transitions | List of transitions/edges of the state machine | Array of transition objects | Yes |
State object table
Key | Key description | Value | Value description | Required |
---|---|---|---|---|
type | Type of the state | Initial |
Initial state. Exactly one state must be marked as initial | Yes |
NORMAL |
Normal state. | |||
FINAL |
Final state. | |||
id | Unique id for each state | 0, 1, 2, ... | Counting starts from 0 and should be consecutive | Yes |
token | Name of the token created when landing on a final state | string (e.g. T_SEMICOLON ) |
Cannot be a reserved syntax keyword | Only for final states |
backtrack | Backtrack one character in the user input. This is required when the only possible way to know that the token value has ended is by reading a character that is not part of the token value. | false | Do not backtrack | Only for final states |
true | Backtrack |
Transition object table
Key | Key description | Value | Value description | Required |
---|---|---|---|---|
from | State id, from source | 0, 1, 2, ... | State id must exist and cannot correspond to a final state. | Yes |
to | State id, to destination | 0, 1, 2, ... | State id must exist and cannot correspond to the initial state. | Yes |
chars | The list of characters that will traverse this edge | Array of char values | Check table | Yes, cannot be empty. |
Char values table
Value | Description |
---|---|
Character (e.g. = ) |
A single character |
EOF |
End of file |
NEW_LINE |
New line: \n |
RETURN |
Return: \r |
SPACE |
Single whitespace |
TAB |
Single tab: \t |
LOWER_CASE_LETTER |
A letter from a to z |
UPPER_CASE_LETTER |
A letter from A to Z |
POSITIVE |
A positive digit from 1 to 9 |
OTHER |
Any other character. Every non-final state must have a transition on OTHER |
GUI to JSON
Introduction
A state machine can get large and it will become hard to keep track of states and transitions. To reduce the
complexity of building state machines, use the graph.html
tool located under tools/gui/
in order to
build the machine using a graphical user interface and then export it to JSON.
Note: The tool is written in TypeScript and needs to be compiled before using it.
Compile tool
cd tools/gui/ts/
tsc @compile
II - Lexical configuration
Lexical configuration allow manipulating and tagging generated tokens.
JSON structure
Global keys table
Key | Key description | Value | Value description | Required |
---|---|---|---|---|
newline | Line separator | CR | \r |
Yes |
LF | \n |
|||
CRLF | \r\n |
|||
ignore | List of lexical tokens to be ignored (e.g. code comments) | Ignore object | Check table | Yes |
error | List of lexical tokens resulting in lexical error | Error object | Check table | Yes |
reserved | Manipulate token name once created. This is useful when the user input has a reserved keyword and the state machine marked it as a more general name. | Reserved object | Check table | Yes |
Ignore object table
Key | Description | Value |
---|---|---|
prefix | Token name prefix | string |
suffix | Token name suffix | string |
include | List of all the token names not covered by the prefix or suffix |
Array of strings |
exclude | List of all the token names covered by the prefix or suffix but should
not be considered. |
Array of strings |
Error object table
Key | Description | Value |
---|---|---|
prefix | Token name prefix | string |
suffix | Token name suffix | string |
include | List of all the token names not covered by the prefix or suffix |
Array of strings |
exclude | List of all the token names covered by the prefix or suffix but should
not be considered. |
Array of strings |
Reserved object table
Key | Description | Value |
---|---|---|
Token name (e.g. T_IDENTIFIER ) |
The original token name assigned to the lexical token | Reserved token object |
Note: The above row can be applied several times in this object |
Reserved token object table
Key | Key description | Value | Value description |
---|---|---|---|
string (e.g. if ) |
Input match | string (e.g. T_IF ) |
New token name |
Note: The above row can be applied several times in this object |
III - Lexical error messages
Error messages reported by error tokens can be customized to provide meaningful information for the user.
JSON structure
Global key table
Key | Key description | Value | Value description | Required |
---|---|---|---|---|
default_message | A default message, used if no specific message is defined for the error token. | string | General error message. The string can contain optional placeholders. | Yes |
error_messages | Specific message for an error token | Object of error message | Check table | Yes |
Error message object table
Key | Description | Value |
---|---|---|
Token name (e.g. T_INVALID_CHAR ) |
Print error message when the error token is generated. | Specific error message. The string can contain optional placeholders |
Placeholders table
Placeholder | Description |
---|---|
${value} |
Input value of the lexical token. |
${column} |
Horizontal position of the lexical token. |
${line} |
Line number where the lexical token is found. |
${filename} |
Input file name. |