Lexical analysis

Description

The lexical analysis is the first phase in compiler design where the user input is scanned and converted into a sequence of tokens. The generated lexical tokens are then provided as input to the syntax analyzer.

I - State machine

A state machine is a graph composed of vertices connected by edges. The graph has a single starting vertex and one or more middle or final tokens. Landing on a final vertex means that a new token is created. Edges allow input characters to traverse the graph by assigning a letter label to each edge.

JSON structure

Global keys table

Key Description Value Required
states List of states/vertices of the state machine Array of state objects Yes
transitions List of transitions/edges of the state machine Array of transition objects Yes

State object table

Key Key description Value Value description Required
type Type of the state Initial Initial state. Exactly one state must be marked as initial Yes
NORMAL Normal state.
FINAL Final state.
id Unique id for each state 0, 1, 2, ... Counting starts from 0 and should be consecutive Yes
token Name of the token created when landing on a final state string (e.g. T_SEMICOLON) Cannot be a reserved syntax keyword Only for final states
backtrack Backtrack one character in the user input. This is required when the only possible way to know that the token value has ended is by reading a character that is not part of the token value. false Do not backtrack Only for final states
true Backtrack

Transition object table

Key Key description Value Value description Required
from State id, from source 0, 1, 2, ... State id must exist and cannot correspond to a final state. Yes
to State id, to destination 0, 1, 2, ... State id must exist and cannot correspond to the initial state. Yes
chars The list of characters that will traverse this edge Array of char values Check table Yes, cannot be empty.

Char values table

Value Description
Character (e.g. =) A single character
EOF End of file
NEW_LINE New line: \n
RETURN Return: \r
SPACE Single whitespace
TAB Single tab: \t
LOWER_CASE_LETTER A letter from a to z
UPPER_CASE_LETTER A letter from A to Z
POSITIVE A positive digit from 1 to 9
OTHER Any other character. Every non-final state must have a transition on OTHER

GUI to JSON

Introduction

A state machine can get large and it will become hard to keep track of states and transitions. To reduce the complexity of building state machines, use the graph.html tool located under tools/gui/ in order to build the machine using a graphical user interface and then export it to JSON.

Note: The tool is written in TypeScript and needs to be compiled before using it.

Compile tool

cd tools/gui/ts/
tsc @compile

II - Lexical configuration

Lexical configuration allow manipulating and tagging generated tokens.

JSON structure

Global keys table

Key Key description Value Value description Required
newline Line separator CR \r Yes
LF \n
CRLF \r\n
ignore List of lexical tokens to be ignored (e.g. code comments) Ignore object Check table Yes
error List of lexical tokens resulting in lexical error Error object Check table Yes
reserved Manipulate token name once created. This is useful when the user input has a reserved keyword and the state machine marked it as a more general name. Reserved object Check table Yes

Ignore object table

Key Description Value
prefix Token name prefix string
suffix Token name suffix string
include List of all the token names not covered by the prefix or suffix Array of strings
exclude List of all the token names covered by the prefix or suffix but should not be considered. Array of strings

Error object table

Key Description Value
prefix Token name prefix string
suffix Token name suffix string
include List of all the token names not covered by the prefix or suffix Array of strings
exclude List of all the token names covered by the prefix or suffix but should not be considered. Array of strings

Reserved object table

Key Description Value
Token name (e.g. T_IDENTIFIER) The original token name assigned to the lexical token Reserved token object
Note: The above row can be applied several times in this object

Reserved token object table

Key Key description Value Value description
string (e.g. if) Input match string (e.g. T_IF) New token name
Note: The above row can be applied several times in this object

III - Lexical error messages

Error messages reported by error tokens can be customized to provide meaningful information for the user.

JSON structure

Global key table

Key Key description Value Value description Required
default_message A default message, used if no specific message is defined for the error token. string General error message. The string can contain optional placeholders. Yes
error_messages Specific message for an error token Object of error message Check table Yes

Error message object table

Key Description Value
Token name (e.g. T_INVALID_CHAR) Print error message when the error token is generated. Specific error message. The string can contain optional placeholders

Placeholders table

Placeholder Description
${value} Input value of the lexical token.
${column} Horizontal position of the lexical token.
${line} Line number where the lexical token is found.
${filename} Input file name.