In this post I demonstrate how to take a character input stream and generate tokens. The generated tokens can then feed a later stage, such as a compiler for your own language or a CSV parser. My tokeniser simply serves as a starting point.
What is Tokenisation?
Simply put, tokenisation takes a character input stream and outputs tokens.
Tokens might include numbers, identifiers, special symbols, and other elements of a programming language. Essentially, a tokeniser groups characters into more meaningful constructs.
Assume a given character stream contains the characters
int x = 22
A tokeniser would recognise 'int' as an identifier (or word). The same can be said about 'x'. The '=' (Equals) symbol would be recognised as a special symbol. The digits '22' would be recognised as a number. In short, the tokenised output for the supplied text might be...
Identifier:Identifier:Special:Number
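Conceptually, the tokeniser turns that flat character sequence into a list of (text, kind) pairs. The tiny sketch below only illustrates this grouping; the variable names and the string kinds are illustrative, while the real project uses the enumeration defined later.
// Illustrative only: the character stream "int x = 22" grouped into tokens.
var tokens = new (string Text, string Kind)[]
{
    ("int", "Identifier"),
    ("x",   "Identifier"),
    ("=",   "Special"),
    ("22",  "Number"),
};
foreach (var (text, kind) in tokens)
    System.Console.WriteLine($"{text} : {kind}");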
Tokenisation Project
The project is a .NET 6.0 console application. The default namespace is Lex.V1.
Token Types
Token types (numbers, identifiers, special characters, etc.) are listed as an enumeration. This lets clients (e.g. a parser) switch on the token types they require; a brief example follows the enumeration.
namespace Lex.V1
{
    public enum TokenType
    {
        // Eof and Error sit below IsValidToken, so clients can test
        // "token > TokenType.IsValidToken" to detect real tokens.
        Eof,
        Error,
        IsValidToken,
        Comma,
        LeftParen,
        RightParen,
        LeftBrace,
        RightBrace,
        Colon,
        Semicolon,
        Equals,
        EqualsEquals,
        Plus,
        PlusEquals,
        PlusPlus,
        Identifier,
        Integer,
    }
}
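As a brief example of the switching mentioned above, here is a hypothetical client method that branches on TokenType using a switch expression. The TokenTypeDemo class and its category strings are mine, purely for illustration, and are not part of the project.
namespace Lex.V1
{
    // Hypothetical client: groups token types into human-readable categories.
    public static class TokenTypeDemo
    {
        public static string Describe(TokenType tt) => tt switch
        {
            TokenType.Eof => "end of input",
            TokenType.Error => "unrecognised character",
            TokenType.Identifier => "identifier",
            TokenType.Integer => "integer literal",
            TokenType.Plus or TokenType.PlusPlus or TokenType.PlusEquals => "plus operator",
            TokenType.Equals or TokenType.EqualsEquals => "assignment or equality operator",
            _ => "punctuation",
        };
    }
}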
The Tokeniser Class
using System.Text;
namespace Lex.V1
{
    /// <summary>
    /// Class to turn a character stream into tokens.
    /// </summary>
    public class Tokeniser
    {
        private const char Eof = char.MinValue;
        private readonly TextReader _reader;
        private readonly StringBuilder _text = new();
        private char _char;
        public UInt64 Integer { get; private set; }
        public string Text =>
            _text.ToString();
        /// <summary>
        /// Need a way to read characters from a text stream.
        /// The TextReader class seems suitable.
        /// </summary>
        /// <param name="reader"></param>
        public Tokeniser(TextReader reader)
        {
            _reader = reader;
            // Require lookahead of one character.
            _char = ReadChar();
        }
        /// <summary>
        /// Eat whitespace and return the next token.
        /// Return Eof if at end of the character stream.
        /// </summary>
        public TokenType Next()
        {
            // Ignore whitespace.
            if (char.IsWhiteSpace(_char))
                EatWhitespace();
            if (_char == Eof)
                return TokenType.Eof;
            // Clear text for each token run.
            _text.Clear();
            // Identifiers are most common, try them first.
            if (char.IsLetter(_char) || ('_' == _char))
                return BuildIdentifier();
            // Check if digit and parse a number.
            if (char.IsDigit(_char))
                return BuildNumber();
            // Default token type is an error.
            TokenType tt = TokenType.Error;
            // Generate token types from characters that are not identifiers or numbers.
            switch (_char)
            {
                case Eof: return TokenType.Eof;
                case '(': return Special(TokenType.LeftParen);
                case ')': return Special(TokenType.RightParen);
                case '{': return Special(TokenType.LeftBrace);
                case '}': return Special(TokenType.RightBrace);
                case ',': return Special(TokenType.Comma);
                case ';': return Special(TokenType.Semicolon);
                case ':': return Special(TokenType.Colon);
                case '+':
                    // Special() consumes '+' and advances, so _char now holds the
                    // following character, allowing "+=" and "++" to be detected.
                    tt = Special(TokenType.Plus);
                    if (_char == '=')
                        return Special(TokenType.PlusEquals);
                    if (_char == '+')
                        return Special(TokenType.PlusPlus);
                    return tt;
                case '=':
                    tt = Special(TokenType.Equals);
                    if (_char == '=')
                        return Special(TokenType.EqualsEquals);
                    return tt;
            }
            _text.Append(_char);
            return TokenType.Error;
        }
        private char ReadChar()
        {
            int result = _reader.Read();
            return (-1 == result) ? Eof : (char)result;
        }
        private void EatWhitespace()
        {
            while (char.IsWhiteSpace(_char))
                _char = ReadChar();
        }
        private TokenType BuildIdentifier()
        {
            while (char.IsLetter(_char) || char.IsDigit(_char) || ('_' == _char))
            {
                _text.Append(_char);
                _char = ReadChar();
            }
            return TokenType.Identifier;
        }
        /// <summary>
        /// Just supporting integers for now.
        /// </summary>
        private TokenType BuildNumber()
        {
            Integer = 0;
            while (char.IsDigit(_char))
            {
                // Record the literal text as well as the numeric value,
                // so Text is populated for Integer tokens too.
                _text.Append(_char);
                Integer *= 10;
                Integer += (UInt64)(_char - '0');
                _char = ReadChar();
            }
            return TokenType.Integer;
        }
        private TokenType Special(TokenType tt)
        {
            _text.Append(_char);
            _char = ReadChar();
            return tt;
        }
    }
}
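As a minimal usage sketch, the snippet below feeds the int x = 22 stream from the introduction through the class and prints each token's text and type. The TokeniserDemo wrapper is mine and is not part of the project; the test program in the next section does the same thing with a larger input.
using System;
using System.IO;
namespace Lex.V1
{
    public static class TokeniserDemo
    {
        public static void Run()
        {
            using StringReader reader = new("int x = 22");
            Tokeniser tokeniser = new(reader);
            // Keep reading until Eof or Error is returned.
            TokenType tt = tokeniser.Next();
            while (tt > TokenType.IsValidToken)
            {
                Console.WriteLine($"{tokeniser.Text} : {tt}");
                tt = tokeniser.Next();
            }
        }
    }
}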
Test Program
The test program contains sample text to tokenise.
namespace Lex.V1
{
    public sealed class Test
    {
        const string Text =
            "int Add(int v1, int v2)\n" +
            "{\n" +
            " return v1 + v2;\n" +
            "}\n" +
            "\n" +
            "int Main()\n" +
            "{\n" +
            " int x = 2000 + 46;\n" +
            " int y = 4096;\n" +
            " int z = 1;" +
            " z++;" +
            " return Add(x, y);\n" +
            "}";
        public static void Run()
        {
            // Load text and loop until error or end of stream.
            using StringReader sr = new(Text);
            Tokeniser tokeniser = new(sr);
            TokenType tt = tokeniser.Next();
            Console.WriteLine(Text);
            while (tt > TokenType.IsValidToken)
            {
                Console.WriteLine($"{tokeniser.Text} : {tt}");
                tt = tokeniser.Next();
            }
            if (tt == TokenType.Error)
                Console.WriteLine($"Error: {tokeniser.Text}");
        }
    }
}
Driver Program
The following code runs the tokeniser; it lives in Program.cs. With the latest C# compilers, top-level statements mean no explicit Program class or Main method is needed; they are implied. A rough hand-written equivalent is sketched after the snippet.
Lex.V1.Test.Run();
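For readers who prefer the explicit form, the single statement above corresponds roughly to the following hand-written equivalent (the compiler's actual generated class differs in detail):
// Rough hand-written equivalent of the top-level statement above.
internal static class Program
{
    private static void Main()
    {
        Lex.V1.Test.Run();
    }
}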
Sample Output
The screenshot shows the sample text followed by each token's text and type; click/select the image to see details.