In this post I demonstrate how to take a character input stream and generate tokens. The generated tokens can then feed a later stage, such as a compiler for your own language or a CSV parser. My tokeniser simply serves as a starting point.
What is Tokenisation?
Simply put, tokenisation takes a character input stream and outputs tokens.
Tokens might include numbers, identifiers, special symbols, and other elements of a programming language. Essentially, a tokeniser groups characters into more meaningful constructs.
Assume a given character stream contains the characters
int x = 22
A tokeniser would recognise 'int' as an identifier (or word). The same can be said about 'x'. The '=' (Equals) symbol would be recognised as a special symbol. The digits '22' would be recognised as a number. In short, the tokenised output for the supplied text might be...
Identifier:Identifier:Special:Number
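Conceptually, the tokeniser turns that flat character sequence into a list of (text, kind) pairs. The tiny sketch below only illustrates this grouping; the variable names and the string kinds are illustrative, while the real project uses the enumeration defined later.
// Illustrative only: the character stream "int x = 22" grouped into tokens.
var tokens = new (string Text, string Kind)[]
{
    ("int", "Identifier"),
    ("x",   "Identifier"),
    ("=",   "Special"),
    ("22",  "Number"),
};
foreach (var (text, kind) in tokens)
    System.Console.WriteLine($"{text} : {kind}");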
Tokenisation Project
The project is a .NET 6.0 console application. The default namespace is Lex.V1.
Token Types
Token types (numbers, identifiers, special characters, etc.) are listed as an enumeration. This lets clients (e.g. a parser) switch on the token types they require; a brief example follows the enumeration.
namespace Lex.V1
{
    public enum TokenType
    {
        // Eof and Error sit below IsValidToken, so clients can test
        // "token > TokenType.IsValidToken" to detect real tokens.
        Eof,
        Error,
        IsValidToken,
        Comma,
        LeftParen,
        RightParen,
        LeftBrace,
        RightBrace,
        Colon,
        Semicolon,
        Equals,
        EqualsEquals,
        Plus,
        PlusEquals,
        PlusPlus,
        Identifier,
        Integer,
    }
}
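As a brief example of the switching mentioned above, here is a hypothetical client method that branches on TokenType using a switch expression. The TokenTypeDemo class and its category strings are mine, purely for illustration, and are not part of the project.
namespace Lex.V1
{
    // Hypothetical client: groups token types into human-readable categories.
    public static class TokenTypeDemo
    {
        public static string Describe(TokenType tt) => tt switch
        {
            TokenType.Eof => "end of input",
            TokenType.Error => "unrecognised character",
            TokenType.Identifier => "identifier",
            TokenType.Integer => "integer literal",
            TokenType.Plus or TokenType.PlusPlus or TokenType.PlusEquals => "plus operator",
            TokenType.Equals or TokenType.EqualsEquals => "assignment or equality operator",
            _ => "punctuation",
        };
    }
}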
The Tokeniser Class
using System.Text;
namespace Lex.V1
{
    /// <summary>
    /// Class to turn a character stream into tokens.
    /// </summary>
    public class Tokeniser
    {
        private const char Eof = char.MinValue;
        private readonly TextReader _reader;
        private readonly StringBuilder _text = new();
        private char _char;
        public UInt64 Integer { get; private set; }
        public string Text =>
            _text.ToString();
        /// <summary>
        /// Need a way to read characters from a text stream.
        /// The TextReader class seems suitable.
        /// </summary>
        /// <param name="reader"></param>
        public Tokeniser(TextReader reader)
        {
            _reader = reader;
            // Require lookahead of one character.
            _char = ReadChar();
        }
        /// <summary>
        /// Eat whitespace and return the next token.
        /// Return Eof if at end of the character stream.
        /// </summary>
        public TokenType Next()
        {
            // Ignore whitespace.
            if (char.IsWhiteSpace(_char))
                EatWhitespace();
            if (_char == Eof)
                return TokenType.Eof;
            // Clear text for each token run.
            _text.Clear();
            // Identifiers are most common, try them first.
            if (char.IsLetter(_char) || ('_' == _char))
                return BuildIdentifier();
            // Check if digit and parse a number.
            if (char.IsDigit(_char))
                return BuildNumber();
            // Default token type is an error.
            TokenType tt = TokenType.Error;
            // Generate token types from characters that are not identifiers or numbers.
            switch (_char)
            {
                case Eof: return TokenType.Eof;
                case '(': return Special(TokenType.LeftParen);
                case ')': return Special(TokenType.RightParen);
                case '{': return Special(TokenType.LeftBrace);
                case '}': return Special(TokenType.RightBrace);
                case ',': return Special(TokenType.Comma);
                case ';': return Special(TokenType.Semicolon);
                case ':': return Special(TokenType.Colon);
                case '+':
                    // Special() consumes '+' and advances, so _char now holds the
                    // following character, allowing "+=" and "++" to be detected.
                    tt = Special(TokenType.Plus);
                    if (_char == '=')
                        return Special(TokenType.PlusEquals);
                    if (_char == '+')
                        return Special(TokenType.PlusPlus);
                    return tt;
                case '=':
                    tt = Special(TokenType.Equals);
                    if (_char == '=')
                        return Special(TokenType.EqualsEquals);
                    return tt;
            }
            _text.Append(_char);
            return TokenType.Error;
        }
        private char ReadChar()
        {
            int result = _reader.Read();
            return (-1 == result) ? Eof : (char)result;
        }
        private void EatWhitespace()
        {
            while (char.IsWhiteSpace(_char))
                _char = ReadChar();
        }
        private TokenType BuildIdentifier()
        {
            while (char.IsLetter(_char) || char.IsDigit(_char) || ('_' == _char))
            {
                _text.Append(_char);
                _char = ReadChar();
            }
            return TokenType.Identifier;
        }
        /// <summary>
        /// Just supporting integers for now.
        /// </summary>
        private TokenType BuildNumber()
        {
            Integer = 0;
            while (char.IsDigit(_char))
            {
                // Record the literal text as well as the numeric value,
                // so Text is populated for Integer tokens too.
                _text.Append(_char);
                Integer *= 10;
                Integer += (UInt64)(_char - '0');
                _char = ReadChar();
            }
            return TokenType.Integer;
        }
        private TokenType Special(TokenType tt)
        {
            _text.Append(_char);
            _char = ReadChar();
            return tt;
        }
    }
}
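As a minimal usage sketch, the snippet below feeds the int x = 22 stream from the introduction through the class and prints each token's text and type. The TokeniserDemo wrapper is mine and is not part of the project; the test program in the next section does the same thing with a larger input.
using System;
using System.IO;
namespace Lex.V1
{
    public static class TokeniserDemo
    {
        public static void Run()
        {
            using StringReader reader = new("int x = 22");
            Tokeniser tokeniser = new(reader);
            // Keep reading until Eof or Error is returned.
            TokenType tt = tokeniser.Next();
            while (tt > TokenType.IsValidToken)
            {
                Console.WriteLine($"{tokeniser.Text} : {tt}");
                tt = tokeniser.Next();
            }
        }
    }
}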
Test Program
The test program contains sample text to tokenise.
namespace Lex.V1
{
    public sealed class Test
    {
        const string Text =
            "int Add(int v1, int v2)\n" +
            "{\n" +
            " return v1 + v2;\n" +
            "}\n" +
            "\n" +
            "int Main()\n" +
            "{\n" +
            " int x = 2000 + 46;\n" +
            " int y = 4096;\n" +
            " int z = 1;" +
            " z++;" +
            " return Add(x, y);\n" +
            "}";
        public static void Run()
        {
            // Load text and loop until error or end of stream.
            using StringReader sr = new(Text);
            Tokeniser tokeniser = new(sr);
            TokenType tt = tokeniser.Next();
            Console.WriteLine(Text);
            while (tt > TokenType.IsValidToken)
            {
                Console.WriteLine($"{tokeniser.Text} : {tt}");
                tt = tokeniser.Next();
            }
            if (tt == TokenType.Error)
                Console.WriteLine($"Error: {tokeniser.Text}");
        }
    }
}
Driver Program
The following code runs the tokeniser; it lives in Program.cs. With the latest C# compilers, top-level statements mean no explicit Program class or Main method is needed; they are implied. A rough hand-written equivalent is sketched after the snippet.
Lex.V1.Test.Run();
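For readers who prefer the explicit form, the single statement above corresponds roughly to the following hand-written equivalent (the compiler's actual generated class differs in detail):
// Rough hand-written equivalent of the top-level statement above.
internal static class Program
{
    private static void Main()
    {
        Lex.V1.Test.Run();
    }
}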
Sample Output
The screenshot shows the sample text followed by each token's text and type; click/select the image to see details.