Implement a Simple Lexer in Go

This challenge asks you to build a fundamental component of compilers and interpreters: a lexer (also known as a tokenizer). A lexer's primary role is to read a stream of characters representing source code and group them into meaningful units called tokens. These tokens are what later stages, such as a parser, work with instead of raw characters.

Problem Description

You need to implement a lexer in Go that processes a given string of source code and produces a slice of tokens. Each token should represent a specific type of element found in a simplified programming language, such as keywords, identifiers, operators, and literals.

Key Requirements:

  • Tokenization: The lexer must scan the input string character by character and identify distinct tokens.
  • Token Types: You should define a set of token types to represent different language constructs. At a minimum, consider:
    • Keywords: if, else, func, return, let
    • Identifiers: Variable names, function names (sequences of letters, digits, and underscores, starting with a letter or underscore).
    • Literals:
      • Integers: Sequences of digits.
      • Strings: Enclosed in double quotes (").
    • Operators: +, -, *, /, =, ==, !=, <, >, <=, >=
    • Punctuation: (, ), {, }, ,, ;
    • Whitespace: Spaces, tabs, newlines (these should generally be ignored or treated as separators).
    • End of File (EOF): A special token to signify the end of the input.
  • Token Structure: Each token should have at least two fields: Type (an enumeration or constant representing the token's kind) and Literal (the actual string value of the token). One possible shape is sketched after this list.
  • Error Handling: While not the primary focus, consider how to represent unrecognized characters or malformed tokens (e.g., an unterminated string). For this challenge, you can choose to either ignore them or represent them with a specific "ILLEGAL" token type.
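
One possible shape for the token data is sketched below. The names (TokenType, Token, and the constant names) follow the examples in this problem but are only illustrative; any equivalent definition is fine.

type TokenType int

const (
    ILLEGAL TokenType = iota // unrecognized character or malformed token
    EOF                      // end of input
    KEYWORD                  // if, else, func, return, let
    IDENTIFIER               // variable and function names
    INTEGER                  // sequences of digits
    STRING                   // double-quoted literals
    OPERATOR                 // + - * / = == != < > <= >=
    PUNCTUATION              // ( ) { } , ;
)

type Token struct {
    Type    TokenType
    Literal string
}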

Expected Behavior:

The lexer should consume the entire input string and return a slice containing all the identified tokens in the order they appear. Whitespace characters between tokens should not result in separate tokens but should act as delimiters.
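
As a rough, non-authoritative outline, the single-pass loop might look like the sketch below. It assumes the Token and TokenType definitions from the sketch above, uses a hypothetical lookupIdent helper (a keyword lookup sketched under Notes), and leaves the string, operator, and punctuation branches to you.

// Tokenize is a hypothetical entry point: it walks the input once,
// skipping whitespace and appending one Token per recognized lexeme.
func Tokenize(input string) []Token {
    var tokens []Token
    pos := 0
    for pos < len(input) {
        ch := input[pos]
        switch {
        case ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r':
            pos++ // whitespace only separates tokens; it is never emitted
        case isLetter(ch):
            start := pos
            for pos < len(input) && (isLetter(input[pos]) || isDigit(input[pos])) {
                pos++
            }
            word := input[start:pos]
            tokens = append(tokens, Token{Type: lookupIdent(word), Literal: word})
        case isDigit(ch):
            start := pos
            for pos < len(input) && isDigit(input[pos]) {
                pos++
            }
            tokens = append(tokens, Token{Type: INTEGER, Literal: input[start:pos]})
        default:
            // strings, operators, punctuation, and ILLEGAL handling go here
            pos++
        }
    }
    // append a trailing EOF token if your token set includes one
    return append(tokens, Token{Type: EOF, Literal: ""})
}

func isLetter(ch byte) bool { return 'a' <= ch && ch <= 'z' || 'A' <= ch && ch <= 'Z' || ch == '_' }
func isDigit(ch byte) bool  { return '0' <= ch && ch <= '9' }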

Edge Cases:

  • Empty input string.
  • Input containing only whitespace.
  • Input with consecutive operators or punctuation.
  • Unterminated string literals (one way to handle these is sketched after this list).
  • Keywords that are also valid identifiers (keywords should take precedence).
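
For example, one way to handle an unterminated string, assuming the Token definitions sketched earlier and an illustrative readString helper, is to scan for the closing quote and fall back to an ILLEGAL token when it never appears:

// readString is a hypothetical helper; pos points at the opening quote.
// It returns the token and the position just past the closing quote.
func readString(input string, pos int) (Token, int) {
    end := pos + 1
    for end < len(input) && input[end] != '"' {
        end++
    }
    if end == len(input) {
        // no closing quote: report the remainder as an ILLEGAL token
        return Token{Type: ILLEGAL, Literal: input[pos:]}, len(input)
    }
    // strip the surrounding quotes, matching Example 3 below
    return Token{Type: STRING, Literal: input[pos+1 : end]}, end + 1
}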

Examples

Example 1:

Input: let five = 5;
Output:
[
  {Type: KEYWORD, Literal: "let"},
  {Type: IDENTIFIER, Literal: "five"},
  {Type: OPERATOR, Literal: "="},
  {Type: INTEGER, Literal: "5"},
  {Type: PUNCTUATION, Literal: ";"}
]
Explanation: The input string is broken down into individual tokens. "let" is a keyword, "five" is an identifier, "=" is an operator, "5" is an integer literal, and ";" is punctuation. Whitespace is ignored.

Example 2:

Input:
func add(x, y) {
  return x + y;
}
Output:
[
  {Type: KEYWORD, Literal: "func"},
  {Type: IDENTIFIER, Literal: "add"},
  {Type: PUNCTUATION, Literal: "("},
  {Type: IDENTIFIER, Literal: "x"},
  {Type: PUNCTUATION, Literal: ","},
  {Type: IDENTIFIER, Literal: "y"},
  {Type: PUNCTUATION, Literal: ")"},
  {Type: PUNCTUATION, Literal: "{"},
  {Type: KEYWORD, Literal: "return"},
  {Type: IDENTIFIER, Literal: "x"},
  {Type: OPERATOR, Literal: "+"},
  {Type: IDENTIFIER, Literal: "y"},
  {Type: PUNCTUATION, Literal: ";"},
  {Type: PUNCTUATION, Literal: "}"}
]
Explanation: This example demonstrates how the lexer handles keywords, identifiers, operators, punctuation, and newlines (which act as separators and are not tokenized themselves).

Example 3:

Input: "hello world" == 10
Output:
[
  {Type: STRING, Literal: "hello world"},
  {Type: OPERATOR, Literal: "=="},
  {Type: INTEGER, Literal: "10"}
]
Explanation: The lexer correctly identifies a string literal, a two-character operator "==", and an integer literal.

Constraints

  • The input string will not exceed 1000 characters in length.
  • The input will be a valid Go string.
  • The lexer should process the input in a single pass. Efficiency matters: aim for O(n) time complexity, where n is the length of the input string.

Notes

  • You will need to define your token types. An iota-based enumeration in Go is a good way to manage these.
  • Consider using a map to quickly look up keywords.
  • Think about how to handle multi-character operators like ==, !=, <=, >= (a peek-ahead approach is sketched after this list).
  • Your lexer will likely need to keep track of its current position in the input string.
  • Start by implementing the core logic for identifying simple tokens and then gradually add support for more complex ones like strings and multi-character operators.
  • The primary goal is to correctly identify and categorize tokens. Reporting precise line/column numbers for errors is outside the scope of this basic challenge.
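
Putting the last few notes together, a keyword map and a one-byte peek for two-character operators might look like the sketch below (lookupIdent and readOperator are illustrative names, not required ones).

// keywords maps reserved words to true; anything not present is an
// ordinary identifier, so keywords take precedence automatically.
var keywords = map[string]bool{
    "if": true, "else": true, "func": true, "return": true, "let": true,
}

func lookupIdent(word string) TokenType {
    if keywords[word] {
        return KEYWORD
    }
    return IDENTIFIER
}

// readOperator peeks one byte ahead to prefer the two-character
// operators (==, !=, <=, >=) before emitting a single-character one.
func readOperator(input string, pos int) (Token, int) {
    if pos+1 < len(input) {
        switch input[pos : pos+2] {
        case "==", "!=", "<=", ">=":
            return Token{Type: OPERATOR, Literal: input[pos : pos+2]}, pos + 2
        }
    }
    return Token{Type: OPERATOR, Literal: string(input[pos])}, pos + 1
}

Checking the map only after reading a full identifier is what makes keywords take precedence over identifiers with the same spelling.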