Simple Lexer Generator in JavaScript

This challenge asks you to implement a basic lexer generator in JavaScript. A lexer (or tokenizer) is the first phase of a compiler or interpreter, responsible for breaking down a stream of characters (source code) into a stream of tokens. This generator will take a set of regular expressions and their corresponding token types as input and produce a lexer function.

Problem Description

You need to create a function called lexerGenerator that accepts an array of token definitions. Each token definition is an object with two properties: regex (a regular expression string) and type (a string representing the token type). The lexerGenerator function should return a lexer function. The returned lexer function should take a string of source code as input and return an array of tokens. Each token in the array should be an object with type and value properties.

The lexer should iterate through the token definitions in the order they are provided. For each definition, it should attempt to match the regular expression against the beginning of the remaining source code. If a match is found, a token object should be created with the type from the definition and the value being the matched substring. The matched portion of the source code should then be consumed, and the process should repeat on whatever remains. If no token definition matches the remaining source code, the lexer should return null.
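One possible shape for the solution, shown here only as a hedged sketch (the variable names and helper structure are illustrative, not mandated by the problem): the regex strings are compiled to RegExp objects once, and the returned lexer repeatedly tries each compiled pattern against the start of the remaining input.

function lexerGenerator(tokenDefinitions) {
  // Compile every regex string exactly once, when the lexer is generated.
  const compiled = tokenDefinitions.map((def) => ({
    regex: new RegExp(def.regex),
    type: def.type,
  }));

  return function lexer(source) {
    const tokens = [];
    let remaining = source;
    while (remaining.length > 0) {
      let matched = false;
      for (const { regex, type } of compiled) {
        const m = regex.exec(remaining);
        // Accept only non-empty matches at the very start of the remaining
        // input; the index check guards definitions that omit a '^' anchor.
        if (m !== null && m.index === 0 && m[0].length > 0) {
          tokens.push({ type, value: m[0] });
          remaining = remaining.slice(m[0].length);
          matched = true;
          break;
        }
      }
      if (!matched) return null; // nothing matched the remaining input
    }
    return tokens; // an empty source string yields an empty array
  };
}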

Key Requirements:

  • The lexer should consume the input string as it finds tokens.
  • Token definitions should be processed in the order they are provided.
  • The lexer should return null if it cannot find a matching token.
  • The returned lexer function should be able to handle empty input strings (an empty input should produce an empty array of tokens).
  • The regular expressions should be compiled once when the lexer is generated, not on every call to the lexer function.

Expected Behavior:

The lexer function should return an array of token objects. Each token object should have a type and a value property. The type should correspond to the token type defined in the token definition, and the value should be the matched substring.
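
For instance, with the sketch above, a call might look like this (the single definition shown is illustrative):

const lex = lexerGenerator([{ regex: '^[a-zA-Z]+', type: 'IDENTIFIER' }]);
lex('foo');  // => [{ type: 'IDENTIFIER', value: 'foo' }]
lex('');     // => [] (empty input produces an empty token array)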

Examples

Example 1:

Input: [
  { regex: '^[a-zA-Z]+', type: 'IDENTIFIER' },
  { regex: '^\\d+', type: 'NUMBER' },
  { regex: '^\\s+', type: 'WHITESPACE' }
]
Source Code: "  hello 123 "
Output: [
  { type: 'WHITESPACE', value: '  ' },
  { type: 'IDENTIFIER', value: 'hello' },
  { type: 'NUMBER', value: '123' },
  { type: 'WHITESPACE', value: ' ' }
]
Explanation: The lexer first matches the whitespace, then the identifier, then the number, and finally another whitespace.

Example 2:

Input: [
  { regex: '^\\(', type: 'LPAREN' },
  { regex: '^\\)', type: 'RPAREN' },
  { regex: '^,', type: 'COMMA' },
  { regex: '^\\d+', type: 'NUMBER' }
]
Source Code: "(1,2)"
Output: [
  { type: 'LPAREN', value: '(' },
  { type: 'NUMBER', value: '1' },
  { type: 'COMMA', value: ',' },
  { type: 'NUMBER', value: '2' },
  { type: 'RPAREN', value: ')' }
]
Explanation: The lexer matches the input in sequence: the opening parenthesis, the number 1, the comma, the number 2, and the closing parenthesis.

Example 3: (Edge Case - No Match)

Input: [
  { regex: '^[a-zA-Z]+', type: 'IDENTIFIER' }
]
Source Code: "123"
Output: null
Explanation: The identifier regex doesn't match the input, so the lexer returns null.

Constraints

  • The regular expressions in the token definitions should be valid JavaScript regular expressions.
  • The source code input to the lexer function will be a string.
  • The token types will be strings.
  • The regex property must be a string representing a regular expression.
  • The type property must be a string.
  • The lexer should be reasonably efficient; avoid unnecessary computations. Regular expression compilation should happen only once.
  • The input array of token definitions will contain at least one element.

Notes

  • Consider using the RegExp constructor to create regular expression objects from the strings provided in the token definitions.
  • The exec() method of the RegExp object is useful for matching regular expressions against strings.
  • Think about how to handle the case where a token definition matches somewhere in the source code, but not at the start of the remaining string; such a match should not produce a token.
  • The order of token definitions is important. The lexer should try to match tokens in the order they are defined.
  • Whitespace handling is a common requirement in lexers. Consider how to handle whitespace tokens. You don't need to remove whitespace, just identify it as a token.
  • This is a simplified lexer generator. It does not handle error recovery or complex token precedence rules. The focus is on the core functionality of matching regular expressions and creating tokens.
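
As a small illustration of the exec() and anchoring notes above (not required by the problem, just one way to check matches): anchoring a pattern with '^' ensures exec() can only succeed at the start of the string, which is exactly the position a lexer cares about.

const re = new RegExp('^\\d+'); // '^' anchors the match to the start
let m = re.exec('123abc');
// m[0] === '123', m.index === 0: '123' would be consumed as a token
m = re.exec('abc123');
// m === null: the digits appear mid-string, so no token is produced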