Hone logo
Hone
Problems

Building a Parsing Expression Grammar (PEG) Parser in Python

This challenge asks you to implement a parser for a simplified Parsing Expression Grammar (PEG) in Python. PEGs are a powerful formalism for defining grammars and are often used for tasks like language parsing, configuration file processing, and data serialization. Understanding how to build a PEG parser will equip you with a fundamental tool for many programming tasks.

Problem Description

Your goal is to create a Python class, let's call it PEGParser, that can parse input strings according to a grammar defined using a set of simple PEG rules. The parser should be able to handle basic PEG constructs and return a structured representation of the parsed input, or indicate a parsing failure.

Key Requirements:

  1. Rule Definitions: The parser should accept a dictionary where keys are rule names (strings) and values are the PEG expressions representing those rules.
  2. PEG Expression Syntax: Support the following PEG primitives:
    • Literal String ("abc"): Matches the exact string.
    • Character Class ([a-z] or [abc]): Matches any single character within the class.
    • Any Character (.): Matches any single character (except newline by default).
    • Sequence (expr1 expr2): Matches expr1 followed by expr2.
    • Ordered Choice (expr1 / expr2): Matches expr1 if it succeeds, otherwise tries expr2.
    • Zero or More (expr*): Matches expr zero or more times.
    • One or More (expr+): Matches expr one or more times.
    • Optional (expr?): Matches expr zero or one time.
    • Rule Reference (rulename): Matches the result of another defined rule.
  3. Parsing Function: Implement a parse(start_rule, input_string) method that attempts to parse the input_string starting from the start_rule.
  4. Output Structure:
    • On successful parsing, return a nested list or tuple structure representing the parsed input. For sequences, elements should be in order. For ordered choices, return the result of the successful branch. For repetitions (*, +), return a list of parsed sub-results. For character classes and literals, return the matched string.
    • On failure, raise a custom ParseError exception with information about the failure location and expected tokens.
  5. Memoization (Optional but Recommended): To improve performance for grammars with left recursion or repeated sub-problems, implement memoization.

Expected Behavior:

  • The parser should consume the input string from left to right.
  • If a rule or expression matches, the parser should advance its position in the input string.
  • If a rule or expression fails to match at the current position, the parser should backtrack (unless a successful match has already occurred in an ordered choice).
  • The parse method should return the complete parse tree if the start_rule successfully parses the entire input string. If the start_rule parses a prefix but leaves unconsumed input, it should be considered a failure.

Edge Cases:

  • Empty input string.
  • Grammar with no rules.
  • Grammar with undefined rule references.
  • Input string that doesn't match the grammar at all.
  • Input string that matches a prefix but not the whole string.
  • Grammars that might exhibit left recursion (though a simple PEG parser won't directly handle it as typically defined, the structure should be robust).

Examples

Example 1: Simple Arithmetic Expression

Let's define a grammar for very simple integer arithmetic: expr = term ('+' term)*, term = '1'.

Input Grammar:

grammar = {
    "expr": 'term ("+" term)*',
    "term": '"1"'
}

Input String: "1+1+1"

Output:

['1', ['+', '1'], ['+', '1']]

Explanation: The expr rule first matches "1" (as term). Then, it encounters "+", which matches. It then matches another "1" (as term). This process repeats for the remaining "+1" parts. The structure represents the sequence of terms and the optional + term repetitions.

Example 2: Basic List Parsing

Grammar for a list of identifiers separated by commas: list = ident ("," ident)*, ident = [a-z]+.

Input Grammar:

grammar = {
    "list": 'ident ("," ident)*',
    "ident": '[a-z]+'
}

Input String: "abc,def,ghi"

Output:

['abc', [',', 'def'], [',', 'ghi']]

Explanation: The list rule first matches "abc" as ident. Then it matches "," followed by "def" as the next ident. This repeats for "," and "ghi".

Example 3: Handling Choices and Optionality

Grammar for a simple command: command = "QUIT" / ("HELLO" name?), name = [A-Z]+.

Input Grammar:

grammar = {
    "command": '"QUIT" / ("HELLO" name?)',
    "name": '[A-Z]+'
}

Input String 1: "QUIT"

Output:

'QUIT'

Explanation: The command rule tries "QUIT" first, which matches.

Input String 2: "HELLO Alice"

Output:

['HELLO', 'ALICE']

Explanation: The command rule tries "QUIT" (fails). It then tries the second part of the choice: "HELLO" name?. It matches "HELLO", then it matches "ALICE" as name.

Input String 3: "HELLO"

Output:

['HELLO', None]

Explanation: The command rule tries "QUIT" (fails). It then tries "HELLO" name?. It matches "HELLO". The name? part is optional and doesn't match anything, so it results in None.

Constraints

  • The grammar definition will consist of up to 100 rules.
  • The input string length will be at most 1000 characters.
  • Rule names will be alphanumeric strings starting with a letter.
  • Literal strings will be enclosed in double quotes and can contain alphanumeric characters.
  • Character classes will be enclosed in square brackets [] and can contain ranges (e.g., a-z) or individual characters.
  • The parser should aim to complete parsing within a reasonable time for the given constraints, suggesting memoization is beneficial.

Notes

  • You will need to manage the current position within the input string carefully.
  • Consider how to represent the parsed output for each PEG primitive. For example, a sequence might be a list of its parsed components, a choice might be the result of the successful branch, and repetitions (*, +) might be lists of parsed sub-results. Literals and character classes can return the matched string.
  • Think about how to handle failures and report useful error messages.
  • For the ordered choice (/), remember that the first successful match terminates the choice.
  • For character classes, you'll need a way to check if a character belongs to the specified class.
  • This challenge focuses on the core PEG parsing logic. You don't need to implement a full-fledged lexer or abstract syntax tree (AST) builder, but your output structure should be useful for those next steps.
Loading editor...
python