Building a Parsing Expression Grammar (PEG) Parser in Python

This challenge asks you to implement a parser for a simplified Parsing Expression Grammar (PEG) in Python. PEGs are a powerful formalism for defining grammars and are often used for tasks like language parsing, configuration file processing, and data serialization. Understanding how to build a PEG parser will equip you with a fundamental tool for many programming tasks.

Problem Description

Your goal is to create a Python class, let's call it PEGParser, that can parse input strings according to a grammar defined using a set of simple PEG rules. The parser should be able to handle basic PEG constructs and return a structured representation of the parsed input, or indicate a parsing failure.

Key Requirements:

Rule Definitions: The parser should accept a dictionary where keys are rule names (strings) and values are the PEG expressions representing those rules.
PEG Expression Syntax: Support the following PEG primitives:
- Literal String ("abc"): Matches the exact string.
- Character Class ([a-z] or [abc]): Matches any single character within the class.
- Any Character (.): Matches any single character (except newline by default).
- Sequence (expr1 expr2): Matches expr1 followed by expr2.
- Ordered Choice (expr1 / expr2): Matches expr1 if it succeeds, otherwise tries expr2.
- Zero or More (expr*): Matches expr zero or more times.
- One or More (expr+): Matches expr one or more times.
- Optional (expr?): Matches expr zero or one time.
- Rule Reference (rulename): Matches the result of another defined rule.
Parsing Function: Implement a parse(start_rule, input_string) method that attempts to parse the input_string starting from the start_rule.
Output Structure:
- On successful parsing, return a nested list or tuple structure representing the parsed input. For sequences, elements should be in order. For ordered choices, return the result of the successful branch. For repetitions (*, +), return a list of parsed sub-results. For character classes and literals, return the matched string.
- On failure, raise a custom ParseError exception with information about the failure location and expected tokens.
Memoization (Optional but Recommended): To improve performance for grammars with left recursion or repeated sub-problems, implement memoization.

Expected Behavior:

The parser should consume the input string from left to right.
If a rule or expression matches, the parser should advance its position in the input string.
If a rule or expression fails to match at the current position, the parser should backtrack (unless a successful match has already occurred in an ordered choice).
The parse method should return the complete parse tree if the start_rule successfully parses the entire input string. If the start_rule parses a prefix but leaves unconsumed input, it should be considered a failure.

Edge Cases:

Empty input string.
Grammar with no rules.
Grammar with undefined rule references.
Input string that doesn't match the grammar at all.
Input string that matches a prefix but not the whole string.
Grammars that might exhibit left recursion (though a simple PEG parser won't directly handle it as typically defined, the structure should be robust).

Examples

Example 1: Simple Arithmetic Expression

Let's define a grammar for very simple integer arithmetic: expr = term ('+' term)*, term = '1'.

Input Grammar:

grammar = {
    "expr": 'term ("+" term)*',
    "term": '"1"'
}

Input String: "1+1+1"

Output:

['1', ['+', '1'], ['+', '1']]

Explanation: The expr rule first matches "1" (as term). Then, it encounters "+", which matches. It then matches another "1" (as term). This process repeats for the remaining "+1" parts. The structure represents the sequence of terms and the optional + term repetitions.

Example 2: Basic List Parsing

Grammar for a list of identifiers separated by commas: list = ident ("," ident)*, ident = [a-z]+.

Input Grammar:

grammar = {
    "list": 'ident ("," ident)*',
    "ident": '[a-z]+'
}

Input String: "abc,def,ghi"

Output:

['abc', [',', 'def'], [',', 'ghi']]

Explanation: The list rule first matches "abc" as ident. Then it matches "," followed by "def" as the next ident. This repeats for "," and "ghi".

Example 3: Handling Choices and Optionality

Grammar for a simple command: command = "QUIT" / ("HELLO" name?), name = [A-Z]+.

Input Grammar:

grammar = {
    "command": '"QUIT" / ("HELLO" name?)',
    "name": '[A-Z]+'
}

Input String 1: "QUIT"

Output:

'QUIT'

Explanation: The command rule tries "QUIT" first, which matches.

Input String 2: "HELLO Alice"

Output:

['HELLO', 'ALICE']

Explanation: The command rule tries "QUIT" (fails). It then tries the second part of the choice: "HELLO" name?. It matches "HELLO", then it matches "ALICE" as name.

Input String 3: "HELLO"

Output:

['HELLO', None]

Explanation: The command rule tries "QUIT" (fails). It then tries "HELLO" name?. It matches "HELLO". The name? part is optional and doesn't match anything, so it results in None.

Constraints

The grammar definition will consist of up to 100 rules.
The input string length will be at most 1000 characters.
Rule names will be alphanumeric strings starting with a letter.
Literal strings will be enclosed in double quotes and can contain alphanumeric characters.
Character classes will be enclosed in square brackets [] and can contain ranges (e.g., a-z) or individual characters.
The parser should aim to complete parsing within a reasonable time for the given constraints, suggesting memoization is beneficial.

Notes

You will need to manage the current position within the input string carefully.
Consider how to represent the parsed output for each PEG primitive. For example, a sequence might be a list of its parsed components, a choice might be the result of the successful branch, and repetitions (*, +) might be lists of parsed sub-results. Literals and character classes can return the matched string.
Think about how to handle failures and report useful error messages.
For the ordered choice (/), remember that the first successful match terminates the choice.
For character classes, you'll need a way to check if a character belongs to the specified class.
This challenge focuses on the core PEG parsing logic. You don't need to implement a full-fledged lexer or abstract syntax tree (AST) builder, but your output structure should be useful for those next steps.

Building a Parsing Expression Grammar (PEG) Parser in Python

Problem Description

Key Requirements:

Rule Definitions: The parser should accept a dictionary where keys are rule names (strings) and values are the PEG expressions representing those rules.
PEG Expression Syntax: Support the following PEG primitives:
- Literal String ("abc"): Matches the exact string.
- Character Class ([a-z] or [abc]): Matches any single character within the class.
- Any Character (.): Matches any single character (except newline by default).
- Sequence (expr1 expr2): Matches expr1 followed by expr2.
- Ordered Choice (expr1 / expr2): Matches expr1 if it succeeds, otherwise tries expr2.
- Zero or More (expr*): Matches expr zero or more times.
- One or More (expr+): Matches expr one or more times.
- Optional (expr?): Matches expr zero or one time.
- Rule Reference (rulename): Matches the result of another defined rule.
Parsing Function: Implement a parse(start_rule, input_string) method that attempts to parse the input_string starting from the start_rule.
Output Structure:
- On successful parsing, return a nested list or tuple structure representing the parsed input. For sequences, elements should be in order. For ordered choices, return the result of the successful branch. For repetitions (*, +), return a list of parsed sub-results. For character classes and literals, return the matched string.
- On failure, raise a custom ParseError exception with information about the failure location and expected tokens.
Memoization (Optional but Recommended): To improve performance for grammars with left recursion or repeated sub-problems, implement memoization.

Expected Behavior:

The parser should consume the input string from left to right.
If a rule or expression matches, the parser should advance its position in the input string.
If a rule or expression fails to match at the current position, the parser should backtrack (unless a successful match has already occurred in an ordered choice).
The parse method should return the complete parse tree if the start_rule successfully parses the entire input string. If the start_rule parses a prefix but leaves unconsumed input, it should be considered a failure.

Edge Cases:

Empty input string.
Grammar with no rules.
Grammar with undefined rule references.
Input string that doesn't match the grammar at all.
Input string that matches a prefix but not the whole string.
Grammars that might exhibit left recursion (though a simple PEG parser won't directly handle it as typically defined, the structure should be robust).

Examples

Example 1: Simple Arithmetic Expression

Let's define a grammar for very simple integer arithmetic: expr = term ('+' term)*, term = '1'.

Input Grammar:

grammar = {
    "expr": 'term ("+" term)*',
    "term": '"1"'
}

Input String: "1+1+1"

Output:

['1', ['+', '1'], ['+', '1']]

Example 2: Basic List Parsing

Grammar for a list of identifiers separated by commas: list = ident ("," ident)*, ident = [a-z]+.

Input Grammar:

grammar = {
    "list": 'ident ("," ident)*',
    "ident": '[a-z]+'
}

Input String: "abc,def,ghi"

Output:

['abc', [',', 'def'], [',', 'ghi']]

Explanation: The list rule first matches "abc" as ident. Then it matches "," followed by "def" as the next ident. This repeats for "," and "ghi".

Example 3: Handling Choices and Optionality

Grammar for a simple command: command = "QUIT" / ("HELLO" name?), name = [A-Z]+.

Input Grammar:

grammar = {
    "command": '"QUIT" / ("HELLO" name?)',
    "name": '[A-Z]+'
}

Input String 1: "QUIT"

Output:

'QUIT'

Explanation: The command rule tries "QUIT" first, which matches.

Input String 2: "HELLO Alice"

Output:

['HELLO', 'ALICE']

Explanation: The command rule tries "QUIT" (fails). It then tries the second part of the choice: "HELLO" name?. It matches "HELLO", then it matches "ALICE" as name.

Input String 3: "HELLO"

Output:

['HELLO', None]

Explanation: The command rule tries "QUIT" (fails). It then tries "HELLO" name?. It matches "HELLO". The name? part is optional and doesn't match anything, so it results in None.

Constraints

The grammar definition will consist of up to 100 rules.
The input string length will be at most 1000 characters.
Rule names will be alphanumeric strings starting with a letter.
Literal strings will be enclosed in double quotes and can contain alphanumeric characters.
Character classes will be enclosed in square brackets [] and can contain ranges (e.g., a-z) or individual characters.
The parser should aim to complete parsing within a reasonable time for the given constraints, suggesting memoization is beneficial.

Notes

You will need to manage the current position within the input string carefully.
Consider how to represent the parsed output for each PEG primitive. For example, a sequence might be a list of its parsed components, a choice might be the result of the successful branch, and repetitions (*, +) might be lists of parsed sub-results. Literals and character classes can return the matched string.
Think about how to handle failures and report useful error messages.
For the ordered choice (/), remember that the first successful match terminates the choice.
For character classes, you'll need a way to check if a character belongs to the specified class.
This challenge focuses on the core PEG parsing logic. You don't need to implement a full-fledged lexer or abstract syntax tree (AST) builder, but your output structure should be useful for those next steps.