Building a Parsing Expression Grammar (PEG) Parser in Python
This challenge asks you to implement a parser for a simplified Parsing Expression Grammar (PEG) in Python. PEGs are a powerful formalism for defining grammars and are often used for tasks like language parsing, configuration file processing, and data serialization. Understanding how to build a PEG parser will equip you with a fundamental tool for many programming tasks.
Problem Description
Your goal is to create a Python class, let's call it PEGParser, that can parse input strings according to a grammar defined using a set of simple PEG rules. The parser should be able to handle basic PEG constructs and return a structured representation of the parsed input, or indicate a parsing failure.
Key Requirements:
- Rule Definitions: The parser should accept a dictionary where keys are rule names (strings) and values are the PEG expressions representing those rules.
- PEG Expression Syntax: Support the following PEG primitives:
- Literal String (
"abc"): Matches the exact string. - Character Class (
[a-z]or[abc]): Matches any single character within the class. - Any Character (
.): Matches any single character (except newline by default). - Sequence (
expr1 expr2): Matchesexpr1followed byexpr2. - Ordered Choice (
expr1 / expr2): Matchesexpr1if it succeeds, otherwise triesexpr2. - Zero or More (
expr*): Matchesexprzero or more times. - One or More (
expr+): Matchesexprone or more times. - Optional (
expr?): Matchesexprzero or one time. - Rule Reference (
rulename): Matches the result of another defined rule.
- Literal String (
- Parsing Function: Implement a
parse(start_rule, input_string)method that attempts to parse theinput_stringstarting from thestart_rule. - Output Structure:
- On successful parsing, return a nested list or tuple structure representing the parsed input. For sequences, elements should be in order. For ordered choices, return the result of the successful branch. For repetitions (
*,+), return a list of parsed sub-results. For character classes and literals, return the matched string. - On failure, raise a custom
ParseErrorexception with information about the failure location and expected tokens.
- On successful parsing, return a nested list or tuple structure representing the parsed input. For sequences, elements should be in order. For ordered choices, return the result of the successful branch. For repetitions (
- Memoization (Optional but Recommended): To improve performance for grammars with left recursion or repeated sub-problems, implement memoization.
Expected Behavior:
- The parser should consume the input string from left to right.
- If a rule or expression matches, the parser should advance its position in the input string.
- If a rule or expression fails to match at the current position, the parser should backtrack (unless a successful match has already occurred in an ordered choice).
- The
parsemethod should return the complete parse tree if thestart_rulesuccessfully parses the entire input string. If thestart_ruleparses a prefix but leaves unconsumed input, it should be considered a failure.
Edge Cases:
- Empty input string.
- Grammar with no rules.
- Grammar with undefined rule references.
- Input string that doesn't match the grammar at all.
- Input string that matches a prefix but not the whole string.
- Grammars that might exhibit left recursion (though a simple PEG parser won't directly handle it as typically defined, the structure should be robust).
Examples
Example 1: Simple Arithmetic Expression
Let's define a grammar for very simple integer arithmetic: expr = term ('+' term)*, term = '1'.
Input Grammar:
grammar = {
"expr": 'term ("+" term)*',
"term": '"1"'
}
Input String: "1+1+1"
Output:
['1', ['+', '1'], ['+', '1']]
Explanation:
The expr rule first matches "1" (as term). Then, it encounters "+", which matches. It then matches another "1" (as term). This process repeats for the remaining "+1" parts. The structure represents the sequence of terms and the optional + term repetitions.
Example 2: Basic List Parsing
Grammar for a list of identifiers separated by commas: list = ident ("," ident)*, ident = [a-z]+.
Input Grammar:
grammar = {
"list": 'ident ("," ident)*',
"ident": '[a-z]+'
}
Input String: "abc,def,ghi"
Output:
['abc', [',', 'def'], [',', 'ghi']]
Explanation:
The list rule first matches "abc" as ident. Then it matches "," followed by "def" as the next ident. This repeats for "," and "ghi".
Example 3: Handling Choices and Optionality
Grammar for a simple command: command = "QUIT" / ("HELLO" name?), name = [A-Z]+.
Input Grammar:
grammar = {
"command": '"QUIT" / ("HELLO" name?)',
"name": '[A-Z]+'
}
Input String 1: "QUIT"
Output:
'QUIT'
Explanation: The command rule tries "QUIT" first, which matches.
Input String 2: "HELLO Alice"
Output:
['HELLO', 'ALICE']
Explanation: The command rule tries "QUIT" (fails). It then tries the second part of the choice: "HELLO" name?. It matches "HELLO", then it matches "ALICE" as name.
Input String 3: "HELLO"
Output:
['HELLO', None]
Explanation: The command rule tries "QUIT" (fails). It then tries "HELLO" name?. It matches "HELLO". The name? part is optional and doesn't match anything, so it results in None.
Constraints
- The grammar definition will consist of up to 100 rules.
- The input string length will be at most 1000 characters.
- Rule names will be alphanumeric strings starting with a letter.
- Literal strings will be enclosed in double quotes and can contain alphanumeric characters.
- Character classes will be enclosed in square brackets
[]and can contain ranges (e.g.,a-z) or individual characters. - The parser should aim to complete parsing within a reasonable time for the given constraints, suggesting memoization is beneficial.
Notes
- You will need to manage the current position within the input string carefully.
- Consider how to represent the parsed output for each PEG primitive. For example, a sequence might be a list of its parsed components, a choice might be the result of the successful branch, and repetitions (
*,+) might be lists of parsed sub-results. Literals and character classes can return the matched string. - Think about how to handle failures and report useful error messages.
- For the ordered choice (
/), remember that the first successful match terminates the choice. - For character classes, you'll need a way to check if a character belongs to the specified class.
- This challenge focuses on the core PEG parsing logic. You don't need to implement a full-fledged lexer or abstract syntax tree (AST) builder, but your output structure should be useful for those next steps.