Streaming SQL Query Results for Real-Time Processing
Modern applications often need to process large datasets in real time. Instead of waiting for an entire query result to materialize before processing, streaming lets you handle results incrementally as they become available. This challenge asks you to design a system that simulates streaming query results from a SQL database, so that each chunk of data can be processed as soon as it is produced.
Problem Description
You are tasked with designing a system that simulates streaming query results. The system should take a SQL query string as input and return a generator (or equivalent streaming mechanism in your chosen language) that yields rows of the query result one at a time, or in small batches. The system does not need to actually connect to a database; it should simulate the behavior of a database that would stream results. The focus is on the streaming logic, not database connectivity.
Key Requirements:
- Query Simulation: The system should be able to parse a simplified SQL query (SELECT statement only, no joins, aggregations, or complex clauses) and generate a sequence of rows that would be returned by that query. Assume the query is always valid and well-formed for this simulation.
- Streaming Output: The system must return a generator (or equivalent streaming construct) that yields rows as they are "produced" by the simulated query. Each row should be represented as a dictionary (or similar key-value structure) where keys are column names and values are the corresponding data.
- Batching (Optional): The system can optionally group rows into batches before yielding them. This can reduce per-row overhead, for example when the consumer processes or forwards rows in bulk.
- Error Handling (Simplified): For this challenge, assume the query is always valid. No error handling for invalid SQL is required.
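A minimal sketch of the row-at-a-time requirement, assuming a small hardcoded table in place of a database. The `USERS` data, the `stream_rows` name, and the `predicate` parameter are illustrative assumptions, not part of the challenge statement.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]

# Hypothetical sample data standing in for a "users" table (an assumption).
USERS: List[Row] = [
    {"id": 100, "name": "Zoe", "email": "zoe@example.com"},
    {"id": 101, "name": "Alice", "email": "alice@example.com"},
    {"id": 102, "name": "Bob", "email": "bob@example.com"},
]


def stream_rows(
    table: Iterable[Row],
    columns: List[str],
    predicate: Callable[[Row], bool] = lambda row: True,
) -> Iterator[Row]:
    """Yield matching rows one at a time, keeping only the requested columns."""
    for row in table:
        if predicate(row):
            # Each yielded row is a fresh dict keyed by column name.
            yield {column: row[column] for column in columns}


# Simulates: SELECT id, name FROM users WHERE id > 100
for row in stream_rows(USERS, ["id", "name"], lambda r: r["id"] > 100):
    print(row)
# prints {'id': 101, 'name': 'Alice'} then {'id': 102, 'name': 'Bob'}
```

Because `stream_rows` is a generator function, no row is examined until the consumer asks for it, which is the behavior the challenge wants to simulate.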
Expected Behavior:
The system should take a SQL query string and a batch size (optional) as input. It should then return a generator that yields rows from the simulated query result. The generator should produce rows until the simulated query is exhausted.
Edge Cases to Consider:
- Empty Result Set: The query might return no rows. The generator should yield nothing in this case.
- Single Row Result: The query might return only one row. The generator should yield that single row.
- Large Result Set: The query might return a very large number of rows. The generator should handle this efficiently without loading the entire result set into memory.
- Batch Size: If a batch size is specified, the generator should yield rows in batches of the specified size (or fewer for the last batch).
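The edge cases above can be checked against a small batching helper. This is one possible sketch built on `itertools.islice`; the name `in_batches` is hypothetical, and it assumes `batch_size` is a positive integer, as the constraints state.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream of rows into lists of at most batch_size items."""
    iterator = iter(rows)
    while True:
        # Pull at most batch_size rows; only one batch is held at a time.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted, nothing more to yield
        yield batch


print(list(in_batches([], 2)))        # empty result set: []
print(list(in_batches(["a"], 2)))     # single row: [['a']]
print(list(in_batches(range(5), 2)))  # short last batch: [[0, 1], [2, 3], [4]]
```

The helper never looks further ahead than one batch, so it works the same way on a very large or unbounded source.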
Examples
Example 1:
Input: query="SELECT id, name FROM users WHERE id > 100", batch_size=2
Output: (generator yielding dictionaries)
{'id': 101, 'name': 'Alice'}
{'id': 102, 'name': 'Bob'}
{'id': 103, 'name': 'Charlie'}
...
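Explanation: The query selects `id` and `name` from the `users` table where `id` is greater than 100. With `batch_size=2`, the generator yields the rows in batches of 2 (the last batch may be smaller); the output above lists the rows individually for readability.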
Example 2:
Input: query="SELECT product_id, price FROM products WHERE category = 'electronics'", batch_size=1
Output: (generator yielding dictionaries)
{'product_id': 1, 'price': 99.99}
{'product_id': 2, 'price': 199.99}
{'product_id': 3, 'price': 299.99}
...
Explanation: The query selects `product_id` and `price` from the `products` table where `category` is 'electronics'. The generator yields rows one at a time.
Example 3:
Input: query="SELECT city FROM cities WHERE population < 10000", batch_size=5
Output: (generator yielding dictionaries)
{'city': 'Springfield'}
{'city': 'Harmony'}
{'city': 'Willow Creek'}
...
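Explanation: The query selects `city` from the `cities` table where `population` is less than 10000. With `batch_size=5`, the generator yields the rows in batches of 5 (the last batch may be smaller); the output above lists the rows individually for readability.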
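The three example queries filter with `>`, `=`, and `<`. If you choose to interpret the `WHERE` clause rather than hardcode results, one possible sketch is to turn a condition string into a row predicate. The names `make_predicate` and `parse_literal` are hypothetical, and the sketch assumes a single condition with a one-character operator and a number or single-quoted string on the right-hand side.

```python
import operator
import re

# Only the operators that appear in the examples are supported (an assumption).
OPERATORS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

CONDITION_RE = re.compile(
    r"^\s*(?P<column>\w+)\s*(?P<op>[=<>])\s*(?P<value>.+?)\s*$"
)


def parse_literal(text):
    """Convert 'quoted text' to str, otherwise to int or float."""
    if len(text) >= 2 and text.startswith("'") and text.endswith("'"):
        return text[1:-1]
    try:
        return int(text)
    except ValueError:
        return float(text)  # raises ValueError if the text is not numeric


def make_predicate(condition):
    """Build a function that tests one row (a dict) against the condition."""
    match = CONDITION_RE.match(condition)
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    column = match.group("column")
    compare = OPERATORS[match.group("op")]
    value = parse_literal(match.group("value"))
    return lambda row: compare(row[column], value)


is_match = make_predicate("id > 100")
print(is_match({"id": 101}))  # True
print(is_match({"id": 100}))  # False
```

A predicate like this can be applied row by row inside the generator, so filtering does not require materializing the result set first.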
Constraints
- Query Complexity: The SQL query will be a simple `SELECT` statement with an optional `WHERE` clause. No joins, aggregations, or subqueries are allowed. The `WHERE` clause will contain only simple comparisons (`=`, `<`, `>`), as in the examples.
- Data Types: Assume all data values are strings or numbers.
- Batch Size: The `batch_size` parameter (if provided) will be a positive integer.
- Memory Usage: The system should not load the entire result set into memory. It should stream the results incrementally.
- Performance: The generator should yield rows efficiently. Avoid unnecessary computations or data structures.
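To illustrate the memory constraint, the sketch below simulates a very large result set by computing each row on demand instead of storing rows in a list. The function name `simulate_large_result` and the row shape are assumptions made for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterator


def simulate_large_result(total_rows: int) -> Iterator[Dict[str, Any]]:
    """Produce rows lazily; only the current row exists in memory."""
    for i in range(1, total_rows + 1):
        yield {"id": i, "name": f"user_{i}"}


# A billion-row "result set" costs nothing until rows are requested.
stream = simulate_large_result(10**9)

# Take just the first three rows; the rest are never generated.
first_three = list(islice(stream, 3))
print(first_three)
# [{'id': 1, 'name': 'user_1'}, {'id': 2, 'name': 'user_2'}, {'id': 3, 'name': 'user_3'}]
```

Calling `list(stream)` on the whole generator would defeat the purpose, since that materializes every row at once.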
Notes
- You do not need to implement a full SQL parser. You can hardcode a few sample datasets and queries for testing purposes. The focus is on the streaming logic.
- Consider using a generator (or equivalent streaming construct) to yield rows one at a time or in batches.
- Think about how to simulate the behavior of a database that would stream results. You can use a list or other data structure to represent the simulated result set.
- The query string is primarily for demonstration and guides the simulated data generation. You don't need to parse it fully; it is enough to extract the column names from it and use them as keys in the generated dictionaries.
- The goal is to demonstrate the concept of streaming query results, not to build a production-ready SQL engine.
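As the notes suggest, a full SQL parser is unnecessary. A regular expression is one lightweight way to pull the column names and table name out of a query shaped like the ones in the examples. The `parse_select` name and the pattern are illustrative assumptions and handle only the simple `SELECT ... FROM ...` form.

```python
import re
from typing import List, Tuple

# Matches "SELECT <columns> FROM <table>"; anything after the table is ignored.
SELECT_RE = re.compile(
    r"^\s*SELECT\s+(?P<columns>.+?)\s+FROM\s+(?P<table>\w+)",
    re.IGNORECASE | re.DOTALL,
)


def parse_select(query: str) -> Tuple[List[str], str]:
    """Return (column names, table name) for a simple SELECT query."""
    match = SELECT_RE.match(query)
    if match is None:
        raise ValueError(f"not a simple SELECT query: {query!r}")
    columns = [name.strip() for name in match.group("columns").split(",")]
    return columns, match.group("table")


print(parse_select("SELECT id, name FROM users WHERE id > 100"))
# (['id', 'name'], 'users')
```

The returned column names can serve directly as the keys of each yielded dictionary, and the table name can select which hardcoded dataset to stream from.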