Python Analytics Engine: Log Processor and Dashboard Generator
This challenge involves building a Python-based analytics engine capable of processing raw log data, extracting meaningful insights, and presenting these insights in a digestible format. This is a fundamental skill for understanding application behavior, identifying trends, and troubleshooting issues in real-world systems.
Problem Description
You are tasked with creating a Python program that acts as an analytics engine. This engine will ingest a list of log entries, each representing an event with associated data like timestamp, event type, user ID, and potentially other payload information. Your program should be able to perform several analytical tasks on this data.
Key Requirements:
- Log Parsing: The engine must be able to parse raw log entries. Assume log entries are strings, but you'll need to define a consistent format for them. For this challenge, we'll assume each log entry is a comma-separated string in the format `timestamp,event_type,user_id,payload`. The `payload` can be a JSON string or an empty string.
- Data Aggregation:
- Count the total number of log entries.
- Count the occurrences of each distinct `event_type`.
- Calculate the number of unique users who generated events.
- For specific event types (e.g., "purchase"), aggregate related data from the `payload` (e.g., the sum of `amount`).
- Time-Based Analysis (Optional but Recommended): Group events by hour or day to understand activity patterns.
- Output Generation: Present the aggregated analytics in a clear, structured format. This could be a dictionary, a formatted string, or a list of summary statistics.
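A minimal parsing helper along these lines might look as follows (the function name and the shape of the returned dict are illustrative choices, not mandated by the prompt):

```python
import json

def parse_log_entry(entry):
    # Split into exactly four fields. The payload itself may contain commas
    # (e.g. inside its JSON), so limit the split to the first three commas.
    parts = entry.split(",", 3)
    if len(parts) != 4:
        return None  # malformed entry: wrong number of fields
    timestamp, event_type, user_id, payload = parts
    try:
        data = json.loads(payload) if payload else {}
    except json.JSONDecodeError:
        data = {}  # treat invalid JSON as an empty payload
    return {
        "timestamp": timestamp,
        "event_type": event_type,
        "user_id": user_id,
        "payload": data,
    }
```

Limiting the split to three commas is the key detail here; a plain `split(",")` would break any payload containing commas.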
Expected Behavior:
The program should accept a list of log strings as input and return a dictionary containing the computed analytics.
Edge Cases to Consider:
- Malformed log entries (e.g., missing commas, incorrect number of fields). Your parser should handle these gracefully, perhaps by skipping them or logging an error.
- Empty `payload` for some entries.
- `payload` data that is not valid JSON.
- Empty input list of logs.
Examples
Example 1:
Input:
logs = [
"2023-10-27T10:00:00Z,login,user123,{}",
"2023-10-27T10:05:00Z,view_page,user456,{\"page_id\": \"home\"}",
"2023-10-27T10:10:00Z,login,user789,{}",
"2023-10-27T10:15:00Z,purchase,user123,{\"item_id\": \"A\", \"amount\": 10.50}",
"2023-10-27T10:20:00Z,view_page,user123,{\"page_id\": \"products\"}",
"2023-10-27T10:25:00Z,purchase,user456,{\"item_id\": \"B\", \"amount\": 25.00}",
"2023-10-27T11:00:00Z,login,user123,{}",
]
Output:
{
"total_logs": 7,
"event_counts": {
"login": 3,
"view_page": 2,
"purchase": 2
},
"unique_users": 3,
"purchase_summary": {
"total_purchases": 2,
"total_revenue": 35.50
}
}
Explanation: The output shows the total number of logs processed, the count for each event type, the total number of distinct users, and a summary of purchase events including the total number of purchases and the sum of their amounts.
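The optional time-based analysis can be sketched by bucketing the ISO 8601 timestamps on their hour prefix; the helper below is one hypothetical approach (field layout assumed as defined above):

```python
from collections import Counter

def events_per_hour(logs):
    # "2023-10-27T10:05:00Z"[:13] -> "2023-10-27T10", a per-hour bucket key.
    hourly = Counter()
    for entry in logs:
        parts = entry.split(",", 3)  # payload may contain commas
        if len(parts) != 4:
            continue  # skip malformed entries
        hourly[parts[0][:13]] += 1
    return dict(hourly)
```

Applied to Example 1, this would report six events in the `2023-10-27T10` bucket and one in `2023-10-27T11`.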
Example 2:
Input:
logs = [
"2023-10-27T09:00:00Z,logout,user123,{}",
"invalid_log_entry", # Malformed entry
"2023-10-27T09:05:00Z,login,user456,{}",
]
Output:
{
"total_logs": 3, # Includes the invalid entry in total for simplicity of this example, but ideally it should be excluded from calculations or handled separately. For this challenge, let's count all entries *attempted*.
"event_counts": {
"logout": 1,
"login": 1
},
"unique_users": 2,
"purchase_summary": { # No purchase events, so empty or default.
"total_purchases": 0,
"total_revenue": 0.00
}
}
Explanation: The invalid log entry is present in the total count, but doesn't contribute to event counts or user uniqueness.
Example 3: Empty Input
Input:
logs = []
Output:
{
"total_logs": 0,
"event_counts": {},
"unique_users": 0,
"purchase_summary": {
"total_purchases": 0,
"total_revenue": 0.00
}
}
Explanation: An empty input list results in zero counts for all metrics.
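Putting the requirements together, one possible end-to-end sketch (function and key names chosen to match the example outputs) is:

```python
import json
from collections import Counter

def analyze_logs(logs):
    event_counts = Counter()
    users = set()
    total_purchases = 0
    total_revenue = 0.0
    for entry in logs:
        parts = entry.split(",", 3)  # payload may itself contain commas
        if len(parts) != 4:
            continue  # skip malformed entries; they still count in total_logs
        _, event_type, user_id, payload = parts
        event_counts[event_type] += 1
        users.add(user_id)
        if event_type == "purchase":
            try:
                data = json.loads(payload) if payload else {}
            except json.JSONDecodeError:
                data = {}
            total_purchases += 1
            total_revenue += float(data.get("amount", 0))
    return {
        "total_logs": len(logs),  # every attempted entry, valid or not
        "event_counts": dict(event_counts),
        "unique_users": len(users),
        "purchase_summary": {
            "total_purchases": total_purchases,
            "total_revenue": total_revenue,
        },
    }
```

This runs in O(N) over the log list, satisfying the efficiency constraint, and reproduces the outputs shown in the three examples.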
Constraints
- The input `logs` will be a list of strings.
- Each log string, if valid, will adhere to the format `timestamp,event_type,user_id,payload`.
- Timestamps will be in ISO 8601 format (e.g., `2023-10-27T10:00:00Z`).
- `event_type` and `user_id` will be strings.
- `payload` will be a JSON string (e.g., `{}`) or an empty string.
- The `purchase` event's payload will contain an `amount` key with a numeric value (float or int).
- The total number of log entries will not exceed 1,000,000.
- The number of unique users will not exceed 100,000.
- Your solution should process logs efficiently, ideally in time linear in the number of log entries (O(N)).
Notes
- You will need to import Python's `json` module to parse the `payload` strings.
- Consider using `collections.defaultdict` or `collections.Counter` for easier aggregation.
- Decide on a clear strategy for handling malformed log entries. For this challenge, it's acceptable to skip them from further analysis (and optionally note them), but they should still count towards `total_logs`.
- `purchase_summary` is one specific example of payload analysis; you can extend the concept to other event types if you wish, but it's not strictly required by the prompt.
- The focus is on building a robust and extensible analytics processing function.