
Python Analytics Engine: Log Processor and Dashboard Generator

This challenge involves building a Python-based analytics engine capable of processing raw log data, extracting meaningful insights, and presenting these insights in a digestible format. This is a fundamental skill for understanding application behavior, identifying trends, and troubleshooting issues in real-world systems.

Problem Description

You are tasked with creating a Python program that acts as an analytics engine. This engine will ingest a list of log entries, each representing an event with associated data like timestamp, event type, user ID, and potentially other payload information. Your program should be able to perform several analytical tasks on this data.

Key Requirements:

  1. Log Parsing: The engine must be able to parse raw log entries. For this challenge, each log entry is a comma-separated string in the format: timestamp,event_type,user_id,payload. The payload can be a JSON string or an empty string.
  2. Data Aggregation:
    • Count the total number of log entries.
    • Count the occurrences of each distinct event_type.
    • Calculate the number of unique users who generated events.
    • For specific event types (e.g., "purchase"), aggregate related data from the payload (e.g., sum of amount).
  3. Time-Based Analysis (Optional but Recommended): Group events by hour or day to understand activity patterns.
  4. Output Generation: Present the aggregated analytics in a clear, structured format. This could be a dictionary, a formatted string, or a list of summary statistics.
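As a starting point, a single entry in the comma-separated format above might be parsed like this (the field names and helper name are illustrative, not mandated by the prompt):

```python
import json

def parse_entry(line):
    """Split one log line into its four fields and decode the JSON payload.

    maxsplit=3 ensures commas inside the JSON payload are not split on.
    """
    timestamp, event_type, user_id, payload = line.split(",", 3)
    data = json.loads(payload) if payload else {}
    return {
        "timestamp": timestamp,
        "event_type": event_type,
        "user_id": user_id,
        "payload": data,
    }
```

The `maxsplit=3` argument matters: a payload such as `{"item_id": "A", "amount": 10.50}` contains commas, so a plain `split(",")` would break it apart.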

Expected Behavior:

The program should accept a list of log strings as input and return a dictionary containing the computed analytics.

Edge Cases to Consider:

  • Malformed log entries (e.g., missing commas, incorrect number of fields). Your parser should handle these gracefully, perhaps by skipping them or logging an error.
  • Empty payload for some entries.
  • payload data that is not valid JSON.
  • Empty input list of logs.
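One possible way to cover these edge cases is a defensive parser that returns None for entries it cannot split, and falls back to an empty payload when the JSON is invalid. This is a sketch of one acceptable strategy, not the only one:

```python
import json

def safe_parse(line):
    """Return (timestamp, event_type, user_id, payload_dict), or None if malformed."""
    parts = line.split(",", 3)
    if len(parts) != 4:
        return None  # missing fields: skip this entry
    timestamp, event_type, user_id, payload = parts
    try:
        data = json.loads(payload) if payload else {}
    except json.JSONDecodeError:
        data = {}  # invalid JSON: keep the entry but discard its payload
    return timestamp, event_type, user_id, data
```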

Examples

Example 1:

Input:

logs = [
    "2023-10-27T10:00:00Z,login,user123,{}",
    "2023-10-27T10:05:00Z,view_page,user456,{\"page_id\": \"home\"}",
    "2023-10-27T10:10:00Z,login,user789,{}",
    "2023-10-27T10:15:00Z,purchase,user123,{\"item_id\": \"A\", \"amount\": 10.50}",
    "2023-10-27T10:20:00Z,view_page,user123,{\"page_id\": \"products\"}",
    "2023-10-27T10:25:00Z,purchase,user456,{\"item_id\": \"B\", \"amount\": 25.00}",
    "2023-10-27T11:00:00Z,login,user123,{}",
]

Output:

{
    "total_logs": 7,
    "event_counts": {
        "login": 3,
        "view_page": 2,
        "purchase": 2
    },
    "unique_users": 3,
    "purchase_summary": {
        "total_purchases": 2,
        "total_revenue": 35.50
    }
}

Explanation: The output shows the total number of logs processed, the count for each event type, the total number of distinct users, and a summary of purchase events including the total number of purchases and the sum of their amounts.
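The optional time-based analysis (requirement 3) could be layered on the same logs. Since timestamps are ISO 8601, the first 13 characters (`YYYY-MM-DDTHH`) identify the hour bucket; the helper below is a sketch of that idea and is not part of the expected output:

```python
from collections import Counter

def events_per_hour(logs):
    """Count valid entries per hour using the ISO 8601 timestamp prefix (YYYY-MM-DDTHH)."""
    buckets = Counter()
    for line in logs:
        parts = line.split(",", 3)
        if len(parts) != 4:
            continue  # skip malformed entries
        buckets[parts[0][:13]] += 1
    return dict(buckets)
```

For the logs in Example 1 this would report six events in the 10:00 hour and one in the 11:00 hour.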

Example 2:

Input:

logs = [
    "2023-10-27T09:00:00Z,logout,user123,{}",
    "invalid_log_entry", # Malformed entry
    "2023-10-27T09:05:00Z,login,user456,{}",
]

Output:

{
    "total_logs": 3, # Includes the invalid entry in total for simplicity of this example, but ideally it should be excluded from calculations or handled separately. For this challenge, let's count all entries *attempted*.
    "event_counts": {
        "logout": 1,
        "login": 1
    },
    "unique_users": 2,
    "purchase_summary": { # No purchase events, so empty or default.
        "total_purchases": 0,
        "total_revenue": 0.00
    }
}

Explanation: The invalid log entry is present in the total count, but doesn't contribute to event counts or user uniqueness.

Example 3: Empty Input

Input:

logs = []

Output:

{
    "total_logs": 0,
    "event_counts": {},
    "unique_users": 0,
    "purchase_summary": {
        "total_purchases": 0,
        "total_revenue": 0.00
    }
}

Explanation: An empty input list results in zero counts for all metrics.

Constraints

  • The input logs will be a list of strings.
  • Each log string, if valid, will adhere to the format timestamp,event_type,user_id,payload.
  • Timestamps will be in ISO 8601 format (e.g., 2023-10-27T10:00:00Z).
  • event_type and user_id will be strings.
  • payload will be a JSON string (e.g. {}) or an empty string.
  • The purchase event's payload will contain an amount key with a numeric value (float or int).
  • The total number of log entries will not exceed 1,000,000.
  • The number of unique users will not exceed 100,000.
  • Your solution should aim to process logs efficiently, ideally with a time complexity that scales linearly with the number of log entries (O(N)).

Notes

  • You will need to import Python's json module to parse the payload strings.
  • Consider using collections.defaultdict or collections.Counter for easier aggregation.
  • Decide on a clear strategy for handling malformed log entries. For this challenge, it is acceptable to skip them during analysis (and optionally record them), but they should still count toward total_logs.
  • The purchase_summary is a specific example of payload analysis; you can extend this concept to other event types if you wish, but it's not strictly required by the prompt.
  • The focus is on building a robust and extensible analytics processing function.
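Putting the notes above together, one possible shape for the full function (a sketch under the stated assumptions, not the reference solution) is:

```python
import json
from collections import Counter

def analyze_logs(logs):
    """Compute the analytics dictionary described above in O(N) over the entries."""
    event_counts = Counter()
    users = set()
    total_purchases = 0
    total_revenue = 0.0
    for line in logs:
        parts = line.split(",", 3)
        if len(parts) != 4:
            continue  # malformed: contributes to total_logs only
        _, event_type, user_id, payload = parts
        event_counts[event_type] += 1
        users.add(user_id)
        if event_type == "purchase":
            try:
                data = json.loads(payload) if payload else {}
            except json.JSONDecodeError:
                data = {}
            total_purchases += 1
            total_revenue += data.get("amount", 0)
    return {
        "total_logs": len(logs),
        "event_counts": dict(event_counts),
        "unique_users": len(users),
        "purchase_summary": {
            "total_purchases": total_purchases,
            "total_revenue": total_revenue,
        },
    }
```

Running this on the logs from Example 1 reproduces the expected output, including a total revenue of 35.50.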