Python Application Monitoring System
This challenge involves building a simplified monitoring system for a hypothetical Python application. You will create a mechanism to track key performance indicators (KPIs) of different application components and provide insights into their operational status. This is crucial for understanding application health, identifying bottlenecks, and ensuring a smooth user experience.
Problem Description
Your task is to develop a Python class, Monitor, that can record, aggregate, and report on various metrics from different parts of an application. The Monitor should be able to do the following (a minimal interface sketch appears after this list):
- Record Events: Log the occurrence of specific events with associated timestamps and optional metadata.
- Track Metrics: Maintain counts and durations for different types of operations or statuses.
- Aggregate Data: Provide methods to retrieve summarized information about recorded events and tracked metrics.
- Report Status: Generate a summary report indicating the health or status of monitored components based on predefined rules or thresholds.
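Taken together, these capabilities suggest an interface along the lines of the sketch below. This is only one possible shape, not a prescribed API: the method names match those used in the examples later in this document, while the min/max accessor names and parameter names are assumptions.

```python
class Monitor:
    """Central hub for recording, aggregating, and reporting application metrics."""

    def log_event(self, event_type, metadata=None):
        """Record an event with the current timestamp and an optional metadata dict."""

    def increment_metric(self, name):
        """Increase the counter for `name` by one."""

    def record_duration(self, name, seconds):
        """Append a duration sample (in seconds) to the series for `name`."""

    def get_metric_count(self, name):
        """Return the current counter value for `name`."""

    def get_average_duration(self, name):
        """Return the mean recorded duration for `name`, or None if no samples exist."""

    def get_min_duration(self, name):
        """Return the smallest recorded duration for `name` (name is an assumption)."""

    def get_max_duration(self, name):
        """Return the largest recorded duration for `name` (name is an assumption)."""

    def get_events_in_range(self, start, end=None):
        """Return events with start <= timestamp (and timestamp <= end, if given)."""

    def generate_report(self):
        """Return a dict summarizing events, error counts, latency, and status."""
```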
Key Requirements:
- Event Logging: Implement a method to log events. Each event should have a timestamp, an event_type (e.g., "request_processed", "error_occurred", "database_query"), and an optional metadata dictionary.
- Metric Tracking: Implement methods to (see the sketch after this requirements list):
  - Increment a counter for a given metric (e.g., increment_metric("successful_requests")).
  - Record the duration of an operation for a given metric (e.g., record_duration("api_latency", 0.15)).
- Data Aggregation:
  - Retrieve all logged events within a specified time range.
  - Get the current count for a given metric.
  - Get the average, minimum, and maximum duration for a given metric.
- Status Reporting: Implement a method to generate a status report. This report should include:
  - Total number of events logged.
  - Counts for specific critical event types (e.g., "error_occurred").
  - Average API latency.
  - A simple "status" (e.g., "OK", "WARNING", "ERROR") based on thresholds you define (e.g., if the error count exceeds a certain number, the status is "ERROR").
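As a hedged illustration of the metric-tracking and aggregation requirements above, the sketch below keeps counters in a collections.defaultdict and duration samples in per-metric lists. The attribute names (_counters, _durations) and the duration_stats helper are implementation choices, not part of the specification.

```python
from collections import defaultdict


class Monitor:
    def __init__(self):
        self._counters = defaultdict(int)    # metric name -> count
        self._durations = defaultdict(list)  # metric name -> [seconds, ...]

    def increment_metric(self, name):
        self._counters[name] += 1

    def record_duration(self, name, seconds):
        self._durations[name].append(float(seconds))

    def get_metric_count(self, name):
        # defaultdict would create an entry on access, so use .get for reads
        return self._counters.get(name, 0)

    def duration_stats(self, name):
        """Return (average, minimum, maximum) in seconds, or None if no samples."""
        samples = self._durations.get(name)
        if not samples:
            return None
        return (sum(samples) / len(samples), min(samples), max(samples))
```

Computing the statistics lazily at read time keeps writes cheap; if duration volume is very high, running sums or a streaming summary would avoid retaining every sample.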
Expected Behavior:
The Monitor class should behave as a central hub for all monitoring data. When methods are called, data should be stored efficiently. When reporting methods are called, the aggregated and processed data should be returned in a clear and usable format.
Edge Cases:
- Handling requests for metrics that have not yet been recorded.
- Handling requests for event data when no events have been logged.
- Ensuring timestamp accuracy.
- Handling durations that are zero or negative (though typically durations should be positive).
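One hedged way to make these edge cases explicit in the Monitor methods; the specific return values and the ValueError are choices the challenge leaves to you:

```python
from datetime import datetime


def get_metric_count(self, name):
    # Metrics that were never incremented read as zero rather than raising KeyError.
    return self._counters.get(name, 0)


def get_average_duration(self, name):
    # No recorded samples -> None, so callers can render "N/A" instead of
    # hitting a ZeroDivisionError.
    samples = self._durations.get(name)
    if not samples:
        return None
    return sum(samples) / len(samples)


def record_duration(self, name, seconds):
    # Reject negative durations early; zero is permitted but often worth flagging.
    if seconds < 0:
        raise ValueError(f"duration must be non-negative, got {seconds!r}")
    self._durations[name].append(float(seconds))


def log_event(self, event_type, metadata=None):
    # Capture the timestamp once, immediately, so the stored time reflects when
    # the event happened rather than when storage completed.
    timestamp = datetime.now()
    self._events.append({"timestamp": timestamp, "event_type": event_type,
                         "metadata": metadata or {}})
```

With these conventions, queries against an empty monitor degrade gracefully: get_events_in_range simply returns an empty list and counters read as zero.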
Examples
Example 1:
```python
from datetime import datetime, timedelta

# Initialize the monitor
monitor = Monitor()

# Log some events
monitor.log_event("request_received", metadata={"user_id": "user123"})
monitor.log_event("database_query", metadata={"query_time": 0.05})
monitor.log_event("request_processed", metadata={"status_code": 200})
monitor.log_event("error_occurred", metadata={"error_type": "FileNotFound", "message": "config.json not found"})

# Track some metrics
monitor.increment_metric("total_requests")
monitor.increment_metric("total_requests")
monitor.record_duration("api_latency", 0.15)
monitor.record_duration("api_latency", 0.20)

# Get aggregated data
print(f"Total requests: {monitor.get_metric_count('total_requests')}")
print(f"Average API latency: {monitor.get_average_duration('api_latency'):.2f}s")
print(f"Events in the last minute: {len(monitor.get_events_in_range(datetime.now() - timedelta(minutes=1)))}")

# Generate a status report
print("\n--- Status Report ---")
report = monitor.generate_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Expected output:

```
Total requests: 2
Average API latency: 0.18s
Events in the last minute: 4

--- Status Report ---
total_events: 4
error_occurred: 1
api_latency_avg_s: 0.18
status: WARNING
```
Explanation: The monitor records four events. It tracks that "total_requests" was incremented twice and records two API latency durations, from which the average latency is computed. The status report shows the total event count, the count of the critical "error_occurred" event type, the average API latency, and a "WARNING" status because at least one error occurred.
Example 2:
```python
from datetime import datetime, timedelta

monitor = Monitor()

# Simulate a period of errors
monitor.log_event("error_occurred", metadata={"error_type": "DatabaseConnectionError"})
monitor.log_event("error_occurred", metadata={"error_type": "DatabaseConnectionError"})
monitor.log_event("error_occurred", metadata={"error_type": "DatabaseConnectionError"})
monitor.log_event("error_occurred", metadata={"error_type": "DatabaseConnectionError"})
monitor.log_event("error_occurred", metadata={"error_type": "DatabaseConnectionError"})

# Add some normal activity
monitor.log_event("request_processed")

print("\n--- Status Report (High Errors) ---")
report = monitor.generate_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Expected output:

```
--- Status Report (High Errors) ---
total_events: 6
error_occurred: 5
api_latency_avg_s: N/A
status: ERROR
```
Explanation: In this scenario, five "error_occurred" events are logged. The status report reflects this with a high count for "error_occurred" and consequently sets the overall status to "ERROR". "api_latency_avg_s" is shown as "N/A" because no API latency durations were recorded.
Constraints
- The Monitor class should be implemented in a single Python file.
- Timestamps should be stored using Python's datetime objects.
- Metric durations should be stored as floating-point numbers representing seconds.
- The Monitor class should be thread-safe if multiple threads might access it concurrently (consider using locks).
- The generate_report method should use predefined thresholds for determining the "status" field. To match the examples above:
  - If the error_occurred count >= 5, the status is "ERROR".
  - If the error_occurred count >= 1, the status is "WARNING".
  - Otherwise, the status is "OK".
- If no api_latency data exists, report "N/A" for the average latency.
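A hedged sketch of generate_report under these thresholds follows; the event representation, the RLock, and the rounding to two decimals are assumptions rather than requirements.

```python
import threading


class Monitor:
    def __init__(self):
        self._events = []     # each event: {"timestamp", "event_type", "metadata"}
        self._durations = {}  # metric name -> [seconds, ...]
        self._lock = threading.RLock()  # RLock: generate_report calls locked helpers

    def get_average_duration(self, name):
        with self._lock:
            samples = self._durations.get(name)
            return sum(samples) / len(samples) if samples else None

    def generate_report(self):
        with self._lock:
            errors = sum(1 for e in self._events
                         if e["event_type"] == "error_occurred")
            avg = self.get_average_duration("api_latency")
            if errors >= 5:
                status = "ERROR"
            elif errors >= 1:
                status = "WARNING"
            else:
                status = "OK"
            return {
                "total_events": len(self._events),
                "error_occurred": errors,
                "api_latency_avg_s": round(avg, 2) if avg is not None else "N/A",
                "status": status,
            }
```

Using a reentrant lock lets the report call other locked accessors from within its own critical section; with a plain threading.Lock, that nested acquire would deadlock.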
Notes
- Consider using collections.defaultdict or collections.Counter for efficient metric tracking.
- For event storage, a simple list or a more optimized data structure such as a deque might be suitable, depending on how you plan to retrieve events by time range.
- The generate_report method's status logic can be extended or made configurable. For this challenge, hardcoding the thresholds is acceptable.
- Think about how to handle the case where no data exists for a particular metric (e.g., the average duration when no durations have been recorded).
- The metadata for events can be any dictionary. You don't need to pre-define its structure.
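Since log_event records events as they happen, the event log is already sorted by timestamp; one sketch of time-range retrieval exploits that with a binary search over a parallel timestamp list. The _timestamps attribute is an assumption, not part of the spec.

```python
from bisect import bisect_left, bisect_right

# Inside Monitor.log_event, alongside appending to self._events:
#     self._timestamps.append(timestamp)  # stays sorted: events arrive in time order


def get_events_in_range(self, start, end=None):
    """Return events with start <= timestamp (and timestamp <= end, if given)."""
    lo = bisect_left(self._timestamps, start)
    hi = bisect_right(self._timestamps, end) if end is not None else len(self._events)
    return self._events[lo:hi]
```

For modest event volumes a straightforward list scan with one comparison per event is perfectly adequate; the bisect variant only pays off once the log grows large.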