User Activity Analysis: 30-Day Summary
This challenge focuses on analyzing user activity data to generate a summary report for the past 30 days. Understanding user engagement patterns is crucial for product improvement, targeted marketing, and overall business strategy. You will be provided with a dataset of user activity logs and tasked with calculating key metrics like total active users, average daily activity, and most active day.
Problem Description
You are given a dataset representing user activity logs. Each log entry contains a user_id, a timestamp (representing when the activity occurred), and an activity_type (e.g., "login", "post", "comment"). Your task is to analyze this data and generate a summary report for the past 30 days (including today). The report should include the following:
- Total Active Users: The number of unique users who performed any activity within the last 30 days.
- Average Daily Activity: The average number of activity logs per day, taken over the days within the last 30 days that have at least one log.
- Most Active Day: The date with the highest number of activity logs within the last 30 days.
Key Requirements:
- The timestamp should be treated as a date (ignoring time).
- The 30-day window ends on today's date and includes it: today plus the 29 preceding days.
- Handle cases where the input data is empty.
- Handle cases where no activity occurred within the last 30 days.
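The first two requirements can be sketched in Python as follows. This is a minimal illustration rather than a reference solution: the helper names `to_date` and `in_window` are hypothetical, and the window is assumed to cover today plus the 29 preceding days, following the "including today" wording above.

```python
from datetime import date, timedelta

WINDOW_DAYS = 30  # assumption: today plus the 29 days before it


def to_date(timestamp: str) -> date:
    """Parse a timestamp, keeping only its date part.

    Slicing the first 10 characters also tolerates values such as
    "2024-01-20T08:15:00" (an assumption; the stated format is "YYYY-MM-DD").
    """
    return date.fromisoformat(timestamp[:10])


def in_window(log_date: date, today: date) -> bool:
    """Return True if log_date falls in the 30-day window ending today."""
    window_start = today - timedelta(days=WINDOW_DAYS - 1)
    return window_start <= log_date <= today


today = date(2024, 1, 25)
print(in_window(to_date("2024-01-20"), today))  # True
print(in_window(to_date("2023-12-20"), today))  # False
```

Passing `today` in explicitly, instead of reading the system clock inside the helper, keeps the check reproducible against the examples.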
Expected Behavior:
The function should take the activity log data as input and return a dictionary (or similar data structure) containing the calculated metrics. The dictionary should have the following keys: "total_active_users", "average_daily_activity", and "most_active_day".
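One possible way to pin down the return shape in Python is a `TypedDict`. The class name `ActivityReport` and the helper `empty_report` are hypothetical, and the choice to return the date as a "YYYY-MM-DD" string is an assumption based on the examples.

```python
from datetime import date
from typing import TypedDict


class ActivityReport(TypedDict):
    total_active_users: int
    average_daily_activity: float
    most_active_day: str  # assumed to be formatted as "YYYY-MM-DD"


def empty_report(today: date) -> ActivityReport:
    """Report for empty input or for input with no in-window activity."""
    return {
        "total_active_users": 0,
        "average_daily_activity": 0.0,
        "most_active_day": today.isoformat(),
    }


print(empty_report(date(2024, 1, 25)))
```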
Edge Cases to Consider:
- Empty input data.
- No activity within the last 30 days.
- Data spanning multiple years (ensure only the last 30 days are considered).
- Large datasets (consider efficiency).
Examples
Example 1:
Input: [
{"user_id": 1, "timestamp": "2024-01-20", "activity_type": "login"},
{"user_id": 2, "timestamp": "2024-01-21", "activity_type": "post"},
{"user_id": 1, "timestamp": "2024-01-22", "activity_type": "comment"},
{"user_id": 3, "timestamp": "2024-01-23", "activity_type": "login"},
{"user_id": 2, "timestamp": "2024-01-24", "activity_type": "post"}
]
(Assuming today's date is 2024-01-25)
Output: {
"total_active_users": 3,
"average_daily_activity": 1.0,
"most_active_day": "2024-01-24"
}
Explanation: 3 unique users were active. The average daily activity is 1.0 (5 activities across 5 days with activity). Every active day has exactly 1 log, so the tie is broken in favor of the most recent date, 2024-01-24. Today (2024-01-25) has no logs and therefore cannot be the most active day.
Example 2:
Input: []
(Assuming today's date is 2024-01-25)
Output: {
"total_active_users": 0,
"average_daily_activity": 0.0,
"most_active_day": "2024-01-25"
}
Explanation: No users were active, so the average daily activity is 0.0. With no activity to rank, most_active_day defaults to today's date.
Example 3: (Edge Case - No activity in the last 30 days)
Input: [
{"user_id": 1, "timestamp": "2023-12-20", "activity_type": "login"},
{"user_id": 2, "timestamp": "2023-12-21", "activity_type": "post"}
]
(Assuming today's date is 2024-01-25)
Output: {
"total_active_users": 0,
"average_daily_activity": 0.0,
"most_active_day": "2024-01-25"
}
Explanation: Both entries predate the 30-day window, so no activity is counted and the result matches the empty-input case.
Constraints
- Input Data Size: The input list can contain up to 10,000 activity log entries.
- Timestamp Format: Timestamps are provided in "YYYY-MM-DD" format.
- Performance: The solution should complete within 1 second for the given input size.
- Date Range: The 30-day window is calculated relative to the current date.
Notes
- Consider using appropriate data structures (e.g., sets for unique users, dictionaries for daily activity counts) to optimize performance.
- You may need to parse the timestamp strings into date objects for easier manipulation.
- The "most_active_day" should be the date with the highest activity count within the 30-day window. If multiple days have the same highest activity count, return the most recent date.
- Assume no timestamp in the input data is later than today's date.
- Focus on clarity and readability in your pseudocode.
- Pseudocode should clearly outline the steps involved in calculating each metric.
- Error handling is not explicitly required, but consider how your pseudocode would handle unexpected input formats.
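The data-structure and tie-breaking notes above can be illustrated with a short Python sketch. The `events` list is made-up sample data that is assumed to be already filtered to the 30-day window; it is not taken from the examples.

```python
from collections import Counter
from datetime import date

# Hypothetical, already-filtered (user_id, date) pairs for illustration.
events = [
    (1, date(2024, 1, 20)),
    (2, date(2024, 1, 20)),
    (1, date(2024, 1, 22)),
    (3, date(2024, 1, 22)),
]

active_users = {user_id for user_id, _ in events}  # set removes duplicates
daily_counts = Counter(day for _, day in events)   # maps date -> log count

# Compare on (count, date): the highest count wins, and among equal counts
# the later date wins because dates compare chronologically.
most_active_day = max(daily_counts, key=lambda day: (daily_counts[day], day))

print(len(active_users))  # 3
print(most_active_day)    # 2024-01-22
```

Both days have two logs here, so the result shows the "most recent date" rule in action.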
Pseudocode:
FUNCTION analyze_user_activity(activity_logs):
    // Get today's date
    today = CURRENT_DATE
    // The window includes today, so it starts 29 days back
    window_start = today - 29 DAYS
    // Initialize data structures
    active_users = SET()
    daily_activity = DICTIONARY()
    most_active_day = today    // default when nothing falls in the window
    max_activity = 0
    // Iterate through the activity logs
    FOR EACH log IN activity_logs:
        // Convert timestamp to date (time of day is ignored)
        log_date = DATE(log["timestamp"])
        // Check if the log is within the last 30 days
        IF log_date >= window_start AND log_date <= today:
            // Add user to the set of active users
            active_users.add(log["user_id"])
            // Increment daily activity count
            IF log_date IN daily_activity:
                daily_activity[log_date] = daily_activity[log_date] + 1
            ELSE:
                daily_activity[log_date] = 1
            // Update most active day; ties go to the most recent date
            IF daily_activity[log_date] > max_activity:
                max_activity = daily_activity[log_date]
                most_active_day = log_date
            ELSE IF daily_activity[log_date] == max_activity AND log_date > most_active_day:
                most_active_day = log_date
    // Calculate average daily activity over the days that have at least one log
    num_days = SIZE(daily_activity)
    total_activity = 0
    FOR date IN daily_activity:
        total_activity = total_activity + daily_activity[date]
    IF num_days > 0:
        average_daily_activity = total_activity / num_days
    ELSE:
        average_daily_activity = 0.0
    // Prepare the result
    result = {
        "total_active_users": SIZE(active_users),
        "average_daily_activity": average_daily_activity,
        "most_active_day": most_active_day
    }
    RETURN result
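For readers who want to run the logic, the pseudocode can be rendered in Python roughly as below. This is a sketch under stated assumptions, not a definitive solution: the optional `today` parameter is a hypothetical addition so the examples can be reproduced with a fixed date, the window is taken to be today plus the 29 preceding days, and the average is taken over days that have at least one log. With the tie-break rule from the Notes, the Example 1 input yields 2024-01-24 as the most active day.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Optional


def analyze_user_activity(activity_logs: list, today: Optional[date] = None) -> dict:
    """Summarize activity for the 30-day window ending today (inclusive)."""
    today = today or date.today()
    window_start = today - timedelta(days=29)  # assumption: today + 29 prior days

    active_users = set()
    daily_activity = Counter()
    for log in activity_logs:
        log_date = date.fromisoformat(log["timestamp"][:10])
        if window_start <= log_date <= today:
            active_users.add(log["user_id"])
            daily_activity[log_date] += 1

    if daily_activity:
        # Average over the days that have at least one log.
        average = sum(daily_activity.values()) / len(daily_activity)
        # Highest count wins; ties go to the most recent date.
        busiest = max(daily_activity, key=lambda d: (daily_activity[d], d))
    else:
        average, busiest = 0.0, today  # no in-window activity: default to today

    return {
        "total_active_users": len(active_users),
        "average_daily_activity": average,
        "most_active_day": busiest.isoformat(),
    }


logs = [
    {"user_id": 1, "timestamp": "2024-01-20", "activity_type": "login"},
    {"user_id": 2, "timestamp": "2024-01-21", "activity_type": "post"},
    {"user_id": 1, "timestamp": "2024-01-22", "activity_type": "comment"},
    {"user_id": 3, "timestamp": "2024-01-23", "activity_type": "login"},
    {"user_id": 2, "timestamp": "2024-01-24", "activity_type": "post"},
]
print(analyze_user_activity(logs, today=date(2024, 1, 25)))
# {'total_active_users': 3, 'average_daily_activity': 1.0, 'most_active_day': '2024-01-24'}
```

The function makes a single pass over the logs, so 10,000 entries should be well within the stated time limit on typical hardware.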