Python Data Aggregation Challenge
This challenge focuses on implementing fundamental data aggregation techniques in Python. You'll process a list of dictionaries, grouping records by a key and computing a count and a sum for each group. This is a crucial skill for data analysis and manipulation.
Problem Description
You are given a list of dictionaries, where each dictionary represents a record with various attributes. Your goal is to write a Python function that aggregates this data based on a specified key and returns a summary dictionary. The aggregation should perform a count of records and calculate the sum of a specified numerical value for each unique key value.
Key Requirements:
- Function Signature: Implement a function named aggregate_data that accepts three arguments:
  - data: A list of dictionaries.
  - group_by_key: A string representing the key to group the data by.
  - sum_key: A string representing the key whose values should be summed.
- Aggregation Logic:
  - Iterate through the data list.
  - For each dictionary, identify the value associated with group_by_key. This value will be used as the key in the output dictionary.
  - For each unique group_by_key value, maintain:
    - A count of how many records share that group_by_key value.
    - The sum of the values associated with sum_key for all records sharing that group_by_key value.
- Output Format: The function should return a dictionary where:
  - Keys are the unique values found in group_by_key.
  - Values are themselves dictionaries, containing two keys:
    - count: The total number of records for that group.
    - sum: The sum of sum_key values for that group.
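The aggregation logic above can be sketched as a single pass over data. This is only a minimal illustration: it assumes every record contains both keys and a numeric sum_key value, so the missing-key and non-numeric rules described in the sections that follow are deliberately left out. The name aggregate_well_formed is hypothetical, chosen so it is not mistaken for the required aggregate_data.

```python
def aggregate_well_formed(data, group_by_key, sum_key):
    """Group records by group_by_key, tracking a count and a sum per group.

    Assumption: every record has both keys and a numeric sum_key value.
    """
    summary = {}
    for record in data:
        group = record[group_by_key]
        # Create the group's entry the first time it is seen, then update it.
        entry = summary.setdefault(group, {"count": 0, "sum": 0})
        entry["count"] += 1
        entry["sum"] += record[sum_key]
    return summary


records = [
    {"category": "A", "value": 10},
    {"category": "B", "value": 20},
    {"category": "A", "value": 15},
]
print(aggregate_well_formed(records, "category", "value"))
# {'A': {'count': 2, 'sum': 25}, 'B': {'count': 1, 'sum': 20}}
```

Using dict.setdefault keeps the loop to one dictionary lookup per record, which is one way to stay within a single O(N) pass.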
Expected Behavior:
- If group_by_key or sum_key does not exist in a dictionary, that dictionary should be skipped entirely: it adds to neither the count nor the sum of any group.
- If sum_key has a non-numeric value for a record, the record still counts toward its group, but the value should be treated as 0 for the sum calculation.
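The two behaviors above can be isolated into small helpers. The sketch below is one possible reading, not the only valid one: the helper names has_both_keys and numeric_or_zero are hypothetical, and it assumes that "numeric" means int or float only, so booleans and numeric-looking strings such as "5" are treated as 0.

```python
def has_both_keys(record, group_by_key, sum_key):
    """Return True only if the record can take part in the aggregation."""
    return group_by_key in record and sum_key in record


def numeric_or_zero(value):
    """Return value if it is an int or float, otherwise 0.

    Assumption: bool is excluded even though it subclasses int, and
    strings are never converted, even if they look like numbers.
    """
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value
    return 0


print(has_both_keys({"type": "Y", "price": 50}, "type", "amount"))  # False
print(numeric_or_zero("invalid"))  # 0
print(numeric_or_zero(12))  # 12
```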
Edge Cases:
- Empty Input List: If data is an empty list, the function should return an empty dictionary.
- Missing Keys: Handle cases where group_by_key or sum_key might be missing from some dictionaries in the input list.
- Non-numeric sum_key values: Ensure robust handling of non-numeric data in sum_key.
Examples
Example 1:
Input:
data = [
{"category": "A", "value": 10},
{"category": "B", "value": 20},
{"category": "A", "value": 15},
{"category": "C", "value": 5},
{"category": "B", "value": 25}
]
group_by_key = "category"
sum_key = "value"
Output:
{
"A": {"count": 2, "sum": 25},
"B": {"count": 2, "sum": 45},
"C": {"count": 1, "sum": 5}
}
Explanation:
Data is grouped by "category".
- Category "A" appears twice with values 10 and 15, so count is 2 and sum is 25.
- Category "B" appears twice with values 20 and 25, so count is 2 and sum is 45.
- Category "C" appears once with value 5, so count is 1 and sum is 5.
Example 2:
Input:
data = [
{"product": "apple", "quantity": 5},
{"product": "banana", "quantity": 10},
{"product": "apple", "quantity": 3},
{"product": "orange", "quantity": 7},
{"product": "banana", "quantity": 12},
{"product": "apple", "quantity": "invalid"} # Non-numeric value
]
group_by_key = "product"
sum_key = "quantity"
Output:
{
"apple": {"count": 3, "sum": 8},
"banana": {"count": 2, "sum": 22},
"orange": {"count": 1, "sum": 7}
}
Explanation:
Data is grouped by "product".
- For "apple", count is 3. The values are 5, 3, and "invalid". "invalid" is treated as 0, so sum is 5 + 3 + 0 = 8.
- For "banana", count is 2 with values 10 and 12, sum is 22.
- For "orange", count is 1 with value 7, sum is 7.
Example 3: (Edge Case - Missing Keys)
Input:
data = [
{"type": "X", "amount": 100},
{"type": "Y", "price": 50}, # Missing "amount"
{"type": "X", "amount": 200, "extra": "data"},
{"kind": "Z", "amount": 300}, # Missing "type"
{"type": "Y", "amount": 150}
]
group_by_key = "type"
sum_key = "amount"
Output:
{
"X": {"count": 2, "sum": 300},
"Y": {"count": 1, "sum": 150}
}
Explanation:
- Records with "type": "X" have "amount" 100 and 200, count 2, sum 300.
- Records with "type": "Y" have "amount" 150, count 1, sum 150.
- The record {"type": "Y", "price": 50} is skipped entirely because "amount" is missing, so it adds to neither the count nor the sum for "Y".
- The record {"kind": "Z", "amount": 300} is skipped for grouping because "type" is missing.
Constraints
- The data list can contain up to 1000 dictionaries.
- The group_by_key and sum_key strings will be valid dictionary keys.
- The values for group_by_key will be strings or numbers.
- The values for sum_key can be integers, floats, or strings.
- Your solution should aim for a time complexity of O(N), where N is the number of dictionaries in the data list.
Notes
- Consider using Python's built-in data structures like dictionaries to efficiently store and update aggregation results.
- Think about how to handle potential KeyError exceptions when accessing dictionary elements.
- The "sum" calculation should gracefully handle non-numeric sum_key values by treating them as zero.
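Putting these notes together, one possible solution sketch is shown below. It is not the only valid approach: it handles missing keys by catching KeyError (a membership test with "in" would work equally well), and it assumes that only int and float values count as numeric, so booleans and strings contribute 0 to the sum.

```python
def aggregate_data(data, group_by_key, sum_key):
    """Return {group: {"count": n, "sum": total}} in a single O(N) pass."""
    summary = {}
    for record in data:
        try:
            group = record[group_by_key]
            raw_value = record[sum_key]
        except KeyError:
            # Either key is missing: skip the record entirely.
            continue
        # Assumption: only int/float are numeric; bool and str become 0.
        is_number = isinstance(raw_value, (int, float)) and not isinstance(raw_value, bool)
        entry = summary.setdefault(group, {"count": 0, "sum": 0})
        entry["count"] += 1
        entry["sum"] += raw_value if is_number else 0
    return summary


example_3 = [
    {"type": "X", "amount": 100},
    {"type": "Y", "price": 50},
    {"type": "X", "amount": 200, "extra": "data"},
    {"kind": "Z", "amount": 300},
    {"type": "Y", "amount": 150},
]
print(aggregate_data(example_3, "type", "amount"))
# {'X': {'count': 2, 'sum': 300}, 'Y': {'count': 1, 'sum': 150}}
```

The try/except form reads both keys before touching the summary, so a record missing either key never creates an empty group.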