Python Data Aggregation Challenge

This challenge focuses on a fundamental data aggregation technique in Python: grouping a list of dictionaries by a key and computing a count and a sum for each group. Grouped aggregation of this kind is a core skill for data analysis and manipulation.

Problem Description

You are given a list of dictionaries, where each dictionary represents a record with various attributes. Your goal is to write a Python function that aggregates this data based on a specified key and returns a summary dictionary. The aggregation should perform a count of records and calculate the sum of a specified numerical value for each unique key value.

Key Requirements:

  1. Function Signature: Implement a function named aggregate_data that accepts three arguments:

    • data: A list of dictionaries.
    • group_by_key: A string representing the key to group the data by.
    • sum_key: A string representing the key whose values should be summed.
  2. Aggregation Logic:

    • Iterate through the data list.
    • For each dictionary, identify the value associated with group_by_key. This value will be used as the key in the output dictionary.
    • For each unique group_by_key value, maintain:
      • A count of how many records share that group_by_key value.
      • The sum of the values associated with sum_key for all records sharing that group_by_key value.
  3. Output Format: The function should return a dictionary where:

    • Keys are the unique values found in group_by_key.
    • Values are themselves dictionaries, containing two keys:
      • count: The total number of records for that group.
      • sum: The sum of sum_key values for that group.
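
The aggregation logic and output format above can be sketched for the simplest case. This is only a happy-path illustration: it assumes every record contains both keys and a numeric value, so it does not yet apply the missing-key or non-numeric rules. The name aggregate_well_formed is hypothetical and chosen to keep it distinct from the required aggregate_data.

```python
def aggregate_well_formed(data, group_by_key, sum_key):
    """Happy-path sketch: assumes both keys exist and sum_key values are numeric."""
    result = {}
    for record in data:
        group = record[group_by_key]
        if group not in result:
            # First record seen for this group: start its accumulator.
            result[group] = {"count": 0, "sum": 0}
        result[group]["count"] += 1
        result[group]["sum"] += record[sum_key]
    return result


records = [
    {"category": "A", "value": 10},
    {"category": "B", "value": 20},
    {"category": "A", "value": 15},
]
print(aggregate_well_formed(records, "category", "value"))
# {'A': {'count': 2, 'sum': 25}, 'B': {'count': 1, 'sum': 20}}
```

Each group's inner dictionary acts as a running accumulator, so the result is complete as soon as the loop ends.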

Expected Behavior:

  • If group_by_key or sum_key does not exist in a dictionary, that dictionary should be skipped entirely: it contributes to neither the count nor the sum of any group (see Example 3).
  • If sum_key has a non-numeric value for a record, that value should be treated as 0 in the sum. The record is still included in the count (see Example 2).
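
One possible way to combine these two rules is sketched below. It is not the only valid solution. The helper numeric_or_zero is a hypothetical name, and its treatment of bool values and numeric-looking strings such as "12" as non-numeric is an assumption, since the problem statement does not say either way.

```python
def numeric_or_zero(value):
    """Return value if it is an int or float, otherwise 0.

    Assumption: bool values and numeric-looking strings such as "12"
    are treated as non-numeric.
    """
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return 0
    return value


def aggregate_data(data, group_by_key, sum_key):
    """Group records by group_by_key, tracking a count and a sum per group."""
    result = {}
    for record in data:
        # A record missing either key is skipped entirely, so it adds
        # to neither the count nor the sum (matches Example 3).
        if group_by_key not in record or sum_key not in record:
            continue
        entry = result.setdefault(record[group_by_key], {"count": 0, "sum": 0})
        # The record is counted even when its value contributes 0
        # (matches Example 2).
        entry["count"] += 1
        entry["sum"] += numeric_or_zero(record[sum_key])
    return result
```

Because the membership check runs before any group is created, a group whose records are all skipped never appears in the output.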

Edge Cases:

  • Empty Input List: If data is an empty list, the function should return an empty dictionary.
  • Missing Keys: Handle cases where group_by_key or sum_key might be missing from some dictionaries in the input list.
  • Non-numeric sum_key values: Ensure robust handling of non-numeric data in sum_key.

Examples

Example 1:

Input:
data = [
    {"category": "A", "value": 10},
    {"category": "B", "value": 20},
    {"category": "A", "value": 15},
    {"category": "C", "value": 5},
    {"category": "B", "value": 25}
]
group_by_key = "category"
sum_key = "value"

Output:
{
    "A": {"count": 2, "sum": 25},
    "B": {"count": 2, "sum": 45},
    "C": {"count": 1, "sum": 5}
}
Explanation:
Data is grouped by "category".
- Category "A" appears twice with values 10 and 15, so count is 2 and sum is 25.
- Category "B" appears twice with values 20 and 25, so count is 2 and sum is 45.
- Category "C" appears once with value 5, so count is 1 and sum is 5.

Example 2:

Input:
data = [
    {"product": "apple", "quantity": 5},
    {"product": "banana", "quantity": 10},
    {"product": "apple", "quantity": 3},
    {"product": "orange", "quantity": 7},
    {"product": "banana", "quantity": 12},
    {"product": "apple", "quantity": "invalid"} # Non-numeric value
]
group_by_key = "product"
sum_key = "quantity"

Output:
{
    "apple": {"count": 3, "sum": 8},
    "banana": {"count": 2, "sum": 22},
    "orange": {"count": 1, "sum": 7}
}
Explanation:
Data is grouped by "product".
- For "apple", count is 3. The values are 5, 3, and "invalid". "invalid" is treated as 0, so sum is 5 + 3 + 0 = 8.
- For "banana", count is 2 with values 10 and 12, sum is 22.
- For "orange", count is 1 with value 7, sum is 7.

Example 3: (Edge Case - Missing Keys)

Input:
data = [
    {"type": "X", "amount": 100},
    {"type": "Y", "price": 50}, # Missing "amount"
    {"type": "X", "amount": 200, "extra": "data"},
    {"kind": "Z", "amount": 300}, # Missing "type"
    {"type": "Y", "amount": 150}
]
group_by_key = "type"
sum_key = "amount"

Output:
{
    "X": {"count": 2, "sum": 300},
    "Y": {"count": 1, "sum": 150}
}
Explanation:
- Records with "type": "X" have "amount" 100 and 200, count 2, sum 300.
- Only one record with "type": "Y" also contains "amount" (150), so count is 1 and sum is 150.
- The record {"type": "Y", "price": 50} is skipped entirely because "amount" is missing, so it adds to neither the count nor the sum for "Y".
- The record {"kind": "Z", "amount": 300} is skipped for grouping because "type" is missing.

Constraints

  • The data list can contain up to 1000 dictionaries.
  • The group_by_key and sum_key arguments will always be strings, although either key may be absent from individual records.
  • The values for group_by_key will be strings or numbers.
  • The values for sum_key can be integers, floats, or strings.
  • Your solution should aim for a time complexity of O(N), where N is the number of dictionaries in the data list.

Notes

  • Consider using Python's built-in data structures like dictionaries to efficiently store and update aggregation results.
  • Think about how to handle potential KeyError exceptions when accessing dictionary elements.
  • The "sum" calculation should gracefully handle non-numeric sum_key values by treating them as zero.
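
As a small illustration of the KeyError note, the sketch below reads both keys inside a try block and reports a missing key by returning None. The helper name read_pair is hypothetical. An explicit membership test (key in record) would work equally well; this version simply shows the exception-based style.

```python
def read_pair(record, group_by_key, sum_key):
    """Return (group, value), or None when either key is absent."""
    try:
        return record[group_by_key], record[sum_key]
    except KeyError:
        # Either lookup failed, so the caller should skip this record.
        return None


print(read_pair({"type": "X", "amount": 100}, "type", "amount"))  # ('X', 100)
print(read_pair({"type": "Y", "price": 50}, "type", "amount"))    # None
```

Using record.get(sum_key) instead would not distinguish a missing key from a key whose stored value is None, which is why the sketch relies on KeyError.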