Hone logo
Hone
Problems

Dataframe Manipulation Challenge: Analyzing Sales Data

This challenge focuses on performing common DataFrame operations using the Pandas library in Python. You'll be provided with a dataset representing sales transactions and tasked with extracting insights through filtering, aggregation, and transformation – skills crucial for data analysis and business intelligence. Successfully completing this challenge demonstrates proficiency in manipulating and analyzing tabular data.

Problem Description

You are given a Pandas DataFrame named sales_data containing information about sales transactions. The DataFrame has the following columns:

  • TransactionID: Unique identifier for each transaction (integer).
  • Date: Date of the transaction (string in 'YYYY-MM-DD' format).
  • Product: Name of the product sold (string).
  • Category: Category of the product (string).
  • Quantity: Number of units sold (integer).
  • UnitPrice: Price per unit (float).
  • CustomerID: Identifier for the customer making the purchase (integer).

Your task is to perform the following operations on the sales_data DataFrame:

  1. Filter Transactions: Select all transactions that occurred in the year 2023.
  2. Calculate Total Revenue: Calculate the total revenue for each product by multiplying Quantity and UnitPrice and summing it up for each product.
  3. Find Top 3 Categories: Determine the top 3 product categories based on total revenue generated.
  4. Average Unit Price by Category: Calculate the average UnitPrice for each product category.
  5. Filter by Quantity: Select all transactions where the Quantity is greater than 5.

Examples

Example 1:

Input: sales_data (DataFrame with sample data - see below for structure)
Output: DataFrame with transactions from 2023 only.
Explanation: Filter the DataFrame based on the 'Date' column, keeping only rows where the year is 2023.

Example 2:

Input: sales_data (DataFrame)
Output: Pandas Series showing total revenue per product.
Explanation: Create a new column 'Revenue' by multiplying 'Quantity' and 'UnitPrice'. Then, group by 'Product' and sum the 'Revenue' column.

Example 3:

Input: sales_data (DataFrame)
Output: Pandas Series showing the top 3 categories by total revenue.
Explanation: Calculate total revenue per category (group by 'Category', sum 'Revenue'). Sort the categories by total revenue in descending order and select the top 3.

Sample sales_data DataFrame Structure (for testing):

TransactionID  Date       Product      Category  Quantity  UnitPrice  CustomerID
0              1  2022-12-25  Laptop      Electronics  1         1200.00         101
1              2  2023-01-10  Keyboard     Accessories  2          75.00         102
2              3  2023-02-15  Mouse        Accessories  3          25.00         101
3              4  2022-11-05  Monitor      Electronics  1          300.00         103
4              5  2023-03-20  Tablet       Electronics  1          450.00         102
5              6  2023-04-01  Webcam       Accessories  4          50.00         103
6              7  2023-05-12  Laptop      Electronics  1         1250.00         101
7              8  2022-10-28  Printer      Electronics  1          200.00         102
8              9  2023-06-05  Mouse        Accessories  5          20.00         103
9             10  2023-07-18  Keyboard     Accessories  1          80.00         101

Constraints

  • The sales_data DataFrame will always contain at least 10 rows.
  • The 'Date' column will always be in the format 'YYYY-MM-DD'.
  • All numerical columns (Quantity, UnitPrice) will contain only positive integers or floats.
  • The solution should be efficient and avoid unnecessary iterations. Vectorized operations are preferred.
  • The solution must use Pandas DataFrame operations. Avoid explicit looping through rows.

Notes

  • You will need to import the Pandas library.
  • Consider using Pandas functions like query, groupby, sum, sort_values, head, and mean to solve the problem.
  • Pay close attention to data types when performing calculations.
  • The order of the top 3 categories in the output is not important.
  • The provided sample DataFrame is for testing purposes only. Your solution should work correctly with any valid sales_data DataFrame.
  • Ensure your code is well-documented and easy to understand.
Loading editor...
python