Dataframe Manipulation Challenge: Analyzing Sales Data
This challenge focuses on performing common DataFrame operations using the Pandas library in Python. You'll be provided with a dataset representing sales transactions and tasked with extracting insights through filtering, aggregation, and transformation – skills crucial for data analysis and business intelligence. Successfully completing this challenge demonstrates proficiency in manipulating and analyzing tabular data.
Problem Description
You are given a Pandas DataFrame named sales_data containing information about sales transactions. The DataFrame has the following columns:
TransactionID: Unique identifier for each transaction (integer).Date: Date of the transaction (string in 'YYYY-MM-DD' format).Product: Name of the product sold (string).Category: Category of the product (string).Quantity: Number of units sold (integer).UnitPrice: Price per unit (float).CustomerID: Identifier for the customer making the purchase (integer).
Your task is to perform the following operations on the sales_data DataFrame:
- Filter Transactions: Select all transactions that occurred in the year 2023.
- Calculate Total Revenue: Calculate the total revenue for each product by multiplying
QuantityandUnitPriceand summing it up for each product. - Find Top 3 Categories: Determine the top 3 product categories based on total revenue generated.
- Average Unit Price by Category: Calculate the average
UnitPricefor each product category. - Filter by Quantity: Select all transactions where the
Quantityis greater than 5.
Examples
Example 1:
Input: sales_data (DataFrame with sample data - see below for structure)
Output: DataFrame with transactions from 2023 only.
Explanation: Filter the DataFrame based on the 'Date' column, keeping only rows where the year is 2023.
Example 2:
Input: sales_data (DataFrame)
Output: Pandas Series showing total revenue per product.
Explanation: Create a new column 'Revenue' by multiplying 'Quantity' and 'UnitPrice'. Then, group by 'Product' and sum the 'Revenue' column.
Example 3:
Input: sales_data (DataFrame)
Output: Pandas Series showing the top 3 categories by total revenue.
Explanation: Calculate total revenue per category (group by 'Category', sum 'Revenue'). Sort the categories by total revenue in descending order and select the top 3.
Sample sales_data DataFrame Structure (for testing):
TransactionID Date Product Category Quantity UnitPrice CustomerID
0 1 2022-12-25 Laptop Electronics 1 1200.00 101
1 2 2023-01-10 Keyboard Accessories 2 75.00 102
2 3 2023-02-15 Mouse Accessories 3 25.00 101
3 4 2022-11-05 Monitor Electronics 1 300.00 103
4 5 2023-03-20 Tablet Electronics 1 450.00 102
5 6 2023-04-01 Webcam Accessories 4 50.00 103
6 7 2023-05-12 Laptop Electronics 1 1250.00 101
7 8 2022-10-28 Printer Electronics 1 200.00 102
8 9 2023-06-05 Mouse Accessories 5 20.00 103
9 10 2023-07-18 Keyboard Accessories 1 80.00 101
Constraints
- The
sales_dataDataFrame will always contain at least 10 rows. - The 'Date' column will always be in the format 'YYYY-MM-DD'.
- All numerical columns (
Quantity,UnitPrice) will contain only positive integers or floats. - The solution should be efficient and avoid unnecessary iterations. Vectorized operations are preferred.
- The solution must use Pandas DataFrame operations. Avoid explicit looping through rows.
Notes
- You will need to import the Pandas library.
- Consider using Pandas functions like
query,groupby,sum,sort_values,head, andmeanto solve the problem. - Pay close attention to data types when performing calculations.
- The order of the top 3 categories in the output is not important.
- The provided sample DataFrame is for testing purposes only. Your solution should work correctly with any valid
sales_dataDataFrame. - Ensure your code is well-documented and easy to understand.