DataFrame Data Exploration and Manipulation
This challenge will test your ability to perform fundamental data exploration and manipulation using Python's pandas library. You will be tasked with creating a DataFrame from provided data and then executing various operations to extract, filter, and transform information. This is a crucial skill for anyone working with data analysis and machine learning.
Problem Description
Your task is to create a pandas DataFrame from a given list of dictionaries. After creating the DataFrame, you will need to perform the following operations:
- Display the first 5 rows of the DataFrame.
- Display the last 3 rows of the DataFrame.
- Display the shape (number of rows and columns) of the DataFrame.
- Display the data types of each column.
- Select and display the 'Name' and 'Age' columns.
- Filter the DataFrame to show only individuals older than 30.
- Filter the DataFrame to show only individuals living in 'New York'.
- Add a new column named 'Salary_USD' which is 1.2 times the 'Salary' column (assuming 'Salary' is in a different currency).
- Calculate and display the average age of all individuals.
- Group the data by 'City' and calculate the average 'Age' for each city.
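The inspection, selection, and filtering steps (the first seven operations above) can be sketched as follows. This is a minimal illustration, assuming the column names 'Name', 'Age', 'City', and 'Salary' used in this challenge; the variable names are illustrative, not required by the task.

```python
import pandas as pd

# Three rows borrowed from Example 1; any list of dictionaries with the same keys works.
data = [
    {"Name": "Alice", "Age": 25, "City": "New York", "Salary": 50000},
    {"Name": "Bob", "Age": 32, "City": "Los Angeles", "Salary": 60000},
    {"Name": "Charlie", "Age": 28, "City": "New York", "Salary": 55000},
]

df = pd.DataFrame(data)  # each dictionary becomes one row, each key one column

print(df.head(5))   # first 5 rows (fewer if the frame is shorter)
print(df.tail(3))   # last 3 rows
print(df.shape)     # (rows, columns) tuple
print(df.dtypes)    # dtype of each column

name_age = df[["Name", "Age"]]               # a list of labels selects several columns
older_than_30 = df[df["Age"] > 30]           # boolean mask keeps rows where the condition holds
in_new_york = df[df["City"] == "New York"]   # same idea with an equality test

print(name_age)
print(older_than_30)
print(in_new_york)
```

Both filters return new DataFrames that keep the original row labels, which is why the expected output for the filtered results shows non-consecutive index values.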
Examples
Example 1:
Input Data (list of dictionaries):
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York', 'Salary': 50000},
    {'Name': 'Bob', 'Age': 32, 'City': 'Los Angeles', 'Salary': 60000},
    {'Name': 'Charlie', 'Age': 28, 'City': 'New York', 'Salary': 55000},
    {'Name': 'David', 'Age': 35, 'City': 'Chicago', 'Salary': 70000},
    {'Name': 'Eve', 'Age': 22, 'City': 'New York', 'Salary': 48000},
    {'Name': 'Frank', 'Age': 40, 'City': 'Los Angeles', 'Salary': 75000}
]
Expected Output (after performing all operations, shown sequentially):
- First 5 rows:
      Name  Age         City  Salary
0    Alice   25     New York   50000
1      Bob   32  Los Angeles   60000
2  Charlie   28     New York   55000
3    David   35      Chicago   70000
4      Eve   22     New York   48000
- Last 3 rows:
    Name  Age         City  Salary
3  David   35      Chicago   70000
4    Eve   22     New York   48000
5  Frank   40  Los Angeles   75000
- Shape:
(6, 4)
- Data Types:
Name      object
Age        int64
City      object
Salary     int64
dtype: object
- 'Name' and 'Age' columns:
      Name  Age
0    Alice   25
1      Bob   32
2  Charlie   28
3    David   35
4      Eve   22
5    Frank   40
- Individuals older than 30:
    Name  Age         City  Salary
1    Bob   32  Los Angeles   60000
3  David   35      Chicago   70000
5  Frank   40  Los Angeles   75000
- Individuals in 'New York':
      Name  Age      City  Salary
0    Alice   25  New York   50000
2  Charlie   28  New York   55000
4      Eve   22  New York   48000
- DataFrame with 'Salary_USD' column:
      Name  Age         City  Salary  Salary_USD
0    Alice   25     New York   50000     60000.0
1      Bob   32  Los Angeles   60000     72000.0
2  Charlie   28     New York   55000     66000.0
3    David   35      Chicago   70000     84000.0
4      Eve   22     New York   48000     57600.0
5    Frank   40  Los Angeles   75000     90000.0
- Average Age:
30.333333333333332
- Average Age by City:
City
Chicago        35.0
Los Angeles    36.0
New York       25.0
Name: Age, dtype: float64
Example 2: (Edge case with empty input)
Input Data:
data = []
Expected Output:
- First 5 rows: (Empty DataFrame)
- Last 3 rows: (Empty DataFrame)
- Shape:
(0, 0)
- Data Types: (Display varies by pandas version; generally an empty Series, since no columns can be inferred from an empty list)
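- 'Name' and 'Age' columns: (Empty DataFrame. Note that pd.DataFrame([]) has no columns, so this and the other column-based operations raise a KeyError unless the columns are declared when the DataFrame is created.)
- Individuals older than 30: (Empty DataFrame)
- Individuals in 'New York': (Empty DataFrame)
- DataFrame with 'Salary_USD' column: (Empty DataFrame; the 'Salary_USD' column appears only if a 'Salary' column exists)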
- Average Age:
NaN (or an error if there is no data to compute the average)
- Average Age by City: (Empty Series or DataFrame)
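One way to keep the column-based operations from failing on empty input is to declare the expected columns when building the DataFrame. The sketch below assumes the four column names from Example 1; `EXPECTED_COLUMNS`, `build_frame`, and `average_age` are hypothetical helpers introduced for illustration, not part of the challenge. Note that this approach yields a shape of (0, 4) rather than (0, 0).

```python
import pandas as pd

# Assumed schema, taken from the column names in Example 1.
EXPECTED_COLUMNS = ["Name", "Age", "City", "Salary"]


def build_frame(data):
    """Build a DataFrame whose columns exist even when `data` is empty."""
    return pd.DataFrame(data, columns=EXPECTED_COLUMNS)


def average_age(df):
    """Return the mean of 'Age', or NaN when there are no rows."""
    if df.empty:
        return float("nan")
    return df["Age"].mean()


empty = build_frame([])
print(empty.shape)                    # (0, 4): no rows, but the columns are present
print(empty[["Name", "Age"]].empty)   # True; selection works because the columns exist
print(average_age(empty))             # nan
```

The explicit `df.empty` guard is a design choice: it makes the empty-input result predictable instead of relying on how a particular pandas version averages a column with no values.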
Constraints
- The input will always be a list of dictionaries.
- Each dictionary in the list represents a row in the DataFrame.
- The keys of the dictionaries will be the column names.
- Assume all numerical columns are of appropriate numeric types (integers or floats).
- The solution should be implemented using the pandas library.
Notes
- You will need to import the pandas library.
- Pay attention to the output format for each operation.
- For operations that might result in an empty DataFrame or Series (like filtering an empty input), ensure your code handles this gracefully.
- The Salary_USD calculation should result in floating-point numbers.
- The grouping operation should return a pandas Series where the index is the city name and the values are the average ages.
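The derived column and the two aggregations can be sketched as below, using the Example 1 data and the 1.2 factor given in the problem description. This is one possible approach under those assumptions; the variable names are illustrative.

```python
import pandas as pd

# Example 1 data from this challenge.
data = [
    {"Name": "Alice", "Age": 25, "City": "New York", "Salary": 50000},
    {"Name": "Bob", "Age": 32, "City": "Los Angeles", "Salary": 60000},
    {"Name": "Charlie", "Age": 28, "City": "New York", "Salary": 55000},
    {"Name": "David", "Age": 35, "City": "Chicago", "Salary": 70000},
    {"Name": "Eve", "Age": 22, "City": "New York", "Salary": 48000},
    {"Name": "Frank", "Age": 40, "City": "Los Angeles", "Salary": 75000},
]
df = pd.DataFrame(data)

# Multiplying an int64 column by a Python float produces a float64 column.
df["Salary_USD"] = df["Salary"] * 1.2

average_age = df["Age"].mean()                   # scalar mean over all rows
age_by_city = df.groupby("City")["Age"].mean()   # Series indexed by city name

print(df["Salary_USD"].dtype)   # float64
print(average_age)
print(age_by_city)
```

Selecting `["Age"]` before calling `.mean()` is what makes the grouped result a Series. Calling `.mean()` on the whole grouped frame would return a DataFrame instead and, in recent pandas versions, would need `numeric_only=True` to skip the text columns.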