
DataFrame Data Exploration and Manipulation

This challenge will test your ability to perform fundamental data exploration and manipulation using Python's pandas library. You will be tasked with creating a DataFrame from provided data and then executing various operations to extract, filter, and transform information. This is a crucial skill for anyone working with data analysis and machine learning.

Problem Description

Your task is to create a pandas DataFrame from a given list of dictionaries. After creating the DataFrame, you will need to perform the following operations:

  1. Display the first 5 rows of the DataFrame.
  2. Display the last 3 rows of the DataFrame.
  3. Display the shape (number of rows and columns) of the DataFrame.
  4. Display the data type of each column.
  5. Select and display the 'Name' and 'Age' columns.
  6. Filter the DataFrame to show only individuals older than 30.
  7. Filter the DataFrame to show only individuals living in 'New York'.
  8. Add a new column named 'Salary_USD' which is 1.2 times the 'Salary' column (assuming 'Salary' is in a different currency).
  9. Calculate and display the average age of all individuals.
  10. Group the data by 'City' and calculate the average 'Age' for each city.
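
The ten operations above can be sketched as a single helper. This is one possible arrangement rather than a reference solution: the function name `explore` and the two-row `sample` list are hypothetical, and the column names are assumed to match those in the task description.

```python
import pandas as pd

# Hypothetical two-row sample, used only to make the sketch runnable.
sample = [
    {"Name": "Ann", "Age": 31, "City": "New York", "Salary": 1000},
    {"Name": "Ben", "Age": 29, "City": "Boston", "Salary": 2000},
]


def explore(data):
    """Run the ten operations and return the results keyed by step number."""
    df = pd.DataFrame(data)
    results = {
        1: df.head(5),                    # first 5 rows
        2: df.tail(3),                    # last 3 rows
        3: df.shape,                      # (rows, columns)
        4: df.dtypes,                     # dtype of each column
        5: df[["Name", "Age"]],           # column selection
        6: df[df["Age"] > 30],            # boolean filter on Age
        7: df[df["City"] == "New York"],  # boolean filter on City
    }
    df["Salary_USD"] = df["Salary"] * 1.2  # step 8: derived column
    results[8] = df
    results[9] = df["Age"].mean()          # step 9: overall average age
    results[10] = df.groupby("City")["Age"].mean()  # step 10: average per city
    return results


results = explore(sample)
for step, value in results.items():
    print(f"{step}.")
    print(value)
```

Collecting the results in a dictionary keeps each step separate, so the output of one operation can be inspected without re-running the others.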

Examples

Example 1:

Input Data (list of dictionaries):

data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York', 'Salary': 50000},
    {'Name': 'Bob', 'Age': 32, 'City': 'Los Angeles', 'Salary': 60000},
    {'Name': 'Charlie', 'Age': 28, 'City': 'New York', 'Salary': 55000},
    {'Name': 'David', 'Age': 35, 'City': 'Chicago', 'Salary': 70000},
    {'Name': 'Eve', 'Age': 22, 'City': 'New York', 'Salary': 48000},
    {'Name': 'Frank', 'Age': 40, 'City': 'Los Angeles', 'Salary': 75000}
]

Expected Output (after performing all operations, shown sequentially):

  1. First 5 rows:
          Name  Age         City  Salary
    0    Alice   25     New York   50000
    1      Bob   32  Los Angeles   60000
    2  Charlie   28     New York   55000
    3    David   35      Chicago   70000
    4      Eve   22     New York   48000
    
  2. Last 3 rows:
        Name  Age         City  Salary
    3  David   35      Chicago   70000
    4    Eve   22     New York   48000
    5  Frank   40  Los Angeles   75000
    
  3. Shape: (6, 4)
  4. Data Types:
    Name      object
    Age        int64
    City      object
    Salary     int64
    dtype: object
    
  5. 'Name' and 'Age' columns:
          Name  Age
    0    Alice   25
    1      Bob   32
    2  Charlie   28
    3    David   35
    4      Eve   22
    5    Frank   40
    
  6. Individuals older than 30:
        Name  Age         City  Salary
    1    Bob   32  Los Angeles   60000
    3  David   35      Chicago   70000
    5  Frank   40  Los Angeles   75000
    
  7. Individuals in 'New York':
          Name  Age      City  Salary
    0    Alice   25  New York   50000
    2  Charlie   28  New York   55000
    4      Eve   22  New York   48000
    
  8. DataFrame with 'Salary_USD' column:
          Name  Age         City  Salary  Salary_USD
    0    Alice   25     New York   50000     60000.0
    1      Bob   32  Los Angeles   60000     72000.0
    2  Charlie   28     New York   55000     66000.0
    3    David   35      Chicago   70000     84000.0
    4      Eve   22     New York   48000     57600.0
    5    Frank   40  Los Angeles   75000     90000.0
    
  9. Average Age: 30.333333333333332 (182 / 6, approximately 30.33)
  10. Average Age by City:
    City
    Chicago        35.0
    Los Angeles    36.0
    New York       25.0
    Name: Age, dtype: float64
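
The aggregate figures for Example 1 can be checked directly. The six ages sum to 182, so the overall mean is 182 / 6, and each city mean follows from its own rows. The sketch below recomputes them from the example data; the variable names are illustrative only.

```python
import pandas as pd

data = [
    {"Name": "Alice", "Age": 25, "City": "New York", "Salary": 50000},
    {"Name": "Bob", "Age": 32, "City": "Los Angeles", "Salary": 60000},
    {"Name": "Charlie", "Age": 28, "City": "New York", "Salary": 55000},
    {"Name": "David", "Age": 35, "City": "Chicago", "Salary": 70000},
    {"Name": "Eve", "Age": 22, "City": "New York", "Salary": 48000},
    {"Name": "Frank", "Age": 40, "City": "Los Angeles", "Salary": 75000},
]
df = pd.DataFrame(data)

# Overall mean: (25 + 32 + 28 + 35 + 22 + 40) / 6 = 182 / 6
average_age = df["Age"].mean()
print(round(average_age, 2))  # 30.33

# Per-city means: New York (25 + 28 + 22) / 3, Los Angeles (32 + 40) / 2, Chicago 35
age_by_city = df.groupby("City")["Age"].mean()
print(age_by_city)
```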
    

Example 2: (Edge case with empty input)

Input Data:

data = []

Expected Output:

  1. First 5 rows: (Empty DataFrame)
  2. Last 3 rows: (Empty DataFrame)
  3. Shape: (0, 0)
  4. Data Types: (An empty Series, because a DataFrame built from an empty list has no columns)
  5. 'Name' and 'Age' columns: (Empty DataFrame)
  6. Individuals older than 30: (Empty DataFrame)
  7. Individuals in 'New York': (Empty DataFrame)
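  8. DataFrame with 'Salary_USD' column: (Empty DataFrame; an empty 'Salary_USD' column appears only if a 'Salary' column exists to derive it from)
  9. Average Age: NaN (pandas returns NaN for the mean of an empty numeric column; if there is no 'Age' column at all, accessing it raises a KeyError unless handled)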
  10. Average Age by City: (Empty Series or DataFrame)
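
One way to get the empty-input behaviour described above is to pass the expected column names explicitly, since `pd.DataFrame([])` on its own has no columns and selecting 'Name' or 'Age' from it raises a `KeyError`. The sketch below is a possible approach, not the required one. The `COLUMNS` list and the integer dtypes are assumptions taken from Example 1.

```python
import pandas as pd

# Assumed schema, taken from Example 1.
COLUMNS = ["Name", "Age", "City", "Salary"]

bare = pd.DataFrame([])        # no rows and no columns
print(bare.shape)              # (0, 0)

try:
    bare[["Name", "Age"]]      # the columns do not exist
except KeyError as exc:
    print("KeyError:", exc)

# Supplying the columns (and numeric dtypes) keeps later steps well defined.
empty = pd.DataFrame([], columns=COLUMNS).astype({"Age": "int64", "Salary": "int64"})
print(empty.shape)             # (0, 4)

older_than_30 = empty[empty["Age"] > 30]   # empty result, no error
empty["Salary_USD"] = empty["Salary"] * 1.2
mean_age = empty["Age"].mean()             # mean of an empty numeric column
print(older_than_30.empty, mean_age)
```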

Constraints

  • The input will always be a list of dictionaries.
  • Each dictionary in the list represents a row in the DataFrame.
  • The keys of the dictionaries will be the column names.
  • Assume all numerical columns are of appropriate numeric types (integers or floats).
  • The solution should be implemented using the pandas library.

Notes

  • You will need to import the pandas library.
  • Pay attention to the output format for each operation.
  • For operations that might result in an empty DataFrame or Series (like filtering an empty input), ensure your code handles this gracefully.
  • The Salary_USD calculation should result in floating-point numbers.
  • The grouping operation should return a pandas Series where the index is the city name and the values are the average ages.
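
The last two notes can be confirmed with a quick check. The sketch below uses the first two rows of Example 1 and illustrative variable names. It shows that multiplying an integer column by 1.2 yields `float64`, and that selecting a single column after `groupby` returns a Series, whereas selecting a list of columns returns a DataFrame.

```python
import pandas as pd

df = pd.DataFrame([
    {"Name": "Alice", "Age": 25, "City": "New York", "Salary": 50000},
    {"Name": "Bob", "Age": 32, "City": "Los Angeles", "Salary": 60000},
])

# int64 column times a Python float gives a float64 column.
df["Salary_USD"] = df["Salary"] * 1.2
print(df["Salary_USD"].dtype)            # float64

# Single-column selection after groupby returns a Series indexed by City.
age_by_city = df.groupby("City")["Age"].mean()
print(type(age_by_city).__name__)        # Series
print(age_by_city.index.name)            # City

# Selecting with a list of columns returns a DataFrame instead.
age_frame = df.groupby("City")[["Age"]].mean()
print(type(age_frame).__name__)          # DataFrame
```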