10.2.3 Data Manipulation Explained

Key Concepts

Data manipulation in Python, shown here with the pandas library, rests on five core operations:

Filtering Data
Sorting Data
Grouping Data
Aggregating Data
Merging and Joining Data

1. Filtering Data

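Filtering selects the rows (or columns) that satisfy a condition, which isolates the information relevant to a question. In pandas, a comparison such as df['Age'] > 30 produces a Boolean mask, and indexing the DataFrame with that mask keeps only the rows where the mask is True.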

Example:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)

# Filtering rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)

Analogy: Think of filtering as picking out specific fruits from a basket based on their color or size.
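
Real filters often combine several conditions or keep only some columns. The following is a minimal sketch of those variations on the same sample data; the variable names (over_30_not_houston, names_and_cities, in_ny_or_la) are illustrative choices, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
})

# Combine conditions with & (and) or | (or).
# Each condition needs its own parentheses.
over_30_not_houston = df[(df['Age'] > 30) & (df['City'] != 'Houston')]

# .loc filters rows and selects columns in one step.
names_and_cities = df.loc[df['Age'] >= 30, ['Name', 'City']]

# isin() keeps rows whose value appears in a given list.
in_ny_or_la = df[df['City'].isin(['New York', 'Los Angeles'])]

print(over_30_not_houston)
print(names_and_cities)
print(in_ny_or_la)
```

Python's plain and/or keywords do not work element-wise on a pandas Series, which is why the example uses & and | instead.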

2. Sorting Data

Sorting arranges rows by the values of one or more columns, which makes extremes, ranges, and rankings easy to see. sort_values returns a new DataFrame and leaves the original unchanged.

Example:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)

# Sorting by Age in ascending order
sorted_df = df.sort_values(by='Age')
print(sorted_df)

Analogy: Sorting is like arranging books on a shelf alphabetically by their titles.
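
Sorting in descending order or by several columns uses the same method. The sketch below assumes one extra, made-up row ('Eve') so that two people share an age and the tie-breaking column has something to do.

```python
import pandas as pd

df = pd.DataFrame({
    # 'Eve' is a hypothetical extra row added to create a tie on Age
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Chicago'],
})

# Descending order: oldest first
oldest_first = df.sort_values(by='Age', ascending=False)

# Sort by Age descending, then break ties by Name ascending
by_age_then_name = df.sort_values(by=['Age', 'Name'], ascending=[False, True])

# The sorted frame keeps the original index labels;
# reset_index(drop=True) renumbers the rows from 0
by_age_then_name = by_age_then_name.reset_index(drop=True)

print(oldest_first)
print(by_age_then_name)
```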

3. Grouping Data

Grouping splits the data into groups of rows that share a value in one or more columns, applies a function to each group, and combines the results into one output.

Example:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)

# Grouping by City and calculating the mean age per city
# (each city appears only once in this sample, so each mean equals that person's age)
grouped_df = df.groupby('City')['Age'].mean()
print(grouped_df)

Analogy: Grouping is like categorizing items in a store by their type and calculating the average price for each category.
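
Grouping is most informative when several rows share a key. The sketch below uses a modified, hypothetical version of the sample data in which cities repeat, so each group holds more than one row; the extra names and the changed cities are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical data: cities repeat so that groups contain several rows
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 40, 45, 25],
    'City': ['New York', 'Chicago', 'Chicago', 'Houston', 'New York', 'Chicago'],
})

# Mean age per city: one value per group
mean_age = df.groupby('City')['Age'].mean()

# Number of rows in each group
group_sizes = df.groupby('City').size()

print(mean_age)
print(group_sizes)
```

Here the Chicago group contains three rows (ages 30, 35, and 25), so its mean is a genuine average rather than a single person's age.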

4. Aggregating Data

Aggregating reduces many values to a single summary statistic. Common aggregation functions include sum, mean, count, min, and max.

Example:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)

# Calculating the total and average age
total_age = df['Age'].sum()
average_age = df['Age'].mean()
print("Total Age:", total_age)
print("Average Age:", average_age)

Analogy: Aggregating is like counting the total number of items in a shopping cart and calculating the average price per item.
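
Several statistics can also be computed in one call. This is a minimal sketch on the same sample data, using the Series methods agg, describe, and nunique; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
})

# agg() with a list of function names returns one value per function
age_stats = df['Age'].agg(['min', 'max', 'mean', 'count'])

# describe() gives a standard statistical summary of a numeric column
age_summary = df['Age'].describe()

# nunique() counts the distinct values in a column
unique_cities = df['City'].nunique()

print(age_stats)
print(age_summary)
print("Unique cities:", unique_cities)
```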

5. Merging and Joining Data

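Merging and joining combine two or more datasets by matching rows on one or more shared columns (keys). This is useful for integrating data from different sources. By default, pd.merge performs an inner join, so only keys present in both DataFrames appear in the result.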

Example:

import pandas as pd

data1 = {'Name': ['Alice', 'Bob', 'Charlie'],
         'Age': [25, 30, 35]}
df1 = pd.DataFrame(data1)

data2 = {'Name': ['Alice', 'Bob', 'David'],
         'City': ['New York', 'Los Angeles', 'Houston']}
df2 = pd.DataFrame(data2)

# Merging dataframes on the 'Name' column
# (inner join by default: only Alice and Bob, who appear in both, are kept)
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)

Analogy: Merging is like combining two spreadsheets by matching entries based on a common column, such as a customer ID.
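
The how parameter of pd.merge controls which keys survive the join. The sketch below reuses df1 and df2 from the example and compares inner, left, and outer joins; indicator=True adds a _merge column recording where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                    'Age': [25, 30, 35]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'David'],
                    'City': ['New York', 'Los Angeles', 'Houston']})

# Inner join (the default): only names found in both frames
inner = pd.merge(df1, df2, on='Name', how='inner')

# Left join: every row of df1; City is NaN where df2 has no match
left = pd.merge(df1, df2, on='Name', how='left')

# Outer join: every name from either frame; _merge shows the source
outer = pd.merge(df1, df2, on='Name', how='outer', indicator=True)

print(inner)
print(left)
print(outer)
```

In the outer result, Charlie is marked left_only and David right_only, which makes unmatched keys easy to spot before deciding which join type fits the task.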

Putting It All Together

In practice these operations are combined: filter to the rows of interest, sort them, summarize by group, and merge in data from other sources.

Example:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)

# Filtering and sorting
filtered_sorted_df = df[df['Age'] > 30].sort_values(by='Age')

# Grouping and aggregating
grouped_df = df.groupby('City')['Age'].agg(['mean', 'sum'])

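# Merging with another dataset
# df already has a 'City' column, so merge only its 'Name' and 'Age' columns;
# otherwise the result would contain duplicated 'City_x' and 'City_y' columns
data2 = {'Name': ['Alice', 'Bob', 'David'],
         'City': ['New York', 'Los Angeles', 'Houston']}
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df[['Name', 'Age']], df2, on='Name')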

print("Filtered and Sorted Data:\n", filtered_sorted_df)
print("Grouped and Aggregated Data:\n", grouped_df)
print("Merged Data:\n", merged_df)