14  Pandas

14.1 Introduction to Pandas

Data analysis often involves handling large volumes of structured data and performing operations such as data cleaning, transformation, and statistical analysis. The ability to efficiently manipulate data is crucial for extracting meaningful insights. Pandas is a powerful open-source library in Python specifically designed to make these tasks easier and more intuitive.

14.1.1 Why Use Pandas?

Pandas, short for “Panel Data,” is built on top of NumPy and provides high-level data manipulation tools that are both fast and versatile. It is particularly well-suited for handling and analyzing data in the form of tables, such as those found in spreadsheets or SQL databases. The two primary data structures in Pandas, the Series and the DataFrame, offer powerful ways to organize and manipulate data, making it possible to perform complex data analysis with just a few lines of code.

The development of Pandas by Wes McKinney in 2008 was driven by the need for a flexible and efficient tool that could handle the demands of modern data analysis. Since its creation, Pandas has become the de facto standard for data manipulation in Python, widely used in fields such as finance, statistics, machine learning, and scientific research.

14.1.2 Key Features of Pandas

Pandas offers several key features that make it an essential tool for data analysis:

  1. Data Alignment and Indexing: Pandas automatically aligns data during arithmetic operations, handling missing data with ease. This feature ensures that data manipulations are efficient and that operations can be performed across different datasets without losing track of the relationships between data points.

  2. Handling Missing Data: In real-world datasets, missing data is a common occurrence. Pandas provides robust methods for detecting, handling, and imputing missing values, allowing analysts to maintain data integrity and make informed decisions.

  3. Data Cleaning and Preparation: Pandas simplifies the process of data cleaning, including handling duplicates, filtering data, and transforming variables into the desired formats. Clean data is essential for producing accurate and reliable analysis results.

  4. Flexible Data Structures: The Series and DataFrame objects in Pandas are highly flexible. They can handle different data types, including integers, floats, strings, and even objects, all within the same data structure. This flexibility is crucial when dealing with heterogeneous datasets.

  5. Integration with Other Libraries: Pandas integrates seamlessly with other Python libraries such as NumPy, Matplotlib, and SciPy, making it easy to combine data analysis with numerical operations, visualization, and statistical techniques.

  6. High Performance: Pandas is optimized for performance with large datasets. Its operations are built on top of the highly efficient NumPy library, allowing for quick execution of complex data manipulations.

14.2 Series

A Series in Pandas is a one-dimensional array-like object that can hold data of any type (integers, strings, floating-point numbers, etc.). A Series is similar to a single column in a spreadsheet or a database table. Each element in a Series has an associated label called an index, which allows for intuitive data handling and alignment.

14.2.1 Creating a Series

You can create a Series in Pandas using the pd.Series() function, where pd is the standard alias for the Pandas library.

import pandas as pd

# Creating a Series from a list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)
0    10
1    20
2    30
3    40
4    50
dtype: int64

Here, the numbers on the left represent the index of the Series, and the numbers on the right are the data values.

Accessing Elements in a Series

You can access individual elements in a Series by their index label. With the default integer index, the label of the third element is 2:

value = series[2]
print(value)
30
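Series elements can also be addressed explicitly by label or by position. A brief sketch with a custom string index (the labels here are illustrative):

```python
import pandas as pd

# A Series with a custom string index instead of the default 0, 1, 2, ...
temps = pd.Series([18, 21, 19], index=['Mon', 'Tue', 'Wed'])

by_label = temps.loc['Tue']    # label-based access
by_position = temps.iloc[1]    # position-based access
print(by_label, by_position)   # both refer to the same element: 21
```

Using .loc[] and .iloc[] makes the intent explicit, which avoids ambiguity when the index itself contains integers.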

14.2.2 Handling Missing Data in a Series

Missing data is a common issue when dealing with real-world datasets. Pandas represents missing values as NaN (Not a Number), and it provides built-in methods to handle these missing values.

Creating a Series with Missing Data

# Creating a Series with missing data
data = [10, 20, None, 40, 50]
series_with_nan = pd.Series(data)
print(series_with_nan)
0    10.0
1    20.0
2     NaN
3    40.0
4    50.0
dtype: float64

In this example, the third value is None, which Pandas automatically converts to NaN to signify missing data. Note that the dtype also changes from int64 to float64, because NaN is a floating-point value.

Handling Missing Data

Pandas provides several methods to handle missing data in a Series:

  • Removing Missing Values: You can use the dropna() method to remove any missing values from the Series.
# Removing missing values
clean_series = series_with_nan.dropna()
print(clean_series)
0    10.0
1    20.0
3    40.0
4    50.0
dtype: float64
  • Filling Missing Values: The fillna() method allows you to replace missing values with a specified value.
# Filling missing values with a specific number
filled_series = series_with_nan.fillna(0)
print(filled_series)
0    10.0
1    20.0
2     0.0
3    40.0
4    50.0
dtype: float64

These methods provide flexibility in handling missing data, depending on the specific requirements of your analysis.
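Before deciding whether to drop or fill, it is often useful to detect missing values first. A minimal sketch using isna():

```python
import pandas as pd

series_with_nan = pd.Series([10, 20, None, 40, 50])

mask = series_with_nan.isna()   # boolean Series: True where a value is missing
missing_count = mask.sum()      # True counts as 1, so this counts the NaNs
print(missing_count)            # 1
```

The complementary method notna() returns True for the values that are present.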

14.2.3 Handling Duplicates in a Series

Duplicates can often occur in data, and it is essential to handle them to ensure the accuracy of your analysis. Pandas makes it easy to identify and remove duplicate values in a Series.

Identifying Duplicates

You can use the duplicated() method to identify duplicate values in a Series. This method returns a boolean Series indicating whether each value is a duplicate.

# Creating a Series with duplicate values
data = [10, 20, 20, 30, 40, 40, 50]
series_with_duplicates = pd.Series(data)

# Identifying duplicates
duplicates = series_with_duplicates.duplicated()
print(duplicates)
0    False
1    False
2     True
3    False
4    False
5     True
6    False
dtype: bool

Removing Duplicates

To remove duplicate values, you can use the drop_duplicates() method:

# Removing duplicate values
unique_series = series_with_duplicates.drop_duplicates()
print(unique_series)
0    10
1    20
3    30
4    40
6    50
dtype: int64

This method retains only the first occurrence of each value, removing all subsequent duplicates.
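By default drop_duplicates() keeps the first occurrence, but this behavior can be changed with the keep parameter. A short sketch:

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30, 40, 40, 50])

first_kept = s.drop_duplicates()            # default: keep='first'
last_kept = s.drop_duplicates(keep='last')  # keep the last occurrence instead
print(list(first_kept.index))               # [0, 1, 3, 4, 6]
print(list(last_kept.index))                # [0, 2, 3, 5, 6]
```

Passing keep=False drops every value that occurs more than once, keeping only values that were unique to begin with.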

14.2.4 Performing Numeric Computations with a Series

One of the strengths of a Pandas Series is its ability to perform vectorized operations, making numeric computations fast and intuitive. You can perform arithmetic operations on a Series just as you would with individual numbers.

Basic Arithmetic Operations

Pandas supports element-wise operations, allowing you to perform calculations directly on a Series.

# Creating a Series
data = [10, 20, 30, 40, 50]
numeric_series = pd.Series(data)

# Performing arithmetic operations
sum_series = numeric_series + 5
product_series = numeric_series * 2
squared_series = numeric_series ** 2

print("Sum Series:\n", sum_series)
print("Product Series:\n", product_series)
print("Squared Series:\n", squared_series)
Sum Series:
 0    15
1    25
2    35
3    45
4    55
dtype: int64
Product Series:
 0     20
1     40
2     60
3     80
4    100
dtype: int64
Squared Series:
 0     100
1     400
2     900
3    1600
4    2500
dtype: int64

Summary Statistics

Pandas also provides convenient methods for calculating summary statistics on a Series, such as the mean, median, and standard deviation.

# Calculating summary statistics
mean_value = numeric_series.mean()
median_value = numeric_series.median()
std_dev = numeric_series.std()

print("Mean:", mean_value)
print("Median:", median_value)
print("Standard Deviation:", std_dev)
Mean: 30.0
Median: 30.0
Standard Deviation: 15.811388300841896

These built-in methods allow you to quickly analyze the statistical properties of your data, providing insights with minimal code.
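The describe() method bundles these statistics (and a few more) into a single call; a quick sketch:

```python
import pandas as pd

numeric_series = pd.Series([10, 20, 30, 40, 50])

stats = numeric_series.describe()   # count, mean, std, min, quartiles, max
print(stats['count'], stats['mean'], stats['max'])   # 5.0 30.0 50.0
```

The result is itself a Series, so individual statistics can be pulled out by label.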

14.3 DataFrame

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is similar to a table in a database or an Excel spreadsheet. DataFrames are one of the most commonly used data structures in data analysis and provide a powerful tool for working with structured data in Python.

14.3.1 Creating a DataFrame

DataFrames can be created in various ways, such as from dictionaries, lists, or external data sources like CSV files. Here’s an example of creating a DataFrame from a dictionary:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)
print(df)
      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston

The DataFrame structure consists of rows and columns, where each column is itself a Series that shares the DataFrame's row index.
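Because a single column is a Series, all the Series operations from Section 14.2 apply directly to it. A brief sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

ages = df['Age']              # selecting a single column returns a Series
print(type(ages).__name__)    # Series
print(ages.max())             # 32
```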

14.3.2 Indexing DataFrames

Indexing in Pandas DataFrames is a way to access, filter, and manipulate data efficiently. DataFrames support multiple methods of indexing, which can be used to select specific rows, columns, or even individual values. Understanding how to use indexing properly is essential for effective data analysis.

Types of Indexing in DataFrames

Pandas provides several ways to index a DataFrame:

  1. Label-based indexing using .loc[]
  2. Integer-based indexing using .iloc[]
  3. Boolean indexing

Label-based Indexing with .loc[]

The .loc[] method is used to select data based on the labels of rows and columns. It allows you to access a group of rows and columns by labels or a boolean array.

import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)

# Using .loc[] to select data by labels
# Selecting the row with label 1 (Bob) and specific columns
selected_data = df.loc[1, ['Name', 'City']]
print(selected_data)
Name            Bob
City    Los Angeles
Name: 1, dtype: object

In this example, the .loc[] method selects the data from the row with index label 1 and returns the values from the Name and City columns.

Integer-based Indexing with .iloc[]

The .iloc[] method is used to select data by position, similar to how you would index arrays in Python. It uses integer-based positions to access data.

# Using .iloc[] to select data by position
# Selecting the first two rows and the first two columns
selected_data = df.iloc[0:2, 0:2]
print(selected_data)
    Name  Age
0  Alice   24
1    Bob   27

Here, .iloc[] selects the first two rows (positions 0 and 1; the end of the slice is exclusive) and the first two columns, providing a quick way to subset the DataFrame by integer position.

Boolean Indexing

Boolean indexing allows you to filter data in a DataFrame based on conditions. This is useful for selecting rows that meet specific criteria.

# Using Boolean indexing to filter rows where Age is greater than 25
filtered_data = df[df['Age'] > 25]
print(filtered_data)
    Name  Age         City
1    Bob   27  Los Angeles
3  David   32      Houston

In this example, we used a condition (df['Age'] > 25) to filter the DataFrame, returning only the rows where the Age column has values greater than 25.

14.3.3 Handling Missing Data in a DataFrame

Missing data is a frequent issue in datasets. Pandas provides several methods to handle missing data, allowing you to clean and prepare your data efficiently.

Creating a DataFrame with Missing Data

# Creating a DataFrame with missing data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, None, 22, 32],
    'City': ['New York', 'Los Angeles', None, 'Houston']
}

df_with_nan = pd.DataFrame(data)
print(df_with_nan)
      Name   Age         City
0    Alice  24.0     New York
1      Bob   NaN  Los Angeles
2  Charlie  22.0         None
3    David  32.0      Houston

Handling Missing Data

Pandas provides several ways to handle missing data in a DataFrame:

  • Removing Missing Values: Use the dropna() method to remove any rows or columns with missing values.
# Dropping rows with missing values
cleaned_df = df_with_nan.dropna()
print(cleaned_df)
    Name   Age      City
0  Alice  24.0  New York
3  David  32.0   Houston

In this example, only the rows without missing values are retained.

  • Filling Missing Values: Use the fillna() method to replace missing values with a specified value, such as a constant or a computed statistic like the column mean.
# Filling missing values in the Age column with the average age
df_filled = df_with_nan.fillna({'Age': df_with_nan['Age'].mean(), 'City': 'Unknown'})
print(df_filled)
      Name   Age         City
0    Alice  24.0     New York
1      Bob  26.0  Los Angeles
2  Charlie  22.0      Unknown
3    David  32.0      Houston

In this example, missing values in the Age column are filled with the mean of the existing ages, while missing values in the City column are replaced with “Unknown.”
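Besides filling with a fixed value or a statistic, missing values can also be propagated from neighboring entries. A minimal sketch using forward fill (ffill()), which copies the last valid value downward:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

filled = s.ffill()       # each NaN takes the last preceding valid value
print(filled.tolist())   # [1.0, 1.0, 1.0, 4.0]
```

The counterpart bfill() fills backward from the next valid value instead; forward fill is common for time-ordered data where the last observation is a reasonable stand-in.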

14.3.4 Handling Duplicates in a DataFrame

Duplicate data can often cause inaccuracies in analysis. Pandas makes it easy to identify and remove duplicates.

Identifying Duplicates

You can use the duplicated() method to find duplicate rows in the DataFrame.

# Creating a DataFrame with duplicate rows
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice'],
    'Age': [24, 27, 22, 32, 24],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York']
}

df_with_duplicates = pd.DataFrame(data)

# Identifying duplicate rows
duplicates = df_with_duplicates.duplicated()
print(duplicates)
0    False
1    False
2    False
3    False
4     True
dtype: bool

Removing Duplicates

To remove duplicate rows, use the drop_duplicates() method:

# Removing duplicate rows
df_unique = df_with_duplicates.drop_duplicates()
print(df_unique)
      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston

This removes the duplicate occurrence of Alice from the DataFrame.

14.3.5 Performing Numeric Computations with a DataFrame

DataFrames support a wide range of arithmetic operations that can be performed on numerical data, making it easy to compute and analyze your data.

Basic Arithmetic Operations

You can perform operations on individual columns or the entire DataFrame.

# Creating a DataFrame
data = {
    'Product': ['A', 'B', 'C'],
    'Price': [100, 150, 200],
    'Quantity': [3, 4, 5]
}

df = pd.DataFrame(data)

# Calculating the total cost for each product
df['Total Cost'] = df['Price'] * df['Quantity']
print(df)
  Product  Price  Quantity  Total Cost
0       A    100         3         300
1       B    150         4         600
2       C    200         5        1000

This example demonstrates how to create a new column (Total Cost) by multiplying the Price and Quantity columns.

Summary Statistics

Pandas makes it easy to calculate summary statistics for a DataFrame using built-in methods like mean(), sum(), and max().

# Calculating summary statistics
average_price = df['Price'].mean()
total_quantity = df['Quantity'].sum()

print("Average Price:", average_price)
print("Total Quantity Sold:", total_quantity)
Average Price: 150.0
Total Quantity Sold: 12

These operations allow you to quickly analyze data trends and obtain insights with minimal code.

14.4 Sorting, Filtering, and Merging DataFrames

When working with data, it is often necessary to sort, filter, and merge datasets to analyze information efficiently. Pandas provides intuitive methods for these operations, allowing you to organize and manipulate data easily.

14.4.1 Sorting DataFrames

Sorting data is essential when you want to analyze your DataFrame in a specific order. Pandas allows you to sort DataFrames by their row indices or column values using the sort_values() and sort_index() methods.

Sorting by Column Values

You can sort a DataFrame by the values in one or more columns using the sort_values() method.

import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)

# Sorting the DataFrame by the 'Age' column in ascending order
sorted_df = df.sort_values(by='Age')
print(sorted_df)
      Name  Age         City
2  Charlie   22      Chicago
0    Alice   24     New York
1      Bob   27  Los Angeles
3    David   32      Houston

This example sorts the DataFrame by the Age column, arranging the rows in ascending order. You can sort in descending order by setting the parameter ascending=False.
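sort_values() also accepts a list of columns: rows are ordered by the first column, and ties are broken by the next. A short sketch, sorting by City and breaking ties with Age in descending order:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

# One ascending flag per sort column
multi_sorted = df.sort_values(by=['City', 'Age'], ascending=[True, False])
print(multi_sorted['City'].tolist())
# ['Chicago', 'Houston', 'Los Angeles', 'New York']
```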

Sorting by Index

To sort the DataFrame by its index, use the sort_index() method.

# Sorting the DataFrame by its index in descending order
sorted_by_index = df.sort_index(ascending=False)
print(sorted_by_index)
      Name  Age         City
3    David   32      Houston
2  Charlie   22      Chicago
1      Bob   27  Los Angeles
0    Alice   24     New York

This arranges the DataFrame in descending order based on the row index.

14.4.2 Filtering DataFrames

Filtering data allows you to extract specific rows that meet particular conditions. This is useful for analyzing subsets of your data.

Filtering Rows Based on a Condition

You can filter a DataFrame to include only the rows that meet a specific condition using boolean indexing.

# Filtering rows where the Age is greater than 25
filtered_df = df[df['Age'] > 25]
print(filtered_df)
    Name  Age         City
1    Bob   27  Los Angeles
3  David   32      Houston

In this example, only the rows where the Age column has a value greater than 25 are included.

Filtering with Multiple Conditions

You can also filter rows based on multiple conditions using the logical operators & (and), | (or), and ~ (not).

# Filtering rows where Age is greater than 25 and City is 'Los Angeles'
filtered_df = df[(df['Age'] > 25) & (df['City'] == 'Los Angeles')]
print(filtered_df)
  Name  Age         City
1  Bob   27  Los Angeles

This filters the DataFrame to include only rows where the Age is greater than 25 and the City is ‘Los Angeles’.
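When a condition compares a column against several allowed values, chaining | operators becomes verbose; the isin() method is a compact alternative. A brief sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

# Keep only the rows whose City is one of the listed values
subset = df[df['City'].isin(['Chicago', 'Houston'])]
print(subset['Name'].tolist())   # ['Charlie', 'David']
```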

14.4.3 Merging DataFrames

Merging is the process of combining multiple DataFrames into a single DataFrame based on a common key or index. Pandas provides several methods for merging, including merge(), concat(), and join().

Merging DataFrames with merge()

The merge() function is similar to SQL joins and is used to combine DataFrames based on a key column.

# Creating two DataFrames to merge
data1 = {'Name': ['Alice', 'Bob', 'Charlie'],
         'Age': [24, 27, 22]}
df1 = pd.DataFrame(data1)

data2 = {'Name': ['Alice', 'Bob', 'David'],
         'City': ['New York', 'Los Angeles', 'Houston']}
df2 = pd.DataFrame(data2)

# Merging the DataFrames on the 'Name' column
merged_df = pd.merge(df1, df2, on='Name', how='inner')
print(merged_df)
    Name  Age         City
0  Alice   24     New York
1    Bob   27  Los Angeles

In this example, we performed an inner join on the Name column, returning only the rows with matching values in both DataFrames. You can change the how parameter to perform different types of joins:

  • how='inner' (default) returns only the intersection of the keys.
  • how='outer' returns all keys from both DataFrames.
  • how='left' returns all keys from the left DataFrame and matches them with the right.
  • how='right' returns all keys from the right DataFrame and matches them with the left.
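To see the difference in practice, here is the same merge with how='outer': unmatched rows from both sides are kept, and their missing columns are filled with NaN. A short sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                    'Age': [24, 27, 22]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'David'],
                    'City': ['New York', 'Los Angeles', 'Houston']})

outer_df = pd.merge(df1, df2, on='Name', how='outer')
print(sorted(outer_df['Name']))      # all four names appear
print(outer_df['Age'].isna().sum())  # David has no Age, so one NaN
```

Charlie appears with a missing City and David with a missing Age, which is exactly the information an inner join would have discarded.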

Concatenating DataFrames with concat()

The concat() function is used to concatenate DataFrames along a particular axis (rows or columns).

# Concatenating two DataFrames along the rows (axis=0)
df3 = pd.DataFrame({'Name': ['Eve', 'Frank'],
                    'Age': [29, 34],
                    'City': ['Seattle', 'Boston']})

concatenated_df = pd.concat([df, df3], ignore_index=True)
print(concatenated_df)
      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston
4      Eve   29      Seattle
5    Frank   34       Boston

This example shows how to concatenate two DataFrames by stacking them on top of each other.

14.5 Reading and Writing Data

One of the most common tasks in data analysis is reading data from files and writing data to them. Pandas makes it easy to handle these operations with its built-in functions.

14.5.1 Reading Data from Files

Pandas supports reading data from various file formats, such as CSV, Excel, JSON, and more. The most commonly used format is CSV (Comma-Separated Values). You can read a CSV file into a DataFrame using the pd.read_csv() function:

df = pd.read_csv('data.csv')

This command reads the contents of the data.csv file and stores it as a DataFrame in the variable df. Pandas automatically infers the data types and the structure of the file.

14.5.2 Writing Data to Files

Similarly, you can write a DataFrame to a file using the to_csv() method. This is useful for saving your data after performing analysis or making modifications.

df.to_csv('output.csv', index=False)

The parameter index=False ensures that the index is not included in the saved file.
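A quick way to verify the round trip without touching the filesystem is to use an in-memory buffer in place of a file path; both to_csv() and read_csv() accept file-like objects:

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Anna', 'Ben'], 'Score': [88, 92]})

buffer = io.StringIO()             # in-memory stand-in for a file on disk
df.to_csv(buffer, index=False)     # index=False omits the row labels
buffer.seek(0)

restored = pd.read_csv(buffer)     # read the CSV text back into a DataFrame
print(restored.equals(df))         # True: the round trip preserves the data
```

Had the DataFrame been written with index=True (the default), the index would come back as an extra unnamed column on re-reading, which is why index=False is usually what you want.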

14.6 Exercises

Exercise 1: Creating and Manipulating Series

Create a Pandas Series from the following list of temperatures (in Celsius): [22, 25, 23, 20, 27, None, 28, 24].

A. Print the Series.

B. Identify any missing data.

C. Replace the missing data with the average of the other temperatures.

Exercise 2: Creating and Manipulating Series

Create a Series of temperatures using the list [22, 25, 23, 22, 25, 28, 24, 22].

A. Identify and remove any duplicate values.

B. Sort the Series in descending order.

Exercise 3: Creating and Manipulating Series

Perform the following calculations on the Series you created in Exercise 2:

A. Multiply each temperature by 9/5 and then add 32 to convert the values from Celsius to Fahrenheit.

B. Find the mean, median, and standard deviation of the converted temperatures.

Exercise 4: Working with DataFrames

Create a DataFrame from the following dictionary:

data = {
   'Name': ['Alice', 'Bob', 'Charlie', 'David'],
   'Age': [24, None, 22, 32],
   'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

A. Print the DataFrame.

B. Identify and remove any rows with missing values.

C. Fill in the missing values in the Age column with the average age.

Exercise 5: Working with DataFrames

Using the same DataFrame from Exercise 4, perform the following tasks:

A. Sort the DataFrame by the Age column in ascending order.

B. Filter the DataFrame to include only rows where the Age is greater than 23.

Exercise 6: Working with DataFrames

Create two new DataFrames:

  • DataFrame 1:

    data1 = {'Name': ['Alice', 'Bob', 'Charlie'],
             'Age': [24, 27, 22]}
  • DataFrame 2:

    data2 = {'Name': ['Alice', 'Charlie', 'David'],
             'City': ['New York', 'Chicago', 'Houston']}

A. Merge these two DataFrames on the Name column using an inner join.

B. Perform an outer join on the same DataFrames and observe the differences.

Exercise 7: Sorting, Filtering, and Merging

Create a DataFrame using the following data:

data = {
    'Product': ['Apples', 'Bananas', 'Cherries', 'Dates'],
    'Price': [3, 1, 4, 2],
    'Quantity': [5, 7, 6, 3]
}

A. Sort the DataFrame by Price in ascending order.

B. Filter the DataFrame to include only products with a Quantity greater than 4.

C. Add a new column to the DataFrame called Total Cost that calculates the total cost as Price * Quantity.

Exercise 8: Sorting, Filtering, and Merging

Create a new DataFrame with the following data:

data = {
    'Product': ['Apples', 'Cherries', 'Dates'],
    'Supplier': ['Supplier A', 'Supplier B', 'Supplier C']
}

A. Merge this DataFrame with the DataFrame from Exercise 7 using the Product column as the key.

B. Experiment with different types of joins (inner, outer, left, and right) and note the changes in the resulting DataFrame.

Exercise 9: Reading and Writing Data

Save the following DataFrame to a CSV file named students.csv:

data = {
    'Name': ['Anna', 'Ben', 'Cathy', 'Dan'],
    'Score': [88, 92, 85, 90]
}

A. Write the DataFrame to the CSV file, ensuring that the index is not included in the file.

B. Read the data back into a DataFrame and display its contents.