22  Using SQL with Python

22.1 Introduction

Structured Query Language (SQL) is the standard language for managing and manipulating relational databases. A relational database is a collection of data organized into tables, where each table consists of rows (records) and columns (attributes). This structured approach ensures data consistency and enables powerful querying capabilities, meaning the ability to retrieve specific, relevant data efficiently and effectively. Querying capabilities allow users to ask detailed questions about the data, such as finding all students in a particular major or calculating the average age of students. These capabilities are enhanced by establishing relationships between tables through unique identifiers known as keys, which link related information across different tables. For example, a university database might have separate tables for students and courses, with a relationship defined by a common student ID.

Python’s versatility allows it to interface seamlessly with SQL, providing powerful tools for querying, analyzing, and manipulating data. This chapter explores how Python interacts with databases, focusing on the sqlite3 module for SQLite databases, which are lightweight, self-contained, and require no additional server setup.

22.2 Why Use SQL with Python?

Python is adept at data manipulation using libraries like Pandas and NumPy. However, as datasets grow in size and complexity, organizing data in a relational database and using SQL to query it becomes increasingly efficient. Combining SQL and Python has the following advantages:

  1. Data Handling

    Python excels at preprocessing and visualizing data using libraries like Pandas and Matplotlib. However, SQL offers unparalleled capabilities for data retrieval and management. SQL’s structured querying system enables efficient access to specific pieces of information from large datasets. For example, a user can retrieve all records of students enrolled in a specific course without loading the entire dataset into memory.

  2. Efficiency

    SQL is designed to handle large datasets with optimized query execution plans. Traditional Python loops may require significant computational resources when working with extensive data, but SQL queries leverage database indexing and optimization techniques to deliver faster results. This efficiency is particularly beneficial when filtering, sorting, or aggregating data.

  3. Scalability

    Relational databases are built to manage vast amounts of data while maintaining performance. Unlike in-memory data structures like Python lists or dictionaries, databases allow storage and retrieval of millions of records with minimal latency. This scalability ensures that data-intensive applications can grow without compromising their efficiency or integrity.

22.3 Setting Up SQLite in Python

SQLite is a serverless database engine that comes bundled with Python. Using SQLite, we can create, query, and manage databases without additional installation. While this chapter focuses on SQLite due to its simplicity and ease of use, Python also supports other SQL options, such as MySQL, PostgreSQL, and Microsoft SQL Server. These databases are more suitable for larger-scale applications and multi-user environments but often require additional setup and server management. By understanding SQLite, you will build a strong foundation for working with more complex SQL databases in Python. Below is an example workflow:

22.3.1 Connecting to a Database

To begin, establish a connection to a database. SQLite is a lightweight, serverless database engine, which means it does not require a separate server process. If the specified database file does not already exist, SQLite creates it automatically upon connection. This makes SQLite ideal for testing and small-scale applications where simplicity and minimal configuration are priorities. For example, a database named example.db can be created and accessed seamlessly using Python’s sqlite3 module.

import sqlite3

# Connect to the database (or create it if it does not exist)
connection = sqlite3.connect('example.db')

# Create a cursor object to execute SQL commands
cursor = connection.cursor()

22.3.2 Creating Tables

SQL tables provide structured storage for data by organizing information into rows and columns. Each table represents a specific entity, such as students or courses, with rows corresponding to individual records and columns defining attributes of those records (e.g., ID, name, age, and major for students). This structure allows for efficient querying and ensures data consistency by enforcing data types and constraints. Let’s create a table for storing student information:

# Define an SQL command to create a table
create_table_query = """
CREATE TABLE IF NOT EXISTS students (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    age INTEGER,
    major TEXT
);
"""

# Execute the SQL command
cursor.execute(create_table_query)

Let’s break down the different parts of the create_table_query:

  • CREATE TABLE IF NOT EXISTS: This command tells SQLite to create a new table only if one with the specified name does not already exist. This prevents errors from attempting to create a duplicate table.

  • students: This is the name of the table. In this case, it represents the entity “students.”

  • id INTEGER PRIMARY KEY: The id column is defined as an integer data type and marked as the primary key. The primary key ensures that each row in the table has a unique identifier, which cannot be null.

  • name TEXT NOT NULL: The name column is defined as a text field that cannot contain null values. This ensures that every student in the table has a name.

  • age INTEGER: The age column is defined as an integer, suitable for storing numeric values like a student’s age. Unlike name, it does not include the NOT NULL constraint, meaning it can be left empty.

  • major TEXT: The major column is defined as a text field for storing the student’s major. It is also optional, as it does not include the NOT NULL constraint.

Alternate Data Types and Constraints

Instead of INTEGER or TEXT, other data types like REAL (for floating-point numbers) or BLOB (for binary data like images) could be used depending on the requirements. Additional constraints such as UNIQUE (to ensure no duplicate values), DEFAULT (to set a default value for a column), or CHECK (to enforce conditions on data) can also be added to enhance data integrity.

22.3.3 Inserting Data

The INSERT INTO statement is used to insert new rows of data into an existing table. In the example below, we are adding a single record to the students table. Let’s break down the different parts of the code:

  • INSERT INTO students (name, age, major): This specifies the table (students) and the columns (name, age, major) where the new data will be added. The column names must match the structure defined during the table creation.

  • VALUES ('John Doe', 20, 'Mathematics'): This section provides the actual values to insert into the corresponding columns. Here, ‘John Doe’ is a string representing the name, 20 is an integer for the age, and ‘Mathematics’ is the major, stored as text.

  • cursor.execute(insert_query): The execute method runs the SQL command. The insert_query variable holds the complete SQL statement.

  • connection.commit(): This ensures that the changes are saved to the database. Without this step, the data would not be permanently stored.

# Insert a single record
insert_query = """
INSERT INTO students (name, age, major)
VALUES ('John Doe', 20, 'Mathematics');
"""
cursor.execute(insert_query)

# Commit changes to the database
connection.commit()

Alternate Methods for Inserting Data

You can also use parameterized queries to avoid hardcoding values and reduce the risk of security vulnerabilities. For example:

name, age, major = 'Jane Smith', 22, 'Physics'
insert_query = "INSERT INTO students (name, age, major) VALUES (?, ?, ?);"
cursor.execute(insert_query, (name, age, major))
connection.commit()

This method is dynamic and safer, especially when working with user-provided data.

22.3.4 Querying Data

The SELECT statement is one of the most commonly used SQL commands for retrieving data from a table. Let’s break down the different parts of the example below.

  • select_query = "SELECT * FROM students;": This SQL query retrieves all columns (*) and rows from the students table. The * is a wildcard that signifies selecting all available fields. For more specific queries, you can replace * with column names (e.g., SELECT name, age FROM students;).

  • cursor.execute(select_query): The execute method runs the SQL query against the connected database. This step sends the query string to the database for processing.

  • results = cursor.fetchall(): The fetchall method retrieves all the rows returned by the query as a list of tuples. Each tuple represents a row in the table, with each element corresponding to a column.

  • for row in results:: This loop iterates over the result set to process and display each row.

# Retrieve all students from the database
select_query = "SELECT * FROM students;"
cursor.execute(select_query)

# Fetch and display results
results = cursor.fetchall()
for row in results:
    print(row)

Filtering and Sorting Data

You can enhance the SELECT query to filter or sort results:

  • Filtering: Use a WHERE clause to specify conditions, such as SELECT * FROM students WHERE age > 20; to retrieve students older than 20.

  • Sorting: Use an ORDER BY clause, such as SELECT * FROM students ORDER BY age ASC; to sort the results by age in ascending order.

These enhancements make queries more targeted and efficient for specific analytical tasks.

22.4 Updating and Deleting

UPDATE

The UPDATE command modifies existing rows in a table. For example:

UPDATE students 
SET major = 'Statistics' 
WHERE name = 'Alice';

This changes the major of the student named Alice to ‘Statistics’. The WHERE clause ensures that only rows meeting the condition are updated.

Example:

update_query = """
UPDATE students
SET major = 'Statistics'
WHERE name = 'John Doe';
"""
cursor.execute(update_query)
connection.commit()

DELETE

The DELETE command removes rows from a table. For example:

DELETE FROM students WHERE name = 'Alice';

This deletes the record for the student named Alice. Without a WHERE clause, all rows in the table would be deleted, so it’s essential to use this command carefully.

Example:

delete_query = """
DELETE FROM students WHERE name = 'John Doe';
"""
cursor.execute(delete_query)
connection.commit()

22.5 Using SQL with Pandas

Pandas integrates with SQL for advanced data analysis. To load data from a database into a Pandas DataFrame:

import pandas as pd

# Query the database and load results into a DataFrame
query = "SELECT * FROM students;"
df = pd.read_sql_query(query, connection)
print(df.head())

Similarly, you can export a DataFrame to an SQL table:

data = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [22, 23],
    'major': ['Computer Science', 'Physics']
})

data.to_sql('students', connection, if_exists='append', index=False)

22.6 Best Practices for Using SQL with Python

  • Parameterized Queries: Prevent SQL security vulnerabilities by using placeholders in queries:
name = 'Alice'
query = "SELECT * FROM students WHERE name = ?;"
cursor.execute(query, (name,))
  • Close Connections: Always close the connection to release resources:
connection.close()
  • Error Handling: Use try and except to catch and handle database errors:
try:
    cursor.execute("INVALID SQL COMMAND")
except sqlite3.Error as e:
    print("Error occurred:", e)

22.7 Exercises

Exercise 1: library database

  1. Create a database named library.db with a table books that stores the title, author, and publication year of books.

  2. Insert five records into the books table and retrieve all records where the publication year is after 2000.

  3. Update the author of a specific book and then delete another book from the table.

  4. Use Pandas to load the contents of the books table into a DataFrame, then add a new column for genres and export the updated DataFrame back to the database.