20  Web Scraping

20.1 Introduction to Web Scraping

Web scraping, or data extraction from websites, is a powerful technique that allows users to programmatically collect data from web pages and use it for a range of applications, from statistical analysis to real-time data-driven applications. This process enables us to take unstructured data in HTML format and transform it into a structured form that is suitable for further analysis and computation in Python.

Web scraping is especially valuable when data is not readily available via a standard API (Application Programming Interface), or if the data exists only on web pages. By automating the retrieval and organization of this information, we can work more efficiently and even capture data over time to analyze trends and patterns.

20.1.1 Applications of Web Scraping

Web scraping is used across various fields, including:

  • Market Research and Competitive Analysis: Organizations monitor competitors’ websites to gather data on pricing, product availability, customer reviews, and more.
  • Real-Time Data Gathering: For example, scraping financial or weather data in real time allows businesses to react to market changes or environmental conditions promptly.
  • Academic Research: Researchers use web scraping to collect large datasets from scientific journals, social media, or news sites for data-driven studies.
  • Data Aggregation: Many websites, such as travel aggregators, rely on scraping data from various sources to provide users with comparative data on flight prices, hotel rates, etc.

20.1.2 Core Libraries for Web Scraping

Python provides a range of libraries for web scraping, but two foundational tools often used together are requests and BeautifulSoup.

  • requests: This library simplifies the process of sending HTTP requests, such as GET requests, to retrieve HTML data from a specified URL. By providing a simple interface, requests handles much of the complexity associated with connecting to web servers.

  • BeautifulSoup: A parsing library built specifically for handling HTML and XML files. BeautifulSoup allows us to navigate the structure of an HTML document, search for elements by tags or attributes, and extract content.

By combining requests to access the HTML and BeautifulSoup to parse it, we can build powerful tools for extracting data from a variety of websites. These libraries, along with basic knowledge of HTML, enable us to automate data extraction for a wide range of applications.

20.1.3 Web Scraping Process Overview

Web scraping typically involves the following steps:

  1. Identify the Target Data: Determine which website and specific elements (such as tables, links, images) contain the data of interest. Familiarizing yourself with the website’s HTML structure helps make the extraction process more efficient.

  2. Send a Request to Retrieve HTML Content: Use requests to access the webpage and download its HTML content. A status code of 200 (OK) indicates a successful request.

  3. Parse the HTML: With BeautifulSoup, parse the retrieved HTML. This allows you to locate and extract the content within specific tags or classes.

  4. Data Cleaning and Storage: After extraction, clean and organize the data for analysis. This often involves formatting the data, handling missing values, and saving it to a structured format like a CSV file or a database.
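
The following minimal sketch ties these four steps together using the requests and BeautifulSoup libraries introduced above. The URL, the table being scraped, and the output filename are placeholders for illustration rather than a real data source.

import csv
import requests
from bs4 import BeautifulSoup

# Step 1: a hypothetical page whose table rows contain the data of interest
url = "https://example.com/products"

# Step 2: retrieve the HTML and confirm the request succeeded
response = requests.get(url, timeout=5)
if response.status_code == 200:
    # Step 3: parse the HTML and extract the text of each table cell
    soup = BeautifulSoup(response.text, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.text.strip() for td in tr.find_all("td")]
        if cells:
            rows.append(cells)

    # Step 4: save the cleaned rows to a structured CSV file
    with open("products.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)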

20.2 HTML Basics

To effectively perform web scraping, it is essential to understand the structure of HTML (HyperText Markup Language), the core language that defines the content and layout of web pages. HTML organizes content using a system of nested elements called tags, each with specific purposes and attributes. Recognizing the role of these tags helps us locate and extract the desired data with precision.

20.2.1 Key Components of HTML

HTML documents consist of several fundamental components, including elements, attributes, and text content. Understanding these components provides the foundation needed for successful web scraping.

  • Elements: HTML elements are the building blocks of a webpage, defined by tags. A tag indicates the start of an element (e.g., <p> for a paragraph) and often requires a closing tag (e.g., </p>).

    <p>This is a paragraph.</p>

    Common tags include <h1> through <h6> for headings, <p> for paragraphs, <a> for hyperlinks, and <div> for division or grouping of content.

  • Attributes: HTML attributes provide additional information about an element, often helping uniquely identify or style specific elements. Attributes appear within the opening tag of an element.

    <p class="intro">This paragraph has a class attribute.</p>

    In this example, class="intro" is an attribute that assigns a class to the paragraph. The class and id attributes are particularly useful in web scraping because they allow us to precisely locate elements in a webpage’s HTML structure.

  • Text Content: The content between the opening and closing tags of an element is referred to as text content. This is the primary information that we aim to extract in web scraping.

    <h1>Web Scraping Basics</h1>

    Here, “Web Scraping Basics” is the text content within the <h1> tag.

20.2.2 Common HTML Tags for Web Scraping

Some tags are more commonly encountered when scraping for specific types of data:

  • Headings (<h1>, <h2>, etc.): Used for titles and section headers.

  • Paragraphs (<p>): Often contain text content such as descriptions, summaries, or main body text.

  • Links (<a>): Defined by the <a> tag, links often have the href attribute, which provides the URL to the linked page. Links are useful for scraping URLs to additional pages or resources.

    <a href="https://example.com">Example Link</a>
  • Images (<img>): The <img> tag contains an src attribute pointing to the image’s URL. Scraping images involves collecting these URLs for further processing or display.

    <img src="https://example.com/image.jpg" alt="Example Image">
  • Tables (<table>, <tr>, <td>): Tables are used to organize data in rows and columns, with <tr> tags representing rows and <td> tags representing individual cells. Tables are common for structured data, such as product listings, schedules, or datasets displayed in tabular format.

    <table>
      <tr>
        <td>Row 1, Cell 1</td>
        <td>Row 1, Cell 2</td>
      </tr>
      <tr>
        <td>Row 2, Cell 1</td>
        <td>Row 2, Cell 2</td>
      </tr>
    </table>

20.2.3 Using Attributes for Targeted Extraction

In web scraping, attributes like id and class are invaluable for pinpointing specific elements, especially when scraping data from complex or crowded pages.

  • class Attribute: Often used to apply styles to groups of elements, the class attribute can also serve as an identifier when multiple elements share the same styling.

    <p class="info">This is a paragraph with the class 'info'.</p>
  • id Attribute: Unlike class, an id attribute is unique to a single element on the page, making it ideal for pinpointing a specific element.

    <div id="main-content">This is the main content area.</div>
  • Combining Tags and Attributes: In practice, you can target specific elements by using both the tag name and attributes in combination. For example, with BeautifulSoup, locating an element with a specific class or id is straightforward:

    content = soup.find("div", id="main-content")

By understanding and utilizing these HTML elements and attributes, we can refine our web scraping techniques to extract data accurately and efficiently.

20.3 Sending HTTP Requests with requests

To extract data from a web page, the first step in web scraping is to retrieve the page’s HTML content. This process involves sending an HTTP request to the server hosting the page, which, in response, sends back the HTML structure and content. The Python library requests makes this process straightforward by providing a clean interface for sending HTTP requests, retrieving responses, and handling various request types.

The HTTP protocol supports several types of requests, with the two most common being:

  • GET requests: Retrieve data from a specified resource. In web scraping, GET requests are used to fetch HTML pages from web servers.
  • POST requests: Send data to a server to create or update a resource. While commonly used in forms and other data-submission processes, POST requests are less common in basic web scraping.

In this section, we focus on GET requests, as they are typically used for retrieving static HTML content for scraping.

20.3.1 Basic Syntax for Sending a GET Request

The requests library makes it easy to send an HTTP GET request. Here’s the basic syntax:

import requests

url = "https://example.com"
response = requests.get(url)

In this example:

  • url is the target webpage’s address.
  • requests.get(url) sends a GET request to the URL, retrieving the HTML content.
  • response stores the server’s response, including the HTML.

After sending the request, you can inspect various components of the response object, such as the HTML content, response status, and response headers.
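
For example, you can print a few of these components directly; the exact header values depend on the server:

print(response.status_code)               # e.g., 200
print(response.headers["Content-Type"])   # e.g., text/html; charset=UTF-8
print(response.text[:100])                # first 100 characters of the HTML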

20.3.2 Inspecting the Response Status Code

When we send a request, the server responds with a status code indicating whether the request was successful. Common HTTP status codes include:

  • 200 (OK): The request was successful, and the server returned the requested content.
  • 404 (Not Found): The server could not find the requested resource.
  • 500 (Internal Server Error): The server encountered an error.

Checking the status code helps determine whether the request was successful before proceeding with HTML parsing. You can access the status code using the .status_code attribute of the response object:

if response.status_code == 200:
    print("Request successful!")
else:
    print("Failed to retrieve the page. Status code:", response.status_code)
Request successful!

20.3.3 Retrieving HTML Content from the Response

If the status code indicates success (200), you can proceed to extract the HTML content using response.text, which provides the HTML as a string:

html_content = response.text
print(html_content)
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

Alternatively, response.content provides the raw byte data, which is useful if you’re working with non-HTML files (e.g., images or PDFs). For standard HTML scraping, however, response.text is typically more convenient.
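
For instance, the short sketch below downloads an image and writes the raw bytes to a local file; the image URL and output filename are placeholders:

image_url = "https://example.com/image.jpg"
image_response = requests.get(image_url, timeout=5)

if image_response.status_code == 200:
    # "wb" opens the file in binary mode so the raw bytes are written unchanged
    with open("image.jpg", "wb") as f:
        f.write(image_response.content)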

20.3.4 Handling Request Headers and User Agents

Some websites use mechanisms to detect automated requests and may block them if they suspect scraping activity. One common strategy to avoid this is to modify the user-agent in the request header, which indicates the type of browser or client making the request.

The requests library allows you to add headers, including a custom user-agent, to mimic a legitimate browser:

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}
response = requests.get(url, headers=headers)

In this example:

  • headers defines a dictionary with the User-Agent key, mimicking a browser like Chrome.
  • requests.get(url, headers=headers) sends a GET request with the custom header.

Adding headers can help bypass simple anti-scraping measures on some websites. However, it’s crucial to be mindful of a website’s robots.txt file and terms of service to ensure responsible scraping.
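
One way to check a site’s crawling rules programmatically is the urllib.robotparser module from the standard library. The sketch below assumes the site publishes a robots.txt file at the usual location:

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

# can_fetch() reports whether the given user agent may request the URL
if robots.can_fetch("*", "https://example.com/some-page"):
    print("robots.txt allows scraping this page.")
else:
    print("robots.txt disallows scraping this page.")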

20.3.5 Error Handling in Requests

When sending multiple requests in a scraping script, it’s essential to handle potential errors, such as timeouts or connection issues. The requests library provides tools to manage these exceptions.

  • Timeouts: Specify a timeout to prevent the request from hanging indefinitely. If the request takes longer than the specified timeout, it raises a Timeout exception:

    try:
        response = requests.get(url, timeout=5)
    except requests.Timeout:
        print("The request timed out.")
  • Connection Errors: If there’s an issue with the internet connection or server, requests raises a ConnectionError. Handling this allows the script to continue or retry:

    try:
        response = requests.get(url)
    except requests.ConnectionError:
        print("Failed to connect to the server.")

These error-handling strategies improve the reliability of scraping scripts, especially when working with numerous pages.
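
One simple pattern is to retry a failed request a few times before giving up, catching both exception types shown above. The number of attempts and the pause between them are arbitrary choices here:

import time

max_attempts = 3
response = None

for attempt in range(max_attempts):
    try:
        response = requests.get(url, timeout=5)
        break  # success: stop retrying
    except (requests.Timeout, requests.ConnectionError):
        print(f"Attempt {attempt + 1} failed; retrying...")
        time.sleep(2)  # brief pause before the next attempt

if response is None:
    print("All attempts failed.")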

20.3.6 Example: Retrieving HTML with Requests

The following example demonstrates a complete workflow for retrieving HTML content from a website:

import requests

# Define the target URL and headers
url = "https://example.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}

try:
    # Send a GET request with a timeout and headers
    response = requests.get(url, headers=headers, timeout=5)
    
    # Check if the request was successful
    if response.status_code == 200:
        print("Successfully retrieved the page.")
        html_content = response.text  # Extract the HTML content
    else:
        print("Failed to retrieve the page. Status code:", response.status_code)
except requests.Timeout:
    print("The request timed out.")
except requests.ConnectionError:
    print("Connection error occurred.")
Successfully retrieved the page.

This example combines the elements discussed above: setting headers, managing timeouts, checking the status code, and handling errors. Such a setup is foundational for a robust scraping process.

20.3.7 Advanced Requests: POST

While most web scraping involves GET requests, POST requests are occasionally required. This occurs when interacting with forms or websites requiring data submission (e.g., login forms or search queries). To send a POST request with requests, you provide data as a dictionary using the data parameter:

url = "https://example.com/login"
data = {"username": "user", "password": "pass"}

response = requests.post(url, data=data)

The server processes the submitted data and typically responds with a page reflecting the action taken (e.g., a logged-in user page).
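
If the logged-in state must persist across later requests, a requests.Session object can be used so that cookies set by the server are sent automatically on subsequent requests. The URLs and form fields below are placeholders:

# A Session keeps cookies (such as a login token) across requests
with requests.Session() as session:
    login_data = {"username": "user", "password": "pass"}
    session.post("https://example.com/login", data=login_data)

    # Later requests on the same session include the stored cookies
    profile_page = session.get("https://example.com/profile")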

20.4 Parsing HTML with BeautifulSoup

After retrieving a webpage’s HTML content, the next step is to parse and navigate the HTML structure to extract relevant data. The Python library BeautifulSoup, part of the bs4 package, provides a straightforward interface for parsing HTML and XML documents, allowing us to locate and manipulate elements with ease. BeautifulSoup works by converting the HTML into a parse tree structure, which mirrors the nested organization of the HTML’s Document Object Model (DOM).

In this section, we cover the essentials of using BeautifulSoup, including initializing a BeautifulSoup object, locating elements by tags, and using attributes and class names for targeted data extraction.

20.4.1 Setting Up BeautifulSoup

To use BeautifulSoup, you must import it from the bs4 package and initialize it with the HTML content obtained through requests. The BeautifulSoup constructor requires two parameters: the HTML content (usually as a string) and the parser type. The most commonly used parser is html.parser, a built-in Python parser, but lxml and html5lib are also popular for more complex HTML structures.

from bs4 import BeautifulSoup

# Sample HTML content
html_content = "<html><body><h1>Hello, World!</h1></body></html>"

# Initialize a BeautifulSoup object
soup = BeautifulSoup(html_content, "html.parser")

Here, soup becomes the BeautifulSoup object representing the HTML document. We can now navigate through soup to locate specific elements and extract data.

20.4.2 Basic Navigation Methods

BeautifulSoup provides several methods for navigating and selecting elements:

  • .find(): Returns the first instance of a specified tag.
  • .find_all(): Returns a list of all instances of a specified tag.
  • .select(): Allows CSS selector-based searches for more complex selections.

Each of these methods provides different levels of control, depending on whether you need a single element or multiple matching elements.

20.4.3 Locating Elements by Tag Name

The simplest way to extract data is by locating elements using their tag name. For example, to retrieve the first <h1> element in an HTML document, you can use .find():

# Extract the first h1 element
heading = soup.find("h1")
print(heading.text)  
Hello, World!

.find_all() works similarly but returns a list of all matching elements. This is useful when multiple elements share the same tag, such as a series of paragraphs:

# Sample HTML with multiple paragraphs
html_content = """
<html>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""
soup = BeautifulSoup(html_content, "html.parser")
paragraphs = soup.find_all("p")

# Print each paragraph's text content
for p in paragraphs:
    print(p.text)
First paragraph.
Second paragraph.

20.4.4 Using Attributes and Classes for Targeted Extraction

In many cases, HTML tags are assigned specific attributes, such as class or id, to organize or style elements. These attributes are helpful for locating elements that share the same tag but need differentiation based on content or position.

  • Class Attribute: Use the class_ parameter in .find() or .find_all() to locate elements with specific classes.

    # Sample HTML with classes
    html_content = '<p class="intro">Introduction paragraph.</p><p class="info">Information paragraph.</p>'
    soup = BeautifulSoup(html_content, "html.parser")
    
    # Extract paragraph with class "intro"
    intro_paragraph = soup.find("p", class_="intro")
    print(intro_paragraph.text)  
    Introduction paragraph.
  • ID Attribute: Since IDs are unique within an HTML document, you can use .find() with the id parameter to locate a specific element.

    # Sample HTML with an ID
    html_content = '<div id="main-content">Main content area.</div>'
    soup = BeautifulSoup(html_content, "html.parser")
    
    main_content = soup.find("div", id="main-content")
    print(main_content.text)
    Main content area.
  • CSS Selectors: For more complex extractions, use .select() with CSS selectors, allowing combinations of tags, classes, and IDs.

    # Sample HTML with nested elements
    html_content = '<div class="container"><p class="info">Information paragraph.</p></div>'
    soup = BeautifulSoup(html_content, "html.parser")
    
    # Use a CSS selector to find the paragraph inside the div with class "container"
    info_paragraph = soup.select(".container .info")[0]
    print(info_paragraph.text)
    Information paragraph.

20.4.5 Extracting Text Content

Once the desired element is located, .text retrieves the text content between the tags, stripping away any nested HTML tags and leaving only the inner text.

# Sample HTML
html_content = "<h1>Welcome to Web Scraping</h1>"
soup = BeautifulSoup(html_content, "html.parser")

# Extract the text from the h1 element
heading = soup.find("h1").text
print(heading)
Welcome to Web Scraping

Alternatively, .get_text() offers the same result and can take optional parameters to modify the output, such as removing extra whitespace.
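
For instance, .get_text() accepts a separator string and a strip flag, which can tidy text gathered from nested elements; the HTML below is a small made-up snippet:

html_content = "<p>  Hello,   <b>World</b>!  </p>"
soup = BeautifulSoup(html_content, "html.parser")

# separator joins the text of nested elements; strip=True trims whitespace
print(soup.find("p").get_text(separator=" ", strip=True))
Hello, World !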

20.4.6 Extracting Attribute Values

In addition to text content, HTML elements often contain attributes (e.g., URLs in <a> tags). To retrieve an attribute’s value, access it like a dictionary:

# Sample HTML with an anchor tag
html_content = '<a href="https://example.com">Example Link</a>'
soup = BeautifulSoup(html_content, "html.parser")

# Extract the URL from the href attribute
link = soup.find("a")
print(link["href"]) 
https://example.com

This technique is particularly useful for gathering links, image sources, or metadata associated with elements.
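
For example, the loop below gathers every link URL on a page, using .get("href") so that anchors without an href attribute return None instead of raising an error; the HTML is a small made-up snippet:

html_content = """
<a href="https://example.com/page1">Page 1</a>
<a href="https://example.com/page2">Page 2</a>
<a>No destination</a>
"""
soup = BeautifulSoup(html_content, "html.parser")

# .get() returns None when the attribute is missing, so such links are skipped
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
print(links)
['https://example.com/page1', 'https://example.com/page2']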

20.4.7 Practical Example: Extracting a List of Headings

To demonstrate the functionality of BeautifulSoup, here’s a practical example where we extract a list of headings from the Wikipedia home page:

import requests
from bs4 import BeautifulSoup

# Target URL
url = "https://en.wikipedia.org/wiki/Main_Page"

# Send a GET request and parse the HTML
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find all headlines in h2 tags with the class "mp-h2"
headings = soup.find_all("h2", class_="mp-h2")

# Print each headline text
for heading in headings:
    print(heading.text)
From today's featured article
Did you know ...
In the news
On this day
Today's featured picture
Other areas of Wikipedia
Wikipedia's sister projects
Wikipedia languages

In this example:

  • requests.get(url) retrieves the HTML.
  • soup.find_all("h2", class_="mp-h2") locates all <h2> tags with the class mp-h2.

20.4.8 Handling Nested Elements

Sometimes, desired data is nested within multiple tags. BeautifulSoup allows chaining .find() or .find_all() calls to navigate nested structures.

For example:

# Sample HTML with nested elements
html_content = """
<div class="outer">
    <div class="inner">
        <p>Nested paragraph content.</p>
    </div>
</div>
"""
soup = BeautifulSoup(html_content, "html.parser")

# Locate the nested paragraph within inner div
nested_paragraph = soup.find("div", class_="outer").find("div", class_="inner").find("p")
print(nested_paragraph.text)  
Nested paragraph content.

By chaining .find() calls, we navigate through div.outer to div.inner and finally to the <p> tag containing the text.

20.4.9 Working with HTML Tables

Tables often contain structured data that can be parsed directly. Using .find_all("tr"), you can iterate through each row (<tr>) and extract data from cells (<td>).

# Sample HTML table
html_content = """
<table>
  <tr><td>Row 1, Cell 1</td><td>Row 1, Cell 2</td></tr>
  <tr><td>Row 2, Cell 1</td><td>Row 2, Cell 2</td></tr>
</table>
"""
soup = BeautifulSoup(html_content, "html.parser")

# Parse table rows and cells
rows = soup.find_all("tr")
for row in rows:
    cells = row.find_all("td")
    row_data = [cell.text for cell in cells]
    print(row_data)
['Row 1, Cell 1', 'Row 1, Cell 2']
['Row 2, Cell 1', 'Row 2, Cell 2']
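
When the table should feed directly into analysis, the parsed cells can also be loaded into a pandas DataFrame, assuming pandas is installed; the column names here are arbitrary labels:

import pandas as pd

# Collect the cell text of every row into a list of lists
table_data = []
for row in soup.find_all("tr"):
    table_data.append([cell.text for cell in row.find_all("td")])

# Build a DataFrame with illustrative column names
df = pd.DataFrame(table_data, columns=["Column 1", "Column 2"])

From here, the usual pandas tools for cleaning, filtering, and exporting (for example, df.to_csv()) can be applied.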

20.5 Exercises

Exercise 1: Basic Parsing and Text Extraction

  • Write a script that retrieves the title text (<h1> tag) from a sample HTML string:

    <html><body><h1>Sample Heading</h1></body></html>
  • Modify the script to print the title’s text content.

Exercise 2: Extracting Multiple Paragraphs

  • Given the following HTML content, use BeautifulSoup to extract and print the text from each paragraph:

    <html>
      <body>
        <p>Paragraph 1 content.</p>
        <p>Paragraph 2 content.</p>
        <p>Paragraph 3 content.</p>
      </body>
    </html>

Exercise 3: Attribute-Based Extraction

  • Given the HTML snippet below, use BeautifulSoup to find and print the text of the paragraph with the class attribute highlight:

    <html>
      <body>
        <p class="highlight">Highlighted text.</p>
        <p>Regular text.</p>
      </body>
    </html>

Exercise 4: Nested Element Extraction

  • Given the following nested HTML structure, use BeautifulSoup to retrieve the text within the innermost paragraph:

    <html>
      <body>
        <div class="outer">
          <div class="inner">
            <p>Deeply nested content.</p>
          </div>
        </div>
      </body>
    </html>

Exercise 5: Parsing Tables

  • Given the HTML table structure below, extract the contents of each cell in each row and print them as lists:

    <table>
      <tr><td>Row 1, Cell 1</td><td>Row 1, Cell 2</td></tr>
      <tr><td>Row 2, Cell 1</td><td>Row 2, Cell 2</td></tr>
    </table>

Exercise 6: Applying CSS Selectors

  • Using BeautifulSoup and CSS selectors, extract the text from all <p> tags with the class attribute info from the following HTML:

    <html>
      <body>
        <p class="info">Info 1</p>
        <p class="info">Info 2</p>
        <p>Other paragraph</p>
      </body>
    </html>