20 Web Scraping
20.1 Introduction to Web Scraping
Web scraping, or data extraction from websites, is a powerful technique that allows users to programmatically collect data from web pages and use it for a range of applications, from statistical analysis to real-time data-driven applications. This process enables us to take unstructured data in HTML format and transform it into a structured form that is suitable for further analysis and computation in Python.
Web scraping is especially valuable when data is not readily available via a standard API (Application Programming Interface), or if the data exists only on web pages. By automating the retrieval and organization of this information, we can work more efficiently and even capture data over time to analyze trends and patterns.
20.1.1 Applications of Web Scraping
Web scraping is used across various fields, including:
- Market Research and Competitive Analysis: Organizations monitor competitors’ websites to gather data on pricing, product availability, customer reviews, and more.
- Real-Time Data Gathering: For example, scraping financial or weather data in real time allows businesses to react to market changes or environmental conditions promptly.
- Academic Research: Researchers use web scraping to collect large datasets from scientific journals, social media, or news sites for data-driven studies.
- Data Aggregation: Many websites, such as travel aggregators, rely on scraping data from various sources to provide users with comparative data on flight prices, hotel rates, etc.
20.1.2 Core Libraries for Web Scraping
Python provides a range of libraries for web scraping, but two foundational tools often used together are `requests` and `BeautifulSoup`.
- `requests`: This library simplifies the process of sending HTTP requests, such as `GET` requests, to retrieve HTML data from a specified URL. By providing a simple interface, `requests` handles much of the complexity associated with connecting to web servers.
- `BeautifulSoup`: A parsing library built specifically for handling HTML and XML files. BeautifulSoup allows us to navigate the structure of an HTML document, search for elements by tags or attributes, and extract content.
By combining `requests` to access the HTML and `BeautifulSoup` to parse it, we can build powerful tools for extracting data from a variety of websites. These libraries, along with basic knowledge of HTML, enable us to automate data extraction for a wide range of applications.
20.1.3 Web Scraping Process Overview
Web scraping typically involves the following steps:
1. Identify the Target Data: Determine which website and which specific elements (such as tables, links, or images) contain the data of interest. Familiarizing yourself with the website's HTML structure helps make the extraction process more efficient.
2. Send a Request to Retrieve HTML Content: Use `requests` to access the webpage and download its HTML structure. A status code of `200` (OK) indicates a successful request.
3. Parse the HTML: With `BeautifulSoup`, parse the retrieved HTML. This allows you to locate and extract the content within specific tags or classes.
4. Data Cleaning and Storage: After extraction, clean and organize the data for analysis. This often involves formatting the data, handling missing values, and saving it to a structured format like a CSV file or a database.
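As a preview, the following minimal sketch walks through all four steps on `https://example.com` (a placeholder target); the `requests` and `BeautifulSoup` calls used here are covered in detail in Sections 20.3 and 20.4:

```python
import csv

import requests
from bs4 import BeautifulSoup

# Step 1: the target data is assumed to be the <h1> heading of example.com
url = "https://example.com"

# Step 2: send a GET request and confirm it succeeded
response = requests.get(url, timeout=5)
if response.status_code == 200:
    # Step 3: parse the HTML and extract the element of interest
    soup = BeautifulSoup(response.text, "html.parser")
    heading = soup.find("h1").text

    # Step 4: save the cleaned result to a structured format (CSV)
    with open("headings.csv", "w", newline="") as f:
        csv.writer(f).writerow([heading])
```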
20.1.4 Legal and Ethical Considerations
While web scraping is powerful, it is essential to consider the ethical and legal implications. Many websites publish guidelines on scraping in a `robots.txt` file, which suggests which areas of a site can or cannot be accessed by automated agents. Some sites may strictly prohibit scraping, while others may limit access to specific data.
Key points to consider:
- Respect `robots.txt`: Review the `robots.txt` file for scraping restrictions; it is available at the root URL, e.g., `https://example.com/robots.txt`.
- Avoid Excessive Requests: Sending too many requests in a short period can overload servers and lead to IP blocking. Introducing pauses between requests helps to avoid this.
- Read the Terms of Service: Many sites' terms of service specify acceptable data usage. Scraping that violates these terms can have legal consequences.
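The first two points can even be automated. As a minimal sketch (the URLs are illustrative), Python's built-in `urllib.robotparser` checks whether a path may be fetched, and `time.sleep()` inserts a polite pause between requests:

```python
import time
from urllib import robotparser

import requests

# Read the site's robots.txt (example.com is a placeholder)
parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    if parser.can_fetch("*", url):  # "*" matches any user agent
        response = requests.get(url, timeout=5)
        time.sleep(1)  # pause between requests to avoid overloading the server
    else:
        print("Disallowed by robots.txt:", url)
```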
20.2 HTML Basics
To effectively perform web scraping, it is essential to understand the structure of HTML (HyperText Markup Language), the core language that defines the content and layout of web pages. HTML organizes content using a system of nested elements called tags, each with specific purposes and attributes. Recognizing the role of these tags helps us locate and extract the desired data with precision.
20.2.1 Key Components of HTML
HTML documents consist of several fundamental components, including elements, attributes, and text content. Understanding these elements provides the foundation needed for successful web scraping.
- Elements: HTML elements are the building blocks of a webpage, defined by tags. A tag indicates the start of an element (e.g., `<p>` for a paragraph) and often requires a closing tag (e.g., `</p>`).

  ```html
  <p>This is a paragraph.</p>
  ```

  Common tags include `<h1>` through `<h6>` for headings, `<p>` for paragraphs, `<a>` for hyperlinks, and `<div>` for division or grouping of content.

- Attributes: HTML attributes provide additional information about an element, often helping uniquely identify or style specific elements. Attributes appear within the opening tag of an element.

  ```html
  <p class="intro">This paragraph has a class attribute.</p>
  ```

  In this example, `class="intro"` is an attribute that assigns a class to the paragraph. The `class` and `id` attributes are particularly useful in web scraping because they allow us to precisely locate elements in a webpage's HTML structure.

- Text Content: The content between the opening and closing tags of an element is referred to as text content. This is the primary information that we aim to extract in web scraping.

  ```html
  <h1>Web Scraping Basics</h1>
  ```

  Here, "Web Scraping Basics" is the text content within the `<h1>` tag.
20.2.4 Using Attributes for Targeted Extraction
In web scraping, attributes like `id` and `class` are invaluable for pinpointing specific elements, especially when scraping data from complex or crowded pages.
- `class` Attribute: Often used to apply styles to groups of elements, the `class` attribute can also serve as an identifier when multiple elements share the same styling.

  ```html
  <p class="info">This is a paragraph with the class 'info'.</p>
  ```

- `id` Attribute: Unlike `class`, an `id` attribute is unique to a single element on the page, making it useful for uniquely identifying specific elements.

  ```html
  <div id="main-content">This is the main content area.</div>
  ```

- Combining Tags and Attributes: In practice, you can target specific elements by using both the tag name and attributes in combination. For example, with BeautifulSoup, locating an element with a specific `class` or `id` is straightforward:

  ```python
  content = soup.find("div", id="main-content")
  ```
By understanding and utilizing these HTML elements and attributes, we can refine our web scraping techniques to extract data accurately and efficiently.
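The one-line `soup.find()` lookup above assumes an existing `soup` object; here is a self-contained sketch of the same idea, using the snippets from this section (BeautifulSoup itself is covered fully in Section 20.4):

```python
from bs4 import BeautifulSoup

html = """
<div id="main-content">
  <p class="info">This is a paragraph with the class 'info'.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("div", id="main-content") is not None)  # locate by id
print(soup.find("p", class_="info").text)               # locate by class
```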
20.3 Sending HTTP Requests with requests
To extract data from a web page, the first step in web scraping is to retrieve the page's HTML content. This process involves sending an HTTP request to the server hosting the page, which, in response, sends back the HTML structure and content. The Python library `requests` makes this process straightforward by providing a clean interface for sending HTTP requests, retrieving responses, and handling various request types.
The HTTP protocol supports several types of requests, with the two most common being:
- GET requests: Retrieve data from a specified resource. In web scraping, `GET` requests are used to fetch HTML pages from web servers.
- POST requests: Send data to a server to create or update a resource. While commonly used in forms and other data-submission processes, `POST` requests are less common in basic web scraping.

In this section, we focus on `GET` requests, as they are typically used for retrieving static HTML content for scraping.
20.3.1 Basic Syntax for Sending a GET Request
The `requests` library makes it easy to send an HTTP GET request. Here's the basic syntax:

```python
import requests

url = "https://example.com"
response = requests.get(url)
```

In this example:
- `url` is the target webpage's address.
- `requests.get(url)` sends a GET request to the URL, retrieving the HTML content.
- `response` stores the server's response, including the HTML.

After sending the request, you can inspect various components of the `response` object, such as the HTML content, response status, and response headers.
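For instance, continuing from the request above, a few of these components can be printed directly; the header field shown is a standard HTTP response header:

```python
print(response.status_code)              # e.g., 200
print(response.headers["Content-Type"])  # e.g., text/html; charset=UTF-8
print(len(response.text))                # length of the HTML string
```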
20.3.2 Inspecting the Response Status Code
When we send a request, the server responds with a status code indicating whether the request was successful. Common HTTP status codes include:
- 200 (OK): The request was successful, and the server returned the requested content.
- 404 (Not Found): The server could not find the requested resource.
- 500 (Internal Server Error): The server encountered an error.
Checking the status code helps determine whether the request was successful before proceeding with HTML parsing. You can access the status code using the `.status_code` attribute of the `response` object:

```python
if response.status_code == 200:
    print("Request successful!")
else:
    print("Failed to retrieve the page. Status code:", response.status_code)
```

```
Request successful!
```
20.3.3 Retrieving HTML Content from the Response
If the status code indicates success (200), you can proceed to extract the HTML content using `response.text`, which provides the HTML as a string:

```python
html_content = response.text
print(html_content)
```
```html
<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
```
Alternatively, `response.content` provides the raw byte data, which is useful if you're working with non-HTML files (e.g., images or PDFs). For standard HTML scraping, however, `response.text` is typically more convenient.
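For example, here is a sketch of downloading a binary file with `response.content` (the PDF URL is a placeholder):

```python
# Write the raw bytes of a (hypothetical) PDF to disk
response = requests.get("https://example.com/report.pdf", timeout=5)
with open("report.pdf", "wb") as f:
    f.write(response.content)
```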
20.3.4 Handling Request Headers and User Agents
Some websites use mechanisms to detect automated requests and may block them if they suspect scraping activity. One common strategy to avoid this is to modify the user-agent in the request header, which indicates the type of browser or client making the request.
The `requests` library allows you to add headers, including a custom user-agent, to mimic a legitimate browser:

```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}
response = requests.get(url, headers=headers)
```
In this example:
- `headers` defines a dictionary with the `User-Agent` key, mimicking a browser like Chrome.
- `requests.get(url, headers=headers)` sends a GET request with the custom header.
Adding headers can help bypass simple anti-scraping measures on some websites. However, it's crucial to be mindful of a website's `robots.txt` file and terms of service to ensure responsible scraping.
20.3.5 Error Handling in Requests
When sending multiple requests in a scraping script, it's essential to handle potential errors, such as timeouts or connection issues. The `requests` library provides tools to manage these exceptions.
- Timeouts: Specify a timeout to prevent the request from hanging indefinitely. If the request takes longer than the specified timeout, it raises a `Timeout` exception:

  ```python
  try:
      response = requests.get(url, timeout=5)
  except requests.Timeout:
      print("The request timed out.")
  ```

- Connection Errors: If there's an issue with the internet connection or server, `requests` raises a `ConnectionError`. Handling this allows the script to continue or retry:

  ```python
  try:
      response = requests.get(url)
  except requests.ConnectionError:
      print("Failed to connect to the server.")
  ```
These error-handling strategies improve the reliability of scraping scripts, especially when working with numerous pages.
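One common pattern, sketched below with illustrative retry settings, is to combine both exception handlers with a bounded number of retries and a pause between attempts:

```python
import time

import requests

def fetch_with_retries(url, retries=3, delay=2):
    """Attempt a GET request up to `retries` times, pausing `delay` seconds between attempts."""
    for attempt in range(retries):
        try:
            return requests.get(url, timeout=5)
        except (requests.Timeout, requests.ConnectionError) as error:
            print(f"Attempt {attempt + 1} failed: {error}")
            time.sleep(delay)
    return None  # all attempts failed

response = fetch_with_retries("https://example.com")
```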
20.3.6 Example: Retrieving HTML with Requests
The following example demonstrates a complete workflow for retrieving HTML content from a website:
```python
import requests

# Define the target URL and headers
url = "https://example.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}

try:
    # Send a GET request with a timeout and headers
    response = requests.get(url, headers=headers, timeout=5)

    # Check if the request was successful
    if response.status_code == 200:
        print("Successfully retrieved the page.")
        html_content = response.text  # Extract the HTML content
    else:
        print("Failed to retrieve the page. Status code:", response.status_code)
except requests.Timeout:
    print("The request timed out.")
except requests.ConnectionError:
    print("Connection error occurred.")
```

```
Successfully retrieved the page.
```
This example combines the elements discussed above: setting headers, managing timeouts, checking the status code, and handling errors. Such a setup is foundational for a robust scraping process.
20.3.7 Advanced Requests: POST
While most web scraping involves `GET` requests, occasionally `POST` requests are required. This occurs when interacting with forms or websites requiring data submission (e.g., login forms or search queries). To send a `POST` request with `requests`, you provide data as a dictionary using the `data` parameter:
```python
url = "https://example.com/login"
data = {"username": "user", "password": "pass"}

response = requests.post(url, data=data)
```
The server processes the submitted data and typically responds with a page reflecting the action taken (e.g., a logged-in user page).
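Because later requests usually need the session cookie returned after a login, a `requests.Session` object, which persists cookies across requests, is often more practical than a standalone `POST`. A sketch with a hypothetical login endpoint:

```python
# A Session keeps cookies (e.g., a login token) across requests
session = requests.Session()
session.post("https://example.com/login", data={"username": "user", "password": "pass"})

# Subsequent requests through the same session remain "logged in"
response = session.get("https://example.com/account")
```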
20.4 Parsing HTML with BeautifulSoup
After retrieving a webpage's HTML content, the next step is to parse and navigate the HTML structure to extract relevant data. The Python library `BeautifulSoup`, part of the `bs4` package, provides a straightforward interface for parsing HTML and XML documents, allowing us to locate and manipulate elements with ease. `BeautifulSoup` works by converting the HTML into a parse tree structure, which mirrors the nested organization of the HTML's Document Object Model (DOM).
In this section, we cover the essentials of using `BeautifulSoup`, including initializing a `BeautifulSoup` object, locating elements by tags, and using attributes and class names for targeted data extraction.
20.4.1 Setting Up BeautifulSoup
To use `BeautifulSoup`, you must import it from the `bs4` package and initialize it with the HTML content obtained through `requests`. The `BeautifulSoup` constructor requires two parameters: the HTML content (usually as a string) and the parser type. The most commonly used parser is `html.parser`, a built-in Python parser, but `lxml` and `html5lib` are also popular for more complex HTML structures.
```python
from bs4 import BeautifulSoup

# Sample HTML content
html_content = "<html><body><h1>Hello, World!</h1></body></html>"

# Initialize a BeautifulSoup object
soup = BeautifulSoup(html_content, "html.parser")
```
Here, `soup` becomes the `BeautifulSoup` object representing the HTML document. We can now navigate through `soup` to locate specific elements and extract data.
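A quick way to confirm that the parse tree was built as expected is `.prettify()`, which re-renders the document with one tag per line and indentation:

```python
print(soup.prettify())
```

```
<html>
 <body>
  <h1>
   Hello, World!
  </h1>
 </body>
</html>
```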
20.4.3 Locating Elements by Tag Name
The simplest way to extract data is by locating elements using their tag name. For example, to retrieve the first `<h1>` element in an HTML document, you can use `.find()`:

```python
# Extract the first h1 element
heading = soup.find("h1")
print(heading.text)
```

```
Hello, World!
```
`.find_all()` works similarly but returns a list of all matching elements. This is useful when multiple elements share the same tag, such as a series of paragraphs:
```python
# Sample HTML with multiple paragraphs
html_content = """
<html>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""
soup = BeautifulSoup(html_content, "html.parser")
paragraphs = soup.find_all("p")

# Print each paragraph's text content
for p in paragraphs:
    print(p.text)
```

```
First paragraph.
Second paragraph.
```
20.4.4 Using Attributes and Classes for Targeted Extraction
In many cases, HTML tags are assigned specific attributes, such as `class` or `id`, to organize or style elements. These attributes are helpful for locating elements that share the same tag but need differentiation based on content or position.
- Class Attribute: Use the `class_` parameter in `.find()` or `.find_all()` to locate elements with specific classes.

  ```python
  # Sample HTML with classes
  html_content = '<p class="intro">Introduction paragraph.</p><p class="info">Information paragraph.</p>'
  soup = BeautifulSoup(html_content, "html.parser")

  # Extract paragraph with class "intro"
  intro_paragraph = soup.find("p", class_="intro")
  print(intro_paragraph.text)
  ```

  ```
  Introduction paragraph.
  ```

- ID Attribute: Since IDs are unique within an HTML document, you can use `.find()` with the `id` parameter to locate a specific element.

  ```python
  # Sample HTML with an ID
  html_content = '<div id="main-content">Main content area.</div>'
  soup = BeautifulSoup(html_content, "html.parser")

  main_content = soup.find("div", id="main-content")
  print(main_content.text)
  ```

  ```
  Main content area.
  ```

- CSS Selectors: For more complex extractions, use `.select()` with CSS selectors, allowing combinations of tags, classes, and IDs.

  ```python
  # Sample HTML with nested elements
  html_content = '<div class="container"><p class="info">Information paragraph.</p></div>'
  soup = BeautifulSoup(html_content, "html.parser")

  # Use a CSS selector to find the paragraph inside the div with class "container"
  info_paragraph = soup.select(".container .info")[0]
  print(info_paragraph.text)
  ```

  ```
  Information paragraph.
  ```
20.4.5 Extracting Text Content
Once the desired element is located, `.text` can be used to retrieve the text content between the tags. `.text` strips away HTML tags, leaving only the inner text.
```python
# Sample HTML
html_content = "<h1>Welcome to Web Scraping</h1>"
soup = BeautifulSoup(html_content, "html.parser")

# Extract the text from the h1 element
heading = soup.find("h1").text
print(heading)
```

```
Welcome to Web Scraping
```
Alternatively, `.get_text()` offers the same result and can take optional parameters to modify the output, such as removing extra whitespace.
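For example, `strip=True` trims the whitespace around each piece of text, and `separator` controls how text from nested tags is joined; both are standard `get_text()` parameters:

```python
html_content = "<p>  Some   <b>bold</b> text.  </p>"
soup = BeautifulSoup(html_content, "html.parser")

print(soup.get_text(separator=" ", strip=True))
```

```
Some bold text.
```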
20.4.6 Extracting Attribute Values
In addition to text content, HTML elements often contain attributes (e.g., URLs in `<a>` tags). To retrieve an attribute's value, access it like a dictionary:
```python
# Sample HTML with an anchor tag
html_content = '<a href="https://example.com">Example Link</a>'
soup = BeautifulSoup(html_content, "html.parser")

# Extract the URL from the href attribute
link = soup.find("a")
print(link["href"])
```

```
https://example.com
```
This technique is particularly useful for gathering links, image sources, or metadata associated with elements.
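Note that indexing with `link["href"]` raises a `KeyError` if the attribute is missing; `.get()` returns `None` instead, which is safer when scanning many links. A short sketch:

```python
html_content = '<a href="https://example.com">Example Link</a><a>No link target</a>'
soup = BeautifulSoup(html_content, "html.parser")

for link in soup.find_all("a"):
    href = link.get("href")  # None when the attribute is absent
    if href is not None:
        print(link.text, "->", href)
```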
20.4.7 Practical Example: Extracting a List of Headings
To demonstrate the functionality of `BeautifulSoup`, here's a practical example where we extract a list of headings from the Wikipedia home page:
```python
import requests
from bs4 import BeautifulSoup

# Target URL
url = "https://en.wikipedia.org/wiki/Main_Page"

# Send a GET request and parse the HTML
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find all headlines in h2 tags with the class "mp-h2"
headings = soup.find_all("h2", class_="mp-h2")

# Print each headline text
for heading in headings:
    print(heading.text)
```

```
From today's featured article
Did you know ...
In the news
On this day
Today's featured picture
Other areas of Wikipedia
Wikipedia's sister projects
Wikipedia languages
```
In this example:
- `requests.get(url)` retrieves the HTML.
- `soup.find_all("h2", class_="mp-h2")` locates all `<h2>` tags with the class `mp-h2`.
20.4.8 Handling Nested Elements
Sometimes, desired data is nested within multiple tags. `BeautifulSoup` allows chaining `.find()` or `.find_all()` calls to navigate nested structures.
For example:
```python
# Sample HTML with nested elements
html_content = """
<div class="outer">
  <div class="inner">
    <p>Nested paragraph content.</p>
  </div>
</div>
"""
soup = BeautifulSoup(html_content, "html.parser")

# Locate the nested paragraph within the inner div
nested_paragraph = soup.find("div", class_="outer").find("div", class_="inner").find("p")
print(nested_paragraph.text)
```

```
Nested paragraph content.
```
By chaining `.find()` calls, we navigate through `div.outer` to `div.inner` and finally to the `<p>` tag containing the text.
20.4.9 Working with HTML Tables
Tables often contain structured data that can be parsed directly. Using `.find_all("tr")`, you can iterate through each row (`<tr>`) and extract data from its cells (`<td>`).
```python
# Sample HTML table
html_content = """
<table>
  <tr><td>Row 1, Cell 1</td><td>Row 1, Cell 2</td></tr>
  <tr><td>Row 2, Cell 1</td><td>Row 2, Cell 2</td></tr>
</table>
"""
soup = BeautifulSoup(html_content, "html.parser")

# Parse table rows and cells
rows = soup.find_all("tr")
for row in rows:
    cells = row.find_all("td")
    row_data = [cell.text for cell in cells]
    print(row_data)
```

```
['Row 1, Cell 1', 'Row 1, Cell 2']
['Row 2, Cell 1', 'Row 2, Cell 2']
```
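Following the data cleaning and storage step from Section 20.1.3, the extracted rows can be written straight to a CSV file with the standard `csv` module (continuing from the example above):

```python
import csv

# Collect the rows first, then save them in one pass
table_data = []
for row in soup.find_all("tr"):
    table_data.append([cell.text for cell in row.find_all("td")])

with open("table.csv", "w", newline="") as f:
    csv.writer(f).writerows(table_data)
```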
20.5 Exercises
Exercise 1: Basic Parsing and Text Extraction
Write a script that retrieves the title text (`<h1>` tag) from a sample HTML string:

```html
<html><body><h1>Sample Heading</h1></body></html>
```

Modify the script to print the title's text content.
Exercise 2: Extracting Multiple Paragraphs
Given the following HTML content, use `BeautifulSoup` to extract and print the text from each paragraph:

```html
<html>
  <body>
    <p>Paragraph 1 content.</p>
    <p>Paragraph 2 content.</p>
    <p>Paragraph 3 content.</p>
  </body>
</html>
```
Exercise 3: Attribute-Based Extraction
Given the HTML snippet below, use `BeautifulSoup` to find and print the text of the paragraph with the `class` attribute `highlight`:

```html
<html>
  <body>
    <p class="highlight">Highlighted text.</p>
    <p>Regular text.</p>
  </body>
</html>
```
Exercise 4: Extracting Links
Write a script to retrieve all hyperlinks (`<a>` tags) from a sample HTML document, printing both the text and `href` attribute for each link:

```html
<html>
  <body>
    <a href="https://example.com/page1">Page 1</a>
    <a href="https://example.com/page2">Page 2</a>
  </body>
</html>
```
Exercise 5: Nested Element Extraction
Given the following nested HTML structure, use `BeautifulSoup` to retrieve the text within the innermost paragraph:

```html
<html>
  <body>
    <div class="outer">
      <div class="inner">
        <p>Deeply nested content.</p>
      </div>
    </div>
  </body>
</html>
```
Exercise 6: Parsing Tables
Given the HTML table structure below, extract the contents of each cell in each row and print them as lists:

```html
<table>
  <tr><td>Row 1, Cell 1</td><td>Row 1, Cell 2</td></tr>
  <tr><td>Row 2, Cell 1</td><td>Row 2, Cell 2</td></tr>
</table>
```
Exercise 7: Applying CSS Selectors
Using `BeautifulSoup` and CSS selectors, extract the text from all `<p>` tags with the `class` attribute `info` from the following HTML:

```html
<html>
  <body>
    <p class="info">Info 1</p>
    <p class="info">Info 2</p>
    <p>Other paragraph</p>
  </body>
</html>
```