In programming, we often encounter tasks that require searching, extracting, or validating specific patterns within strings. Regular Expressions (regex or regexp) provide a powerful tool for these tasks by allowing us to define search patterns with precision. Regular expressions are particularly useful in text processing, making it easier to handle tasks such as finding email addresses, extracting numerical values, or validating formatted inputs.
At its core, a regular expression is a sequence of characters that form a pattern. This pattern describes the structure or sequence you are interested in finding within a larger text. Unlike simple string matching, which looks for one exact sequence of characters (e.g., searching for “data” locates only those four letters, not whole words such as “dataset” or “database”), regular expressions allow for more flexible and dynamic matching, capturing variations or patterns that meet certain criteria.
For instance, the pattern data\w* will match “data” itself as well as longer words that begin with it, such as “dataset” and “database”.
This flexibility is what makes regular expressions so powerful for text processing.
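As a rough illustration of the difference, the sketch below runs a literal pattern and the flexible pattern data\w* over the same sentence. The sample sentence and the variable names are chosen here purely for illustration.

```python
import re

# Sample sentence chosen for illustration
text = "The dataset is being updated in the database."

# A literal pattern finds only the four characters "data"
literal_hits = re.findall(r"data", text)

# Adding \w* extends each match to the end of the word
flexible_hits = re.findall(r"data\w*", text)

print(literal_hits)   # ['data', 'data']
print(flexible_hits)  # ['dataset', 'database']
```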
To build effective regular expressions, it’s helpful to understand the components that make up a pattern. Each character in a regular expression can have a special meaning, enabling more complex pattern definitions.
Literal Characters: Most characters (like letters or numbers) match themselves. For instance, the pattern data matches only the exact character sequence “data”.
Metacharacters: These are special characters that have specific functions in regex syntax. Some commonly used metacharacters are:
- . (dot): Matches any single character except a newline.
- ^ (caret): Indicates the start of a string.
- $ (dollar): Indicates the end of a string.
- *, +, ?: Define the frequency or repetition of patterns.
Character Classes: Defined by brackets [ ], character classes allow you to match any one of a set of characters.
- [abc] will match either “a”, “b”, or “c”.
- [0-9] will match any digit.
Escape Sequences: Some characters (such as . or *) have special meanings, so you may need to escape them with a backslash (\) if you want to match them literally. For instance, \. will match a period rather than any character.
Quantifiers: These symbols define how many times a character or pattern must appear.
- * matches zero or more occurrences.
- + matches one or more occurrences.
- ? matches zero or one occurrence.
Regular expressions are particularly useful when handling unstructured data.
Regular expressions are versatile enough to support a wide variety of tasks in text processing, making them invaluable in fields like data science, web development, and natural language processing.
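The building blocks described above (metacharacters, character classes, escaping, quantifiers, and anchors) can be tried out in a few lines. This is only a minimal sketch; the sample string and variable names are made up for the illustration.

```python
import re

# Hypothetical sample string used only for this sketch
text = "v1.0 released; v1x0 is a typo"

# An unescaped dot matches any character, so both spellings are found
any_char = re.findall(r"v1.0", text)

# An escaped dot matches only a literal period
literal_dot = re.findall(r"v1\.0", text)

# A character class plus the + quantifier: runs of one or more digits
digit_runs = re.findall(r"[0-9]+", text)

# ^ and $ anchor the pattern to the start and end of the string
starts_with_v1 = bool(re.search(r"^v1", text))
ends_with_typo = bool(re.search(r"typo$", text))

print(any_char)        # ['v1.0', 'v1x0']
print(literal_dot)     # ['v1.0']
print(digit_runs)      # ['1', '0', '1', '0']
print(starts_with_v1)  # True
print(ends_with_typo)  # True
```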
Python’s re library is designed to work with regular expressions, providing several powerful functions for pattern matching and text manipulation. Below, we explore some of the key functions in the re library, each accompanied by examples to demonstrate their practical applications.
re Library
- re.search(): This function searches for the first occurrence of a pattern within a string. If it finds a match, it returns a match object; otherwise, it returns None.
- re.findall(): This function finds all non-overlapping matches of a pattern in a string and returns them as a list.
- re.match(): This function checks for a match only at the beginning of a string.
- re.sub(): This function replaces occurrences of a pattern with a specified replacement string.
- re.split(): This function splits a string by occurrences of a pattern, returning a list of substrings.
Each function serves a unique purpose and can be applied to various text-processing tasks.
re.search()
The re.search() function finds the first match of a pattern within a string. This is useful when you only need to confirm the existence of a pattern or retrieve the first match.
Example 1: Searching for a word that starts with “data”
import re

text = "The dataset is being updated in the database."
pattern = r"data\w*"

# Search for the first occurrence of the pattern
result = re.search(pattern, text)
if result:
    print("Found:", result.group())

Found: dataset
In this example:
- The pattern data\w* matches “data” followed by zero or more word characters (\w*).
- The result.group() method returns the matched string.
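Besides group(), a match object also records where the match was found. The short sketch below reuses the sentence from the example and shows the start(), end(), and span() methods of the standard match object.

```python
import re

text = "The dataset is being updated in the database."
result = re.search(r"data\w*", text)

if result:
    print(result.group())  # dataset
    print(result.start())  # 4  (index of the first matched character)
    print(result.end())    # 11 (index just past the last matched character)
    print(result.span())   # (4, 11)
```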
Example 2: Checking for the presence of a phone number pattern
text = "Contact us at 123-867-5309."
pattern = r"\d{3}-\d{3}-\d{4}"

result = re.search(pattern, text)
if result:
    print("Phone number found:", result.group())

Phone number found: 123-867-5309
The \d metacharacter matches any digit; for ASCII text it is equivalent to [0-9]. When followed by {3}, it matches exactly three consecutive digits.
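When the goal is to validate that an entire string has this phone-number shape, rather than to find such a number somewhere inside a longer text, one possible approach is re.fullmatch(), which succeeds only if the pattern covers the whole string. The helper name is_phone_number below is hypothetical and exists only for this sketch.

```python
import re

PHONE_PATTERN = r"\d{3}-\d{3}-\d{4}"

def is_phone_number(candidate):
    """Return True if the whole string has the shape ###-###-####."""
    # is_phone_number is a hypothetical helper, not part of the re module
    return re.fullmatch(PHONE_PATTERN, candidate) is not None

print(is_phone_number("123-867-5309"))       # True
print(is_phone_number("123-867-53099"))      # False (one digit too many)
print(is_phone_number("Call 123-867-5309"))  # False (extra leading text)
```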
re.findall()
The re.findall() function returns all matches of a pattern as a list, making it ideal for finding multiple occurrences of a pattern within a string.
Example 1: Extracting all email addresses
text = "Emails: alice@example.com, bob@example.org, charlie@test.com"
pattern = r"\b\w+@\w+\.\w+\b"

emails = re.findall(pattern, text)
print("Emails found:", emails)

Emails found: ['alice@example.com', 'bob@example.org', 'charlie@test.com']
In this example:
- \b\w+@\w+\.\w+\b matches email addresses by looking for word characters around the “@” and “.” symbols.
- re.findall() captures all email addresses in the text.
Example 2: Finding all words starting with “stat”
text = "Statistics is a fascinating field, with stats and statistical methods widely applied."
pattern = r"\bstat\w*"

matches = re.findall(pattern, text)
print("Words found:", matches)

Words found: ['stats', 'statistical']
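Note that “Statistics” is missing from the result because the pattern is case-sensitive and the word starts with a capital “S”. A minimal sketch of one way to include it is to pass the re.IGNORECASE flag; the variable name matches_ci is chosen here for illustration.

```python
import re

text = "Statistics is a fascinating field, with stats and statistical methods widely applied."

# re.IGNORECASE makes "stat" match regardless of letter case
matches_ci = re.findall(r"\bstat\w*", text, re.IGNORECASE)
print(matches_ci)  # ['Statistics', 'stats', 'statistical']
```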
re.match()
The re.match() function checks for a match only at the beginning of a string. It returns None if the pattern does not match there, even if it would match later in the string.
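The difference from re.search() can be sketched by applying the same pattern with both functions to a string in which the match does not sit at the start. The sample sentence is made up for this illustration.

```python
import re

# Hypothetical sentence where the postal code is NOT at the beginning
text = "The postal code is 12345-6789."
pattern = r"\d{5}-\d{4}"

at_start = re.match(pattern, text)   # only looks at the beginning
anywhere = re.search(pattern, text)  # scans the whole string

print(at_start)          # None
print(anywhere.group())  # 12345-6789
```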
Example 1: Verifying the format of a postal code at the beginning of a string
text = "12345-6789 is the postal code."
pattern = r"^\d{5}-\d{4}"

if re.match(pattern, text):
    print("Valid postal code format.")
else:
    print("Invalid format.")

Valid postal code format.
Here:
- ^\d{5}-\d{4} matches a 5-digit number, a hyphen, and a 4-digit number, specifically at the start of the string.
re.sub()
The re.sub() function replaces all occurrences of a pattern with a specified replacement string, making it useful for sanitizing or formatting data.
Example 1: Replacing phone numbers with “[PHONE]”
text = "Reach us at 123-456-7890 or 987-654-3210."
pattern = r"\d{3}-\d{3}-\d{4}"

new_text = re.sub(pattern, "[PHONE]", text)
print(new_text)

Reach us at [PHONE] or [PHONE].
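By default every occurrence is replaced. If only some should be, re.sub() accepts an optional count argument, and the related re.subn() also reports how many replacements were made. The sketch below reuses the sentence from the example; the variable names are illustrative.

```python
import re

text = "Reach us at 123-456-7890 or 987-654-3210."
pattern = r"\d{3}-\d{3}-\d{4}"

# count=1 limits the substitution to the first occurrence
first_only = re.sub(pattern, "[PHONE]", text, count=1)
print(first_only)  # Reach us at [PHONE] or 987-654-3210.

# re.subn returns the new string together with the number of replacements
redacted, n_replaced = re.subn(pattern, "[PHONE]", text)
print(redacted)    # Reach us at [PHONE] or [PHONE].
print(n_replaced)  # 2
```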
Example 2: Standardizing date formats
text = "Event on 2023-01-25, follow-up on 01/26/2023."
pattern = r"(\d{4})-(\d{2})-(\d{2})"

# Reformat dates to MM/DD/YYYY
new_text = re.sub(pattern, r"\2/\3/\1", text)
print(new_text)

Event on 01/25/2023, follow-up on 01/26/2023.
In this example:
- (\d{4})-(\d{2})-(\d{2}) captures the year, month, and day.
- \2/\3/\1 reorders the components to MM/DD/YYYY.
re.split()
The re.split() function splits a string based on a regular expression, which is useful when splitting by complex delimiters.
Example 1: Splitting text by multiple delimiters
text = "apple; orange, banana|grape"
pattern = r"[;,|]"

fruits = re.split(pattern, text)
print(fruits)

['apple', ' orange', ' banana', 'grape']
Here, the pattern [;,|] treats a semicolon (;), a comma (,), or a pipe (|) as a delimiter. The spaces that follow the delimiters remain part of the resulting substrings.
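If the leftover spaces are unwanted, one option is to let the delimiter pattern absorb the surrounding whitespace. The following is a small sketch of that idea, using the same input string.

```python
import re

text = "apple; orange, banana|grape"

# \s* on both sides of the delimiter swallows any adjacent whitespace
fruits_clean = re.split(r"\s*[;,|]\s*", text)
print(fruits_clean)  # ['apple', 'orange', 'banana', 'grape']
```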
Here’s a list of commonly used special characters (also called escape sequences or metacharacters) in regular expressions, along with their functions:
- ^: Matches the start of a string.
- $: Matches the end of a string.
- \b: Matches a word boundary (position between a word character and a non-word character).
- \B: Matches a position that is not a word boundary.
- .: Matches any character except a newline.
- \d: Matches any digit, equivalent to [0-9] for ASCII text.
- \D: Matches any non-digit character, equivalent to [^0-9] for ASCII text.
- \w: Matches any word character (alphanumeric or underscore), equivalent to [a-zA-Z0-9_] for ASCII text.
- \W: Matches any non-word character, equivalent to [^a-zA-Z0-9_] for ASCII text.
- \s: Matches any whitespace character (spaces, tabs, newlines).
- \S: Matches any non-whitespace character.
- *: Matches 0 or more occurrences of the preceding element.
- +: Matches 1 or more occurrences of the preceding element.
- ?: Matches 0 or 1 occurrence of the preceding element.
- {n}: Matches exactly n occurrences of the preceding element.
- {n,}: Matches n or more occurrences.
- {n,m}: Matches between n and m occurrences.
- ( ): Groups a pattern together, allowing you to apply quantifiers to the entire group or capture matches.
- |: Alternation operator, meaning “or” (e.g., data|info matches “data” or “info”).
- \: Escapes special characters, allowing them to be used as literal characters (e.g., \. matches a period).
- \1, \2, etc.: Refers to matched groups in the pattern, useful for back-references.
- (?=...): Positive lookahead, ensures that what follows matches ...
- (?!...): Negative lookahead, ensures that what follows does not match ...
- (?<=...): Positive lookbehind, ensures that what precedes matches ...
- (?<!...): Negative lookbehind, ensures that what precedes does not match ...
- \t: Matches a tab character.
- \n: Matches a newline character.
- \r: Matches a carriage return character.
- \f: Matches a form feed character.
- \v: Matches a vertical tab character.
- \0: Matches the null character.
Regular expressions become particularly powerful when we combine multiple elements to create complex patterns. This section explores some advanced techniques to build sophisticated expressions, allowing for precise control over pattern matching.
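A few of the special characters listed above are easier to grasp when seen in action. The sketch below exercises \b, \B, \s, and \D on a made-up sample string; the string and variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample string; \t is a real tab character
text = "cat concat\tcatalog 42"

# \b on both sides: only the standalone word matches
whole_word = re.findall(r"\bcat\b", text)

# \B: "cat" must NOT start at a word boundary (the one inside "concat")
inside_word = re.findall(r"\Bcat", text)

# \s+ splits on any run of whitespace, including the tab
tokens = re.split(r"\s+", text)

# \D matches every non-digit, so removing those leaves only the digits
digits_only = re.sub(r"\D", "", text)

print(whole_word)   # ['cat']
print(inside_word)  # ['cat']
print(tokens)       # ['cat', 'concat', 'catalog', '42']
print(digits_only)  # 42
```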
Grouping is achieved by enclosing parts of a pattern within parentheses (). Groups allow you to apply quantifiers to an entire sub-pattern and to capture the matched text for later use.
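To illustrate how a group changes what a quantifier applies to, the sketch below compares ab+ (the + repeats only the “b”) with (ab)+ (the + repeats the whole group). The word list is invented for the illustration, and re.fullmatch() is used so that each word must match in its entirety.

```python
import re

# Hypothetical test words
words = ["ab", "abab", "abb"]

# Without a group, + applies only to the preceding character "b"
b_repeated = [w for w in words if re.fullmatch(r"ab+", w)]

# With a group, + applies to the whole sequence "ab"
ab_repeated = [w for w in words if re.fullmatch(r"(ab)+", w)]

print(b_repeated)   # ['ab', 'abb']
print(ab_repeated)  # ['ab', 'abab']
```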
Example: Capturing parts of a date
import re

text = "Today's date is 2024-11-03."
pattern = r"(\d{4})-(\d{2})-(\d{2})"

match = re.search(pattern, text)
if match:
    year, month, day = match.groups()
    print(f"Year: {year}, Month: {month}, Day: {day}")

Year: 2024, Month: 11, Day: 03
In this example:
- (\d{4})-(\d{2})-(\d{2}) captures the year, month, and day as separate groups.
- The match.groups() method returns a tuple with the captured parts.
Lookahead and lookbehind assertions allow you to match a pattern based on what follows or precedes it, without including that part in the match.
- (?=...): Asserts that a match is followed by a specific pattern.
- (?!...): Asserts that a match is not followed by a specific pattern.
- (?<=...): Asserts that a match is preceded by a specific pattern.
- (?<!...): Asserts that a match is not preceded by a specific pattern.
Example 1: Finding words followed by “ing” (positive lookahead)
text = "The following items are walking, talking, and reading."
pattern = r"\b\w+(?=ing\b)"

matches = re.findall(pattern, text)
print(matches)

['follow', 'walk', 'talk', 'read']
Example 2: Finding numbers not preceded by a dollar sign (negative lookbehind)
text = "Price is $100 but I paid 150."
pattern = r"(?<!\$)\b\d+\b"

matches = re.findall(pattern, text)
print(matches)

['150']
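The two remaining assertion types work the same way. As a minimal sketch, the code below uses a positive lookbehind to pick out the number that does follow a dollar sign, and a negative lookahead to pick out numbers that are not followed by a percent sign. The second sample string is made up for this illustration.

```python
import re

# Positive lookbehind: digits that ARE preceded by a dollar sign
price_text = "Price is $100 but I paid 150."
dollar_amounts = re.findall(r"(?<=\$)\d+", price_text)
print(dollar_amounts)  # ['100']

# Negative lookahead: whole numbers that are NOT followed by a percent sign
score_text = "Scores: 90%, 75, 60%"  # hypothetical sample
plain_numbers = re.findall(r"\b\d+\b(?!%)", score_text)
print(plain_numbers)  # ['75']
```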
| Operator
The | operator allows you to specify multiple patterns, matching if any of the alternatives are found.
Example: Matching multiple file extensions
text = "Files: report.pdf, image.jpg, document.docx"
pattern = r"\b\w+\.(pdf|jpg|docx)\b"

matches = re.findall(pattern, text)
print(matches)

['pdf', 'jpg', 'docx']
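Here, the pattern matches any word followed by a file extension (either “pdf”, “jpg”, or “docx”). Because the alternatives are wrapped in a capturing group, re.findall() returns only the captured extensions rather than the full file names.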
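If the full file names are wanted instead, one option is a non-capturing group, written (?:...), which groups the alternatives without capturing them. The sketch below applies it to the same text; the variable name filenames is illustrative.

```python
import re

text = "Files: report.pdf, image.jpg, document.docx"

# (?:...) groups the alternatives but does not create a capture,
# so findall returns the whole match
filenames = re.findall(r"\b\w+\.(?:pdf|jpg|docx)\b", text)
print(filenames)  # ['report.pdf', 'image.jpg', 'document.docx']
```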
Character classes [ ] allow for more flexibility by matching any one character within the brackets. You can also define ranges within classes to match multiple characters more succinctly.
- [a-z] matches any lowercase letter, while [0-9] matches any digit.
- [^...] matches any character not in the brackets.
Example 1: Matching only vowels
text = "Regular expressions are useful."
pattern = r"[aeiou]"

vowels = re.findall(pattern, text)
print(vowels)

['e', 'u', 'a', 'e', 'e', 'i', 'o', 'a', 'e', 'u', 'e', 'u']
Example 2: Matching any character except vowels
# Same text as in Example 1
text = "Regular expressions are useful."
pattern = r"[^aeiou]"

non_vowels = re.findall(pattern, text)
print(non_vowels)

['R', 'g', 'l', 'r', ' ', 'x', 'p', 'r', 's', 's', 'n', 's', ' ', 'r', ' ', 's', 'f', 'l', '.']
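Ranges can be combined with other elements in the same way. As a rough sketch, the code below uses [A-Z] and [0-9] to find codes made of one uppercase letter followed by digits, and a negated range to strip everything that is not a digit. Both sample strings are invented for the illustration.

```python
import re

# Hypothetical sample: seat codes are one uppercase letter plus digits
seat_text = "Seats A1, b22 and C333 are taken; row Z9 is free."
seat_codes = re.findall(r"\b[A-Z][0-9]+\b", seat_text)
print(seat_codes)  # ['A1', 'C333', 'Z9']

# Negated range: remove every character that is not a digit
phone_text = "Tel: (555) 010-9999"  # hypothetical sample
digits = re.sub(r"[^0-9]", "", phone_text)
print(digits)  # 5550109999
```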
Backreferences allow you to reuse a captured group within the same pattern. This is especially useful for finding repeated patterns.
Example: Matching repeated words
text = "The the test was a success."
pattern = r"\b(\w+)\s+\1\b"

matches = re.findall(pattern, text, re.IGNORECASE)
print(matches)

['The']
Here:
- (\w+) captures a word, and \1 references this captured group, matching any repeated word.
- re.IGNORECASE lets “The” and “the” count as the same word.
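The same backreference can also be used in a replacement string. The following sketch, under the same assumptions as the example above, collapses the repeated word into a single occurrence with re.sub().

```python
import re

text = "The the test was a success."

# \1 in the replacement keeps the first copy of the repeated word
cleaned = re.sub(r"\b(\w+)\s+\1\b", r"\1", text, flags=re.IGNORECASE)
print(cleaned)  # The test was a success.
```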
Quantifiers, such as *, +, and {n,m}, allow you to control the frequency of matched elements, providing even more flexibility for complex patterns.
- *: Matches zero or more occurrences.
- +: Matches one or more occurrences.
- ?: Matches zero or one occurrence.
- {n}: Matches exactly n occurrences.
- {n,}: Matches n or more occurrences.
- {n,m}: Matches between n and m occurrences.
Example 1: Matching sequences of digits with varying lengths
text = "Order numbers: 123, 4567, 89, 23456"
pattern = r"\b\d{2,4}\b"

matches = re.findall(pattern, text)
print(matches)

['123', '4567', '89']
This pattern matches numbers that are 2 to 4 digits long.
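The other quantifiers behave analogously. The sketch below, using a made-up sample string, shows ? for an optional character, {n} for an exact count, and {n,} for an open-ended minimum.

```python
import re

# Hypothetical sample string
text = "color colour colouur 7 42 2024"

# ? makes the "u" optional, so both spellings match (but not "colouur")
spellings = re.findall(r"\bcolou?r\b", text)

# {4} requires exactly four digits
four_digits = re.findall(r"\b\d{4}\b", text)

# {2,} requires two or more digits
two_or_more = re.findall(r"\b\d{2,}\b", text)

print(spellings)    # ['color', 'colour']
print(four_digits)  # ['2024']
print(two_or_more)  # ['42', '2024']
```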
By combining groups, lookaheads, alternation, and quantifiers, you can construct highly specific patterns.
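As one small sketch of such a combination, the pattern below uses alternation for the currency code, capturing groups for the code and the amount, a quantified non-capturing group (?:...) for optional cents, and a lookahead that requires a comma or the end of the string. The sample text and the currency format are assumptions made for this illustration.

```python
import re

# Hypothetical sample text
text = "Totals: USD 12.50, EUR 8, GBP 3.20, USD 100"

pattern = (
    r"\b(USD|EUR)"         # alternation inside a capturing group
    r"\s"                  # a single whitespace character
    r"(\d+(?:\.\d{2})?)"   # amount, with optional two-digit cents
    r"(?=,|$)"             # lookahead: followed by a comma or end of string
)

amounts = re.findall(pattern, text)
print(amounts)  # [('USD', '12.50'), ('EUR', '8'), ('USD', '100')]
```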
Example: Extracting full names in the format “Last, First M.”
text = "Attendees: Smith, John A.; Doe, Jane B."
pattern = r"\b([A-Z][a-z]+),\s([A-Z][a-z]+)\b"

names = re.findall(pattern, text)
print(names)

[('Smith', 'John'), ('Doe', 'Jane')]
Here:
- [A-Z][a-z]+ matches capitalized words.
- \s matches a whitespace character.
- www.something.com, something.org, or something.edu.
  Input: "Visit us at www.example.com or go to research.org for more info."
  Expected output: ['example.com', 'research.org']
- Use re.sub() to remove all punctuation from a sentence, leaving only alphanumeric characters and spaces.
  Input: "Hello, world! Let's test this: are you ready?"
  Expected output: "Hello world Lets test this are you ready"
- MM/DD/YYYY, YYYY-MM-DD, and DD.MM.YYYY.
  Input: "We met on 2024-11-03, and we’ll meet again on 11/03/2024."
  Expected output: ['2024-11-03', '11/03/2024']
- Input: "She sells seashells by the seashore."
  Expected output: ['She', 'sells', 'seashells', 'seashore']
- Use re.sub() to replace all Social Security numbers (formatted as ###-##-####) with [REDACTED].
  Input: "Client's SSN is 123-45-6789."
  Expected output: "Client's SSN is [REDACTED]."
- @, #, !, $).
  Password123!
  weakpass
- Input: "Dr. Smith, John A. and Dr. Brown, Lisa attended the meeting."
  Expected output: [('Smith', 'John', 'A.'), ('Brown', 'Lisa', '')]
- Input: "This is a test test sentence."
  Expected output: ['test']
- Input: "Apples; oranges, and bananas are tasty."
  Expected output: ['Apples', 'oranges', 'and', 'bananas', 'are', 'tasty']
- .pdf, .docx, or .xlsx extensions.
  Input: "Documents include report.pdf, summary.docx, and data.xlsx."
  Expected output: ['report.pdf', 'summary.docx', 'data.xlsx']
- Input: "What is your name? My name is Alice. Where are you from?"
  Expected output: ['My name is Alice']