The Power Of Regex For Data Quality Testing

In our previous articles of ETL Testing Series, we explored the different stages of ETL testing, emphasizing the importance of verifying data integrity throughout the transformation pipeline and various aspect of Data Quality Testing. Today, we’ll delve deeper into a powerful tool for data quality testing: Regular Expressions (Regex).

RECAP ON DATA QUALITY TESTING

Data quality testing ensures the data used for analysis is accurate, complete, consistent, and usable. Imagine building a house on a shaky foundation – poor quality data leads to unreliable insights and flawed decisions. Data profiling is also a key step in data quality testing, analyzes data to understand its structure, content, and distribution.

REGULAR EXPRESSIONS: TAMING THE TEXTUAL WILD WEST

Regex in data quality

Ever try finding a specific word in a massive document? That’s where regex, short for regular expression, comes in. Imagine it as a supercharged search tool for text.
They use a concise syntax to define complex patterns, allowing you to efficiently identify specific data elements or validate data formats. It can find things like:

  • Specific words or phrases (like searching for your name in a long email thread).

  • Certain formats or pattern (like phone numbers or words starting or ending with specific string).

Think of it like a detective for text! It sniffs out exactly what you’re looking for (like a cat :D) .

Neko searching regex for data quality testing

So, where do you use this detective? In many places! Here are a few:

  • Finding things fast in a large text block

  • Validation in Online forms, use regex to check if your email address or phone number is formatted properly.

  • Found a typo repeated everywhere? Regex can replace it all at once, saving you tons of time.

  • Got text with weird formatting or extra characters? Regex can help scrub it clean.

  • Many programming languages use regex behind the scenes to manipulate text in clever ways.

This Capabilities make SQL and Regex, a match made in Heaven for Data Quality Testing. Many Data Warehouse platforms support integrating regex within queries. This enables you to leverage regex patterns directly in data validation checks.
Imagine you have a list of email addresses, and you want to find all entries that don’t follow a standard format (e.g., missing the “@” symbol or having an invalid domain name). You can use a regex pattern like below to easily identify these non-conforming email addresses in your list.

Bash

^[a-zA-Z0-9.!#$%&*+/=?^_{|}~-]+@a-zA-Z0-9: ?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9]?(?:.a-zA-Z0-9: ?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9]?)*$`

DATA QUALITY TESTING WITH REGEX: REAL-WORLD EXAMPLES

Let’s explore more real world example where Regex can help with Data Quality Testing for data of an E-Commerce company.
Here are some common data quality tests you can perform using regex with accompanying SQL queries:

1. VALIDATING EMAIL ADDRESSES:

To verify quality of email address, you need to ensure that all values in email columns should satisfy an expected format. That can be achieved with below query which utilizes regex to identify entries in the in email addresses that don’t conform to a standard format.

SQL

SELECT * FROM customers
WHERE email NOT REGEXP '^[a-zA-Z0-9.!#$%&*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$';

2. ENSURING PHONE NUMBER CONSISTENCY:

Phone numbers in any specific region follows predefined formate like digits should be 10 or number should start from 89 or follow “XXX-XXX-XXXX” format. Inclusion of these Data quality validation can be achieved with below regex.

SQL

SELECT * FROM orders
WHERE phone NOT REGEXP '^\d{3}-\d{3}-\d{4}$';

3. IDENTIFYING INVALID POSTAL CODES:

Now assume We need to validate postal codes in Canada. Which should follow format like “A1B 2C3”. Below query employs regex to detect invalid postal codes.

SQL

SELECT * FROM addresses
WHERE postal_code NOT REGEXP '^[A-Za-z]\d[A-Za-z]\s\d[A-Za-z]\d$';

4. ENFORCING DATA UNIQUENESS:

Data quality often requires ensuring specific columns contain unique values. Regex can help identify potential duplicates that might not be exact matches.
Here’s a simplified example:

SQL

SELECT * FROM products
WHERE name REGEXP '(^[a-zA-Z0-9]+)(.*)\1$';

This query targets the “products” table and aims to find rows where the product name might have duplicates with some variation at the end (e.g., “Shirt” vs. “Shirt (Large)”). It achieves this by capturing a pattern at the beginning of the name ((^[a-zA-Z0-9]+)) and checking for its repetition ((.*)\1$). Note that this is a simplified example and may require adjustments based on your specific needs.

5. VALIDATING DATES:

Dates are another common data point requiring format and consistency checks. While some databases have built-in date functions, regex can also be used for basic validation. Here’s an example assuming YYYY-MM-DD format:

SQL

SELECT * FROM transactions
WHERE purchase_date NOT REGEXP '^\d{4}-\d{2}-\d{

6. CHECKING FOR MISSING VALUES IN SPECIFIC COLUMNS:

Below query identifies rows in the “products” table where the “name” or “description” columns might be entirely empty (containing only whitespace characters). Note that this is just a basic example, and you might need to adapt the regex pattern depending on how your system handles empty values.

SQL

SELECT * FROM products
WHERE name REGEXP '^\s*$' OR description REGEXP '^\s*$';

7. VALIDATING DATA LENGTH:

This query ensures usernames in the “users” table fall within a specific length range (between 5 and 20 characters in this case). You can adjust the numbers in the curly braces (.{n,m}) to define your desired minimum and maximum length.

SQL

SELECT * FROM users
WHERE username REGEXP '^.{5,20}$';

8. ENFORCING CONSISTENCY IN CASE SENSITIVITY:

This scenario checks for inconsistencies in case sensitivity within an “articles” table. It searches for titles that start with an uppercase letter (^[A-Z]) and bodies that don’t follow the same pattern. This might indicate a need to enforce a specific case (uppercase or lowercase) for consistency.

SQL

SELECT * FROM articles
WHERE title REGEXP '^[A-Z].*$' AND body NOT REGEXP '^[A-Z].*$';

Remember, these are just a few examples. The possibilities for using regex in data quality testing are vast, and the specific tests you implement will depend on the nature of your data and the quality requirements of your project.
Regular expressions (regex) have proven themselves as a valuable tool for enhancing data quality within your ETL processes. By harnessing the power of regex patterns, you can efficiently validate data formats, identify inconsistencies, and ensure the integrity of your information throughout the transformation pipeline.
However, regex is just one weapon in your data quality arsenal. Remember to explore complementary techniques like data profiling and data cleansing for a holistic approach. As your data ecosystem matures, stay informed about emerging trends in data quality tools and best practices.