ETL: Transformation Layer - A Tester's Perspective

As a tester, we need to ensure the accuracy and consistency of information flowing through the ETL pipeline. In the previous article we explored testing of the extraction process; at the heart of the pipeline lies the transformation layer, where raw data gets sculpted into meaningful insights. Today, let's put on our testing hats and explore strategies for testing the transformation layer of ETL.

The Transformation Stage: Where Raw Becomes Refined

Imagine a bustling logistics company like Acme Deliveries. Orders pour in from various sources – emails, phone calls, online forms – each with its own data format. To understand customer preferences and optimize operations, Acme needs to extract this data from different sources and transform it into a unified format for their data warehouse.

Figure: ETL testing of the transformation layer, illustrated with a logistics company example.

This is where the transformation layer works its magic, cleaning, standardizing, and manipulating data to meet specific needs.

Transformation Types: A Tester's Toolkit

Let's see the different types of transformations that can happen in the middle layer of an ETL pipeline and how they can be tested. The full list is quite long, so let's go over a few of the most common types:

1. Filtering:

Now let's assume Acme wants to analyze only shipped orders. That requires the ETL to filter only shipped order records from the source. It can be verified in two different ways.

  • Validating: It can be tested by ensuring only orders with "Shipped" status are included in the output table (a count reconciliation sketch follows the queries below).

  • Checking exclusions: Verifying no "Pending" or "Cancelled" orders slip through.

-- Both queries should return zero rows when run against the transformed output table
SELECT * FROM output_orders WHERE order_status != 'Shipped';
-- or
SELECT * FROM output_orders WHERE order_status IN ('Pending', 'Cancelled');
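
The queries above cover the exclusion side. For the inclusion side, a simple sketch (assuming the raw records live in a source_orders table, as in the mapping example below) is to reconcile the counts of shipped orders between source and output; the two counts should match.
-- Counts of shipped orders should match between source and output
select
  (select count(1) from source_orders where order_status = 'Shipped') as source_count,
  (select count(1) from output_orders where order_status = 'Shipped') as output_count;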

2. Data Mapping:

Data mapping is simply converting data from one format to another, like changing units or mapping values to codes.

For example, the status of deliveries needs to be mapped to a code, e.g. 'In Transit' => 'T' or 'Delivered' => 'D'.

  • Validation: Verify by joining the source and output tables and validating that each status is mapped correctly in the output table.
-- Should return zero rows if the 'Delivered' => 'D' mapping is applied correctly
select s.orderid, s.order_status, o.order_status
from source_orders s
join output_orders o on s.orderid = o.orderid
where s.order_status = 'Delivered' and o.order_status != 'D';
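
The query above covers a single mapping. As a rough sketch, all mappings can be checked in one pass with a CASE expression built from the business-defined codes (only 'T' and 'D' from the example are shown; the remaining entries are assumptions to be filled in). The query should return zero rows.
-- Every mapped status in the output should match the expected code for its source status
select s.orderid, s.order_status as source_status, o.order_status as mapped_status
from source_orders s
join output_orders o on s.orderid = o.orderid
where o.order_status != case s.order_status
                            when 'In Transit' then 'T'
                            when 'Delivered'  then 'D'
                            -- add the remaining business-defined mappings here
                        end;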

3. Cleaning:

Cleaning involves tasks like correcting formatting inconsistencies, handling missing values, and removing duplicates. The company may receive order data from various sources, such as physical stores, online channels, and third-party marketplaces, and might need to deduplicate the orders to make analysis more accurate.

  • Validation: This can be verified by making sure there are no duplicate orderid values in the output table.
-- Should return zero rows if deduplication worked
select orderid, count(1) from output_orders
group by orderid having count(1) > 1;
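
Handling of missing values can be checked in a similar way. A minimal sketch, assuming order_date is a hypothetical mandatory column in the output table; the query should return zero rows.
-- order_date is a hypothetical mandatory column, used only for illustration
select orderid, order_date from output_orders
where orderid is null or order_date is null;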

4. Aggregation:

It is a key ETL transformation that combines data points, such as creating summaries, to generate meaningful insights. Now suppose Acme Deliveries wants to analyze overall delivery performance by region, calculating average delivery time, total deliveries, and on-time delivery percentages.

Let's see how it can be verified.

-- Count of mismatching regions; the expected result is 0
-- (100.0 avoids integer division when computing the percentage)
with exp_value as (
  SELECT region,
         AVG(delivery_time) AS avg_delivery_time,
         COUNT(*) AS total_deliveries,
         (SUM(CASE WHEN delivery_status = 'on_time' THEN 1 ELSE 0 END) * 100.0 / COUNT(*)) AS on_time_percentage
  FROM delivery_data
  GROUP BY region
)
select count(1) from deliveries_performance a
join exp_value b on a.region = b.region
where a.avg_delivery_time != b.avg_delivery_time
   or a.total_deliveries != b.total_deliveries
   or a.on_time_percentage != b.on_time_percentage;
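
On top of the value comparison, it is worth confirming that no region gets dropped during aggregation. A small completeness sketch follows; the query should return zero rows.
-- Every region present in the source data should also appear in the aggregated output
select d.region
from (select distinct region from delivery_data) d
left join deliveries_performance p on d.region = p.region
where p.region is null;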

5. Joining:

It is a transformation that merges data from multiple tables based on shared attributes.
For example, the company wants to combine product detail data with the orders table, which only has product_id.

  • Validation: Ensuring accurate linking between Product ID in both tables.

  • Data integrity: Verifying no missing or duplicate records after joining (see the sketch after the query below).

-- Should return zero rows if the join linked the correct product details to each order
select o.orderid, p.product_id, o.product_desc from output_orders o
join product_details p on p.product_id = o.product_id
where p.product_desc != o.product_desc;
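
For the data-integrity check mentioned above, a minimal sketch: a left join flags orders whose product_id has no match in product_details (missing records), and a grouped count flags duplicates introduced by the join. Both queries should return zero rows.
-- Orders whose product_id has no matching product record
select o.orderid, o.product_id
from output_orders o
left join product_details p on p.product_id = o.product_id
where p.product_id is null;

-- Orders duplicated after the join
select orderid, count(1) from output_orders
group by orderid having count(1) > 1;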

6. Data Derivation:

Derivation is creating new data points based on calculations or formulas applied to existing data. Suppose Acme offers discounts for high-value orders, but in the source we only have order_amount and the discount percentage. It wants to add discount_order_amount to the output table.

  • Validating: Verifying the discount amount is applied correctly based on the order amount.
-- Should return zero rows; assumes discount is stored as a fraction of order_amount
select order_amount, discount, discount_order_amount from output_orders
where order_amount > 100
and discount_order_amount != order_amount * discount;

7. Standardisation:

It is the process of ensuring consistent data formats and values across different sources. For instance, in the CRM there is no defined format for customer names, and the business wants them standardised to title case. Let's see how we can verify it.

  • Validating: Verify that all values in the customer_name column follow title case. It can be done using regex.
-- Should return zero rows; exact regex syntax varies by database
select customer_name from output_customers
where not regexp_like(customer_name, '^([A-Z][a-z]*)( [A-Z][a-z]*)*$');

Best Practices for Transformation Testing:

There are no specific best practices that guarantee success; you have to come up with your own strategy based on the application under test, deadlines, and so on. But a few of them can help you in testing the transformation layer of an ETL:

  • Stage your tests: Run tests in isolated environments that don't have frequent deployments or ETL jobs running without your awareness.

  • Map rules to test cases: Clearly define how each transformation rule translates into specific test scenarios.

  • Involve stakeholders: Collaborate with developers, data analysts, and business users to ensure tests align with business needs.

  • Learn Regex: Regex is a great tool to validate transformation rules and criteria on the output (see the small sketch after this list).

  • Data-driven testing: Test with diverse datasets. Sometimes actual production data does not contain bad or corrupt records, but that doesn't guarantee it won't in the future. Always test your transformations with all possible datasets.

  • Positive and negative testing: Verify correct behaviour not only for positive scenarios but for negative ones as well.

  • Consider performance testing: Evaluate the impact of transformations on processing time and resource usage, especially for large datasets. We have seen ETLs where ingestion happens daily and each ingestion job runs for 10 hours; that is not practical at all.

  • Document your testing process: Clearly document test cases, assumptions, execution steps, and expected results.

  • Test the requirement, not the implementation: Although it is very helpful to understand how a transformation was implemented in the ETL, your focus should be on the requirement (well, that is true for all testing, not only ETL :D).
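
As a small illustration of the regex point above, here is a sketch assuming a hypothetical delivery_date column stored as text in YYYY-MM-DD format; rows violating the format should come back, and the expected result is zero rows.
-- delivery_date is a hypothetical text column used only for illustration
select orderid, delivery_date from output_orders
where not regexp_like(delivery_date, '^[0-9]{4}-[0-9]{2}-[0-9]{2}$');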

Conclusion:

Testing the transformation layer is crucial for ensuring data quality and driving informed decision-making in any ETL process. By understanding the different transformation types, applying targeted testing strategies, and following best practices, testers can ensure a seamless flow of clean, corruption-free information through the pipeline. Remember, quality data is the fuel for effective business insights and strategic success. Let's learn and improve together.