I always hesitate to write about anything involving someone else’s code, lest the favor get returned. In this case however, one must have some sympathy for the developer, as the source system for a particular ETL job was not yet in production. He had to write his packages against highly idealized, hypothetical test data. Surprise! Surprise! The real data causes some issues.
My predecessor followed a standard set of steps for any SSIS work for loading a data mart:
1. Get source data and load it to a staging table
2. Select data from the staging table relevant to the fact table to be populated
3. "Look up" relevant foreign keys using data from the staging table - this creates the correct referential integrity between the fact and dimension tables in the data mart. This was done with a look up task in SSIS.
4. Load the fact table with the results of the prior two steps.
Once we got into “real” data, the issues associated with look up tasks became apparent. We repeatedly saw errors in the look ups. The error, however was extremely vague and provided no direction or ability to track down the issue. It was “Row yielded no match during lookup.”
The first source of these errors were the differences between the datatypes and collations used in the actual source system, in this case Oracle and the target system, SQL Server 2005. When we used a simple SQL join, the joins also failed with collation errors. It was not until we used a join with the appropriate collation logic that we were able to actually “look up” the data as follows:
SELECT DISTINCT <Columns>
FROM My.STAGING_Table S left outer join dbo.dimension D on S.Unique = D.Unique COLLATE Latin1_General_CI_AS
Other look up errors, frankly make no sense to me, as the data types and collations were the same and SQL “look ups” did not require data conversions or collation hints. The simple “lookup” works fine
SELECT *
from STAGING.myTable
where uniqueString in (select uniqueString FROM dbo.DimensionTable;
(Yes I know this is sloppy syntax. It was a down and dirty test!)
In the end we replaced data transfer tasks with multiple look ups with a SQL selects and inserts using OLEDB sources and destinations. Every lookup encountered could be replaced with a join or a common table element. This provided several advantages. First, all of the data required was obtained with a single set operation, instead of multiple trips to the database to lookup values. Second, we had much more insight into exactly where potentially “bad” rows of data existed, so we could handle those errors far more elegantly. Third, the operations actually ran slightly faster.
Another reason to use those old school set operations!