Enterprise ETL Frameworks
Author
Larry Barnes, Stone Meadow Solutions LLC
Abstract
This is the first of two installments on ETL Frameworks articles. This first installment starts by making a case for why Enterprises need to develop and implement a Framework for Extract, Transform and Load (ETL) solutions. Next we cover concepts around ETL Frameworks and show some examples and benefits followed by a reference architecture examples and guidance on how to implement an ETL Framework including detail on execution components and dynamic configuration / initialization.
The second installment will cover data exception handling, and advanced topics including backing out ETL batches, detail on data and execution lineage, batch restart capabilities, ETL reporting and using ETL for Master Data Management solutions.
Copyright
This article makes references to the Stone Meadow Solutions ETL Framework which is ©2006-2008 Stone Meadow Solutions LLC, All rights reserved.
Audience
The primary audience for this article is architects, developers, analysts and managers working on data integration projects within organizations.
Background
The volume and scale of data integration within Enterprises has been steadily increasing over the years. The strategic role of data within enterprises is the key driver from the business. Contributing factors from the operations side are the proliferation of data within enterprises due to organizational complexity, adoption of LOB applications and increased need for data analytics.
The increased demand for data integration is directly related to this increase in data databases and database size. However, despite this need, the tools and building blocks supporting data integration within Enterprises have not kept pace with the increased data volume. This has resulted in increased Total Cost of Ownership (TCO) for enterprise’s ongoing data integration solutions.
Before continuing it’s useful to position ETL with other common data integration solution patterns.
Data Integration Patterns
Enterprise Application Integration (EAI) and Enterprise Information Integration (EII) are the other terms that you commonly hear today to represent different categories or patterns of integration.
The following table compares these integration patterns:
| ETL | EAI | EII | |
| Name | Extract, Transform and Load | Enterprise Application Integration | Enterprise Information Integration |
| Maturity Level | Mature | Mature | Emerging |
| Solution Type | Data warehousing, Reporting, Data refresh | Workflow, Business Process | Custom Applications |
| Integration Pattern | Data to Date | Application to Application, Business process | Data to Application |
| Data Latency | 1 day - 1 quarter | Near real time | Real time |
| Dataset size | Large data sets | Message based | None (Federated Data) |
| Dataset structure | Structured (OLE-DB, ODBC), Highly Structured (OLAP) | Structured (Messages) | Structured (OLE-DB, ODBC), XML, Semi-structured |
The
key to choosing a data integration solution typically involves the following
questions:
- What are the sources and destinations? ETL is the best choice if the destinations are databases and the sources are databases (some load their data into flat files first).
- What are the data volumes? If large, ETL is your best choice.
- How current does the data have to be? This is a tricky one. Most of the clients that I’ve worked with will start by stating that the data integration needs to be real time. The key here is to understand that real-time data integration results in far more complex solutions. ETL is choice if your data can be one day old. Also note that ETL is a perfectly good solution even in cases where data currency needs are less than one day.
ETL is the data integration choice if the data integration pattern is a “set-based” integration solution. I won’t spend any more time on this other than to mention that I've seen many developers choose an EAI solution like Biztalk for data integration patterns involving data sets over one million rows. This is a classic example of applying an EAI solution to an ETL pattern. If you do this, you'll end up with performance issues when you deploy.
That's just a very brief overview of integration patterns , now to the core content, i.e. ETL Frameworks.
ETL Framework Mission: Reducing Enterprise ETL TCO
An ETL Framework’s mission is to reduce Total Cost of Ownership (TCO) of an organizations active ETL projects. Note that the TCO has to be calculated over this extended period of time due to the fact that data integration solutions if many times in place for 5+ years.
ETL Frameworks: A Historical Perspective
The following slide presents a timeline that charts ETL’s growth and maturation over the years. Note that in my experience, the Total Cost of Ownership (TCO) for ETL Projects has continued to rise despite development costs decreasing…more on that later.
Roll your own / ETL Building blocks (…1993)
Data integration projects were custom development projects prior to the first generation of ETL products. During this period, the PC came of age and with it emerging standards like SQL, dynamic SQL and ODBC provided building blocks for ETL products. This provided the foundation for the move from mainframe to client/server applications.
This proliferation of client/server applications dramatically increased the number of relational databases within an organization. This in turn, increased the need for ETL applications within an organization.
1st Generation ETL – Code Generators
Most of the first generation ETL products were main frame centric and were predominately code generators. Prism, Carleton and ETI were some of the products during this time period. These products were very expensive and were not widely deployed within organizations.
2nd Generation ETL - Best of Breed
Data warehousing, new LOB applications, Internet money and wide adoption by consulting firms were all factors behind the emergence of the “Best of Breed” approach. The purchase of a Best of Breed ETL tool became a required purchase for not only large enterprises but also Internet start-ups.
3rd Generation ETL - Integrated BI Stacks
The Best of Breed era resulted in a proliferation of BI products within organizations. The downturn in IT spending post Internet bust resulted in organizations consolidating their BI product mix. This resulted in reduced revenues, lower profits and the consolidation within the BI industry.
IBM, Oracle, Microsoft, SAS and Cognos (now IBM) all have ETL products as part of their integrated BI stacks. Informatica is still an independent entity and the leader in the ETL space. However, the existence of an ETL product within all major vendors’ BI stacks will lead vendors to cross sell their offering instead of partnering with standalone ETL tools.
Trend - ETL Development Costs
The combination of cheaper hardware, more robust ETL products, decreasing ETL product costs and the growth of experienced ETL consultants have all contributed to lower ETL development costs.
So, why hasn't ETL project costs tracked the drop in hardware and software prices. This is due to the large up-front project costs for most ETL projects. Many times the requirements, analysis and QA phases are significant due to changing requirements, stove piped knowledge and changing source data. In addition, late-arriving source data can result in two development efforts, one for the expected data, one for the actual data.
Reducing the development time and costs for ETL solutions is another interesting topic but is out of the scope for this paper. Check on my web site, www.stonemeadowsolutions.com/library, for a follow up paper on this topic.
Trend – ETL Total Cost of Ownership
The total resources and costs to support an organization’s ongoing ETL processes has increased over the years. One reason is there have been more ETL deployments within organizations over time. However, I also submit that there hasn’t been enough time spent addressing the “Care and feeding” aspect of active ETL deployments.
Why is ETL TCO still high?
The following are some of the reasons that I’ve seen ETL TCO remain high within my client base over the years.
Note that People and Process are the most important factor in the success of any project…however changing the behavior of organizations is beyond the scope of this article. Instead we’ll focus on what IT resources can do to reduce the ongoing “Care and feeding” costs for ETL deployments.
How can ETL Frameworks help?
As mentioned above, our primary focus in this article will be on how an Enterprise ETL Framework can help to minimize the amount of IT and Business resources required for an organization’s ETL deployments and what that ETL Framework would look like. ETL Frameworks can assist with the issues listed above.
Creating a series of ETL templates will encapsulate best practices and help reduce your ETL development costs. More importantly the consistency and repeatability that the templates bring to ETL solutions also drive down operational costs. Combine this with an ETL Framework's support for robust logging, easy and flexible configuration and robust reporting and you'll start to see reduced on-going operational costs for your ETL solutions.
But, before we delve into the details of an ETL Framework, we need to cover some basic concepts.
ETL Concepts
Since most ETL processing is done on a regularly scheduled basis it makes sense to organize your execution by your enterprise ETL schedule. Note that two major factors determining an ETL schedule are source data availability and the destination system’s ability to process the source data. Also by grouping related activities together into one batch you now have a starting point for repeatable hands off processing.
The example above groups all end-to-end activities together for the Sales destination subject area. Data flows from the source system to the Raw and Stage processing areas into the Sales data mart and sales cube destinations.
Robust lineage provides significant benefits to the business. This is especially true now that data privacy issues and new regulations around data make it paramount that you can track a destination result back to its source data as well as the processes involved in creating the data.
How many of you are confident that your ETL destination databases will pass a data audit with a clean bill? No is the overwhelming response that I get from most of my past clients and prospective clients. Assuming that your database will never be audited isn't a wise solution to this problem especially given the issues listed above combined with the increasingly strategic nature of data within organizations.
Robust lineage supports audit-ability. You can deterministically show what processes produced destination data and the source data where it originated. It also allows you to “back-out” results that were created by fault ETL processes. Note that this is important because many times a business analyst will not notice data issues until day’s possible weeks after data has been loaded.
The ability to show the data and execution audit trail removes concerns around sensitive data being exposed. In addition, lineage also helps your overcome business user’s skepticism around data correctness.
Note that in the example above data and execution lineage columns are added to every table. The existence of these values allow dev, QA and business analysts to easily determine the instance of source data and execution instance that populated the row.
In the example above:
- The Data Lineage value (1) stays constant throughout the ETL process. This allows an analyst to easily look back to the source of the data during analysis.
- The Execution Lineage is different to each stage in the ETL process. This allows an analyst to easily locate what ETL process was responsible for generating the last instance of the data.
Note: The above approach assumes that the Source lineage ID represents the “Master” source record in cases where one destination row is populated from multiple sources rows. True data lineage would require a separate value per column loaded from a data source. This would be overly complex and impact performance.
Activities are a logical construct that encapsulates ETL processing. This construct is helpful for ETL Frameworks since it allows for the construct that supports ETL Frameworks and provides a “Contract” between the ETL Framework and the ETL developer.
In the example above, there are three activities: Cust_Src_To_Raw, Cust_Raw_To_Stage, Cust_Stage_To_Datamart. The ETL Framework logging facility will create 1 Activity record for each invocation of each of the above activities. This Activity record identifier then is used to populate the Execution Lineage ID column.
An alternate solution to creating an activity would be to use SSIS packages as the base implementation unit. This would require that there is only one dataflow per SSIS package. We will cover the pro's and con's of this approach later in this article.
It’s important to distinguish between errors and exceptions. Error’s impact processing and typically occur at a system or code level. Exceptions impact data and should be handled in a fashion that doesn’t stop all succeeding processing (except of course in cases where a majority of source data has exceptions).
ETL rollback, Batch restart and Exception life cycles are advanced topics that will be covered in another article.
Now that we’ve covered the concepts, it’s probably useful to create a scenario that demonstrates these concepts.
Scenario
The following simple scenario is used to demonstrate ETL Framework capabilities presented in this paper:
We’re extracting data from a source system using date ranges, moving it to a staging area and then to its final destination. During the process, we’ll move some source records into an exception database.
This is ETL at its most basic; however it is an end-to-end scenario which will allow us to cover the ETL Framework in more detail.
So far, we’ve covered why ETL Frameworks are needed and some basic ETL concepts underlying an ETL Framework. This next section will introduce the ETL Framework using reports within the Framework to demonstrate capabilities and benefits.
ETL Framework Reporting Overview
ETL Framework’s robust logging capabilities provide significant benefits, including the following:
- Complete transparency on batch execution results. It’s very easy to obtain summary information on all Batch processes. In addition, drill downs provide additional detail when required.
- Reconciliation of Source to destination data flows. Obtaining information on how records are dropped from ETL execution flow is usually a time consuming effort. Note in the ETL Framework we use following formula is used to reconcile source to destination processing: Records Read – Ignored – Exceptions = Rows Inserted + Rows Updated
- Data exception summary information. This provides a comprehensive list including counts of why records fell out of ETL batch process.
Below we present a scenario that an ETL operations resource might use to understand the current state of ETL processes within their environment.
The first report shows the batch execution instances that have occurred since the specified date…
The benefit of this report is that an ETL resource can quickly understand the state of an Enterprise’s ETL activity from a point in time. Of course, we only have one scenario, but notice how different execution instances completed or stopped because of system errors.
The first thing we might want to look at is why Batch# 1 failed. To do this we click on the Batch instance's Batch ID column and are taken to the following detail report…
Here we see that we renamed the variable cSrcExcDsc variable referenced within the AddLineage transform within the dfSrcToStg Dataflow. This gives the ETL resource a direction for diagnosing and solving this programming error. Note: This error shows up at runtime because we've set the DelayValidation flag to True. This allows the dynamic initialization of the ETL system configuration.
Notes:
- The ETL Framework notification facility will also email specified resources with an URL pointing to the execution instance within the above report.
- This demonstrates the benefit of drilldown capabilities, i.e. leading the ETL operator from summary to the lowest level of detail when required. SSRS provides built in capabilities for drilldown reports.
The next report provides detail on a success batch execution instance, #2…
Notes:
- We can see high level information about the batch as well as some detail
- We can also see whether the data reconciled
- We get detail on the SQL statement used for all Source to Destination activities within this batch
- We get detail on why data fell out of the ETL process due to exceptions
Finally, in this scenario, the ETL operator may want to drill down to more detail, in this case the Activity detail report…
In this report, the operator sees the ordered sequence of activities within one ETL batch. Note that although this report shows activity for a completed batch, it's very useful for documenting the current state within an active ETL batch. For an active ETL batch, there will be one or more activities with a status of “Active” not “Complete”.
Summary
In this section we’ve introduced ETL Framework concepts and have provided reports that demonstrate some of the ETL Framework capabilities. Hopefully these reports give you a feel for how implementing an ETL Framework can minimize the resources required for the ongoing “care and feeding” of an Enterprise’s ETL needs. This is especially important when errors occur, the data doesn’t reconcile and/or there are data exceptions.
In the next section we’ll overlay the ETL Framework architecture and the components of the SMS ETL Framework that implement these concepts.
ETL Framework Architecture
The diagram below illustrates different components within an ETL Framework architecture.
ETL Framework Architecture Overview
Execution Components
- Master Package – “Traffic cop” for an ETL Batch
- Execution Package – Provides workflow for Activities
- Container – Provides the ETL Framework plumbing for Activities
- Data Flow – Responsible for all ETL processing
- Helper Tasks –Common tasks used by multiple ETL processes
ETL Framework Capabilities
- Functionality supported by the ETL Framework. Implemented by a combination of the Master Package, Execution Packages and Containers.
ETL Framework Databases
- Configuration – Contains both runtime and system configuration information
- Exceptions – Receives all data exceptions rows that fall out during ETL processing
- Log – Populated with all Master Package, Activity and Error information from ETL Batch runs
- Audit – Populated with pre-update and deleted records from ETL processing
Utilities
- Reporting – Provides detailed information on ETL Batch execution
- Analytics – Provides metrics on ETL Batch execution
- Configuration UI – Allows ETL Operators to configure ETL batch run time and system configuration values
- Data Steward UI – Provides reports and utilities in support of Data Steward analysis and activities
Execution Components
This section provides more detail on the ETL Framework’s execution components starting with the two different types of packages.
Master Package
The following is a screen shot of the Master package for the Simple scenario reference application.
This Master package is a reusable template containing stock ETL Framework tasks. The customization required for a specific implementation consists of calling and controlling the flow of execution packages combined with the configuration and initialization of variables used by the execution packages.
ETL Framework tasks:
- CfgRunSys_EtlFwk – Initialize SSIS variables used in runtime and system configuration
- IniETLBat_EtlFwk – Initializes the ETL Batch run instance, get the ETL batch identifier
- WrtVars_EtlFwk – Writes run time and system configuration information for this execution instance into a table. This table’s is later queried by execution packages for initialization.
- UpdETLBat_EtlFwk – Updates the ETL Batch run instance state and completion date/time
- Notify_EtlFwk – Send an email alert with the batch processing information including links to reports
Implementation specific tasks:
- Exe_Pck_SrcStg – Call the Exe_Pck_SrcStg.dtsx execution package to read source files into the Staging database
- Exe_Pck_StgDst – Call the Exe_Pck_StgDst.dtsx execution package to read source files into the Staging database
Master Package Variables:
- ETL*– Variables used to hold the ETL Batch instance state including it's Identifier, version, Activity Id and execution state
- cfg* - Variables initialized from the ETL Framework's configuration database
- inp* - Variables that are passed to Execution packages
- v* - Variables used within Master package processing
Workflow notes
SSIS packages link tasks together to define workflows. By default this workflow checks for task completion status to determine the processing sequence. One common technique that I use with SSIS is to combine expressions with execution status to define more robust workflows.
The screenshot below shows an example of this. This is the workflow between The Placeholder script task and the CfgRunSys_EtlFwk task. Note the only purpose of the Placeholder task is to provide a starting point for the Precedence constraint shown below.
In this instance, we only continue execution if the ETL Framework Configuration Instance value, cfgRunId, initialized correctly, i.e. >= 0.
Note: The cfgRunId parameter is a required input for CfgRunSys_EtlFwk script task. This will be covered in more detail in the Configuration section later in this paper.
The screenshot below shows the workflow used in all other workflows where we use expressions. Workflow will only continue if the batch state is equal to 1 and the status is Success.
Note that the different states for the master package are:
| ID | Value |
| 0 | New |
| 1 | Active |
| 2 | Complete |
| 3 | Error |
Summary
The master package is the “traffic cop” responsible for all configuration, workflow and all notification functions for one ETL Batch. The master package uses a combination of stock ETL framework tasks, implementation specific tasks and utilizes both expressions and status checks for determining the package workflow. In our example above, other than invoking child packages, the only implementation specific tasks were the ones that loaded Source system control records.
Execution Package
The execution package is responsible for activity workflow and environment initialization. The screenshot below shows the workflow for the source to staging SSIS package for the simple reference application. Note that this Execution package contains one activity and the activity contains one data flow.
The SMS ETL Framework activity-centric approach supports more complex execution packages, i.e. packages with > 1 data flows. We’ll discuss execute package complexity shortly.
The following tasks are ETL Framework Stock components:
- InitVars_EtlFwk – Initializes SSIS variables configured by the master package
- IniCtl_EtlFwk - Create a Control record for this data flow
- IniAct_EtlFwk – Create an instance of an activity
- UpdAct_EtlFwk – Update the activity instance with the state and date/time
The SrcStg Container is a template containing stock event handlers and SSIS variables which define the “Contract” between the ETL Framework and data flows. This container is used to define activities within the ETL Framework. The dfSrcToStg data flow is responsible for data movement.
Runtime initialization
The execution package is responsible for initializing the environment for activities to run. This is done through a combination of SSIS package configurations and the InitVars_EtlFwk task which reads batch instance variables configured by the master package. The screenshot below shows the SSIS package configuration for the execution package.
Note that we utilize Environment variables to initialize the SMS ETL Framework database connection string. We use parent package variables to pass the Batch id and version number. All other variables are initialized by the InitVars_EtlFwk task which uses the ETL batch id and version number to read all of the configuration parameters set by the master package.
One question that is often asked is “what combination of SSIS package configurations and configuration tables should I use? My answer is:
- XML or Environment Variables: The SSIS Logging database (in our case this table is stored in the SMS ETL Framework database). Note that you will need to use a configuration other the Parent package variables due to the fact that SSIS binds to the logging connection manager before it reads the parent package variables.
- Parent Package Variables: All connection strings with passwords. This minimizes the number of places where sensitive passwords are stored.
- Parent Package Variables: Parameters that are input into the execution package initialize itself from a database configuration table. In this example the the ETL batch identifier and version number.
- Configuration table - All other parameters.
Most SSIS implementations that I’ve seen use XML configurations. That’s due to the fact that most configuration samples use XML and today’s developer is very comfortable with XML. The reason why I don’t typically use XML is that in my experience operations resources prefer database tables to XML files. So make sure that you meet with your ETL operations team prior to determining your SSIS configuration approach.
The operational lifetime for an ETL solution is many times longer than the development time so again it’s important to develop your solution with the on-going ETL operations team in mind.
Error Handling
The screenshot below shows the stock ETL Framework tasks used to process the OnError event. Note that the SetPropogateOff script task disables the propagation of errors from the execution package to the master package. This eliminates duplicate errors getting logged by both the execution and master packages.
Example package pattern: simple or complex?
One key implementation decision that you will need to make for your ETL solution is whether execution packages should be complex or simple? Put another way, should there be one data flow per execution package or many data flows?
Many experienced ETL architects and developers that I work with prefer the simple execution package approach, i.e. only one dataflow per package.
Simple package advantages:
- This allows developers to more easily work in parallel. Complex execution packages may result in one developer waiting for the other to finish with a package before they make their changes.
- Less confusion around SSIS variable scope. Both experienced
and novice SSIS developers will at some point in time
create a variable with the wrong scope. Setting all variables to a package scope removes this issue.
Simple package disadvantages:
- The master package workflow becomes more complex. Workflow around data flow sequencing has to be done somewhere. If it’s not in the execution package then it’s in the controller package. This means that there will be more developers modifying the master package, which will increase the likelihood of bugs.
- More packages means that there are more moving parts. This increases the likelihood that the SSIS deployment will fail when moving between environments. It also means that more packages will need to be modified when there are changes to stock tasks / components within each package as well as changes in package configurations.
Complex package advantages:
- Less packages to deploy and maintain.
- Encapsulate all data flow workflow within the execution packages. This reduces the number of developers who edit/modify the master package
Complex package disadvantages:
- Issues with SSIS data variable scoping
- Potential issues around multi-developer ETL projects…although my experience is that one developer will usually be responsible for everything within one package
My preference is to have less execution packages with more complexity in each. Yes, you’ll have issues around SSIS variable scoping…but creating more robust packages means that the master package is more straightforward and there’s less .dtsx files to worry about when it comes time to deploy to an environment of if there’s a need to make a modification to underlying ETL Framework tasks.
Also, consider your ETL team structure. In the larger ETL picture, one developer / architect is responsible for the “master package” and many ETL developers will be responsible for implementing the execution packages. Given this model, it makes sense to place all data logic in place which in turn makes for more complex execution packages.
The screenshot below demonstrates a more complex execution package. In this example we’re moving customer, agreement and transaction from a flat file source into a staging area.
Notes:
- This execution package invokes three activities in parallel. The template execution package for this scenario included only one activity, LoadFilesToSrc. We cut and pasted this activity to create the three instances shown above.
- We’re using a ADO.NET Foreach container to iterate on a Control record set initialized in this implementation's Master package.
- Each activity will load files based upon dates recorded in our control tables. Put another way, there may be 7 transaction files, one for each day of the week. Loading them into a control file allows us to sort them by date so that they can be processed in an ordered fashion.
- The equivalent “simple” execution package scenario would have each of the activities above in their own package.
Summary
Execution packages are invoked by the master package and are responsible for initializing the environment and controlling the workflow for ETL activities. Your implementation team should make the decision up front to build simple, i.e. one dataflow, or more complex execution packages. In addition, the developers should bring the operation’s team in early in the ETL project to get their preference for configuration format, i.e. XML or database tables.
Activity
An activity is a logic construct that is built around ETL data flows.
ETL Framework activities use SSIS Sequence containers to provide stock activity templates. The primary objective of an activity template is to provide the ETL Framework plumbing around a data flow. The ETL developer is responsible for placing their data flows within the activity template and initializing the SSIS ETL variables used by the ETL Framework to log detailed data flow information including row counts and amounts.
In our simple reference application, the SrcStg container shown below creates the activity, executes the data flow and then logs all source and destination information into the ETL Framework log database.
The screenshot below shows the OnPostExecute event handler for the UpdAct_EtlFwk task. This template first logs source and destination information and then logs one exception record for every exception encountered within the data flow. Note that you will need to invoke the exception logging task for every exception encountered in the data flow.
We’ll cover data exceptions in more detail shortly.
Data Flow
Data flows are responsible for all data movement and manipulation. Data flows are typically implemented using SSIS data flows, SQL tasks or a combination of the two. The screen shot below shows a data flow for the Source to stage operation in our simple reference application.
Note that data flows will vary depending upon the source, the destination and the amount of transformation required. The data flow above was created from a simple Source to Destination data flow template. The stock transformations include all RowCount transforms (ReadCnt, IgnoreCnt, ExcCnt, InsertCnt) and the AddLineage Derived column transform.
Note: This is a simple example of data exceptions, many times there will be multiple business rules within a data flow detecting and reporting data exceptions. This topic will be covered in more detail in a follow up article.
We’ve added two Conditional transformations to demonstrate where an ETL developer would log ignored rows and exceptions. Remember that to support automatic reconciliation we need to calculate:
(Rows read – ignored – exceptions) = (Rows inserted + updated)
Note: Most ETL implementations don’t take ignored rows into account and in doing so remove any chance of an automated reconciliation processes.
The RowCount transforms are examples where the ETL developer initializes the following SSIS variables: cSrcReadCnt, cSrcIgnCnt, cDstInsCnt, cSrcExcCnt. These variables are used by the LogSrcDst_EtlFwk task above to initialize Activity Source and Destination records.
Adding RowCount transforms does add additional complexity to an ETL project. However the benefits are substantial, in this case it allows us to automatically implement reconciliation (see the Batch Detail report above).
Finally, note that we have a Multicast transform which allows us to update our Control record with the Maximum date value obtained in this particular data flow instance. This is a common technique used in "Pull" Source ETL implementations. This allows the ETL Framework to then use this Maximum date value as the Minumum date for the next iteration of this data flow. Other variants of this pattern may use an increasing key value as a filter.
The other common Source system extract scenarios "Push" rather than "Pull". One "Push" instance is file based where a load file is created (typically with a control record containing high level info including record counts.
The other "Push" instance is where a Control table is loaded with filter values which then allow the ETL program to select only those rows that were published by the push application.
Summary
The data flow is where most of the ETL implementation specific logic will reside. The goal of the ETL Framework is to let an ETL developer to either start with a stock activity template or to implement the dataflow and then paste it within the stock activity template. Note that the second approach will require the ETL developer to add the RowCount transforms which populate the counts required by the ETL Framework in support of reconciliation.
Execution Component Summary
The ETL Framework is developed using a combination of Master Packages, Execution Packages, Activity Containers and Data Flows. These execution components combine standard ETL Framework tasks and transforms along with implementation specific components to implement basic and advanced ETL Framework capabilities.
The implementation specific ETL tasks and transforms will leverage standard “templates” which ensure consistency, reduce the time required to develop ETL solutions while providing ETL Framework features.
Note that these templates require the underlying ETL Framework database which contains the stored procedures used to populate the ETL Framework logging tables.
Configuration and Initialization
Background
Many times a significant amount of ongoing TOC for an ETL solution centers on the need to change code in support of ETL different environments and run-time options. Given this, the goal for all ETL solutions should be a “hands-off” approach, especially when the solution is promoted to production and comes under the control of ETL operations.
Implementing a “Hands off” ETL solution requires the ETL code to dynamically configure both system parameters as well as run-time parameters, i.e. values unique to one ETL batch execution instance.
System parameters are typically used to configure SSIS connection managers including database connections, file directories and files.
Examples of Run-time parameters are date ranges or key ranges for "Push" source extracts as well as the system configuration instance, e.g. DEV, QA, PROD. In addition they can include parameters which indicate which packages and tasks for execution for partial or “one-off” batch runs. For example, testing the Stage to Production package in QA requires the ability to skip over the source to stage package.
Configuration and Initialization can be divided into two components:
- Dynamic initialization. Using SSIS variables, expressions and script to dynamically initialize SSIS connection managers as well as variables that are used to control execution flow.
- Configuring the SSIS variables used in the dynamic initialization above.
Approach
The ETL Framework stores its configuration information in a database and queries this database at run time to configure the solution. Note that many ETL solutions utilize SSIS’s ability to read in XML configuration files for this purpose. This approach works, however my experience with clients has shown that many operations groups prefer the centralized database approach over a more decentralized file based approach.
The following shows the command line syntax for invoking the simple scenario:
dtexec /FILE "C:\ETLFramework\ETLFwk_Simple\EtlFwk_Simple\MP_Simple.dtsx" /MAXCONCURRENT " -1 " /CHECKPOINTING OFF /REPORTING V /SET "\Package.Variables[User::cfgRunID].Value";1
Note that in the example above we point to the master package (MP_Simple) and set the run time configuration variable (cfgRunID) to 1. This identifier is the primary key into our run-time configuration table.
The following report shows both the run time and system configurations for this scenario…
Note that:
- We’re setting the extract date that we’ll use when determining which source data to fetch
- We’re setting our system configuration environment to DEV
- The DEV system configuration initializes the file directory where the ETL packages reside as well as the database connection information required by the master and execution packages
- Our configuration consists of name/value pairs where the name represents the SSIS variable within the master package. Notice the naming convention where all configuration parameters start with cfg, which allows SSIS to display all configuration parameters together.
Notice that the Master package screen shot earlier in this article has one variable for every "Name" column listed above. We use this Name/Value concept for both the Master package configuration as well as Execution package initialization. The following code snippet, taken from the InitVars_EtlFwk script, is used to initialize the execution package variables. We basically build a record set using the ETL Batch ID and version as filters and then iterate through this record set using the "Name" column to identify the SSIS variable and the "Value" column to initialize it.
Try
Dim cnETLFwk As String = Dts.Variables("cnETLFwk").Value.ToString()
cnCfg = New SqlConnection(cnETLFwk)
Dim ETLBatId As String = Dts.Variables("ETLBatId").Value.ToString()
Dim ETLVerId As String = Dts.Variables("ETLVerId").Value.ToString()
Dim sqlCmd As String = "SELECT Prm_Nm, Prm_Val FROM [LOG].ETL_Pck_Prm
WHERE ETL_Bat_Id = " + ETLBatId + " AND ETL_Bat_Ver =" + ETLVerId
cnCfg.Open()
Dim cmd As New SqlCommand(sqlCmd, cnCfg)
Dim rdr As SqlDataReader = cmd.ExecuteReader()
cmd.Connection = cnCfg
While rdr.Read
Try
varNm = rdr(0).ToString()
If Dts.VariableDispenser.Contains(varNm)
= True Then
Dts.VariableDispenser.LockForWrite(varNm)
Dts.VariableDispenser.GetVariables(vars)
If
varNm.Contains("int") Then
vars(varNm).Value = CInt(rdr(1).ToString())
Else
vars(varNm).Value = rdr(1).ToString()
End If
vars.Unlock()
End If
Catch ex As Exception
vars.Unlock()
End Try
End While
End Try
Bootstrapping configuration
The MP_Simple master package requires that ETL Framework database to be initialized before it can start reading the parameters stored in the database. The approach we use involves the following steps:
- Use Expressions to dynamically build the SMS_ETL_Framework SSIS Connection property
- Use package configurations to initialize variables used in the variables expression above
The following screenshot shows the SMS_ETL_Framework Connection’s expression screen, this is where we use the Server name and database name parameters passed in through SSIS configurations.
The screen shot below shows how we use package configurations and Windows environment variables to initialize the server and database names.
Note: Many SSIS implementations use XML configurations which are perfectly valid, our selection of Environment variables is based upon our experience that DBA's and Infrastructure resources are less familiar and comfortable with XML than developers. In addition, XML files can always be located in the wrong location while environment variables will always be in the same place.
Note: We could also have initialized the provider, but have in this simple scenario we’ve set the cfgProviderNm variable in the package to always be SQLNCLI.1.
Once we have a ETL Framework connection, then all configuration parameters are read from the SMS_ETL_Framework configuration table.
Note: For all connections we use a similar pattern. For example, here’s the expression used to initialize the Simple database’s inpCnSimple variable:
“Data Source=” + @[User::cfgSQL2005SrvNm] + “;Initial Catalog=” + @[User::cfgSimpleDbNm] + “;Provider=” + @[User::cfgProviderNm] + “;Integrated Security=SSPI;Auto Translate=False;”
Tip
One question that I get is “How do I know how to initialize the Connection string? This is especially problematic for some connections, like Excel spreadsheets. What I do is first to create a Connection manager using the Connection manager screen.
Once you create the connection, then go into the Connection’s property manager and copy the string in the ConnectionString value as shown below…
The last step is to copy the ConnectionString into the variable’s expression window and then substitute the SSIS variables with the text (see the inpCnSMSETLFramework expression above).
The last example shows how we dynamically set the Execution package’s Connection String to allow us set the directory at runtime. The following expression is used to initialize the ExePck_SrcStg.dtsx ConnectionString property.
@[User::cfgDtsxDirNm] + "ExePck_SrcStg.dtsx"
Hopefully by now you can see the benefit of SSIS expressions. It provides a very flexible mechanism for dynamic initialization without having to write script tasks.
Finally note that MP_Simple package is responsible for creating all Database connection strings. These connection strings are then passed to each Execution package through the Configuration database table…except for the Connection strings with passwords; these are passed to the Execution package by using the Parent package variable configuration.
Summary
A “Hands-off” approach should be the objective for every ETL solution targeted for a production environment. Achieving this requires you to: first use variables and expressions to dynamically configure all connections and workflow variables, second use SSIS and custom configuration to populate these variables at runtime. When determining SSIS configuration options, make sure that you loop your operations team into the discussion. Many operations teams are not fluent in XML so configuration tables are a preferred approach.
ETL Framework: Team structure, Exceptions, Advanced Topics
That concludes this first article on ETL Frameworks. The next article will focus on additional topics related to ETL Frameworks including team structures, data exceptions and other advanced topics