Readme_Data Cleaning Package Sample

11/05/2008 21:36:06


This sample works only with SQL Server 2005 and SQL Server 2008. It will not work with any version of SQL Server earlier than SQL Server 2005.
This sample works with the SQL Server 2005 version of the AdventureWorks OLTP database. To install this database, see Sample Databases for Microsoft SQL Server 2008.
The Data Cleaning sample is a package that cleans data. The package uses data that is a list of names and addresses that represent potential customers. The data requires cleaning; it contains spelling errors, is missing information, and includes customers already in the database, incorrect customers, or multiple subtly different instances of the same customer.
The package control flow consists of the following tasks:
  • An Execute SQL task that creates the input table, CustomerLeads, and creates the three output tables named ExistingCustomerLeads, NewCustomerLeads, and DuplicateCustomerLeads.
  • A Data Flow task that performs the cleaning of the data that is extracted from the CustomerLeads table. The data flow identifies unique new, existing, and duplicate customers, and writes the rows of each customer type to the appropriate output table.
  • Another Execute SQL task that runs a parameterized SQL statement that returns a row count for the ExistingCustomerLeads table.
  • A Script task that displays the row count value.
If you run the sample on a non-English version of Windows, you may have to substitute the localized name of the Program Files folder to open or run the sample.

Note:
This sample uses the Fuzzy Grouping and Fuzzy Lookup transformations, which are available only in the Enterprise version of SQL Server.




Important:
Samples are provided for educational purposes only. They are not intended to be used in a production environment and have not been tested in a production environment. Microsoft does not provide technical support for these samples.



To learn more about data cleaning, search for the following articles in the MSDN Library:
  • Data Cleansing Applications with SQL Server Integration Services (Windows Media Video)
  • Data Cleaning using the Fuzzy Grouping and Fuzzy Lookup Transformations (white paper)

Requirements

Running this sample package requires the following:
  • You must have installed and have administrative permissions on the AdventureWorks OLTP database.
  • If you intend only to run the sample package from the command line, you must install Integration Services.
  • If you intend to open the package in SSIS Designer and run the sample package, you must install Business Intelligence Development Studio. For more information about how to install samples, see "Installing Sample Integration Services Packages" in SQL Server Books Online.

Location of the Sample Package

If the samples were installed to the default installation location, the Data Cleaning package is located in the following folder:
C:\Program Files\Microsoft SQL Server\100\Samples\Integration Services\Package Samples\DataCleaning Sample\Data Cleaning\.
The following files are required to run this sample package.

File                Description
DataCleaning.dtsx   The sample package.
CreateTables.sql    SQL statements to create tables.


Adding Data Viewers to the Sample

To better understand how the Data Cleaning package works, you can add data viewers to the data flow and then view the data as it moves between data flow components. We recommend that you add data viewers to the following paths:
  • Path from Union All to OLE DB Destination - Existing Customers
  • Path from Conditional Split on Canonical Record for Group to OLE DB Destination - Unique Customer Leads
  • Path from Conditional Split on Canonical Record for Group to OLE DB Destination - Duplicate Customer Leads
To add data viewers
  1. Right-click the path and then click Data Viewers.
  2. In the Data Flow Path Editor, click Add.
  3. In the Configure Data Viewer dialog box, click Grid in the type list. By default, all columns display in the data viewer.
  4. Repeat steps 1-3 for other paths.

Running the Sample

The package can be run from the command line by using the dtexec utility, or can be run in Business Intelligence Development Studio.
If you are using a non-English version of Windows, or if you have installed the samples to a non-default location, you may have to update the ConnectionString property of any file connection managers used in the package to run the sample package successfully. Verify that the path used in the connection manager is valid on your computer, and if required, modify the path so that it uses the correct path to the sample files.
For this sample, you may have to update "Program Files" in the ConnectionString property for the CreateTables.sql connection manager.
To run the package by using dtexec
  1. Open a Command Prompt window.
  2. Change the directory to C:\Program Files\Microsoft SQL Server\100\DTS\Binn, the location of dtexec.
  3. Type the following command: dtexec /f "C:\Program Files\Microsoft SQL Server\100\Samples\Integration Services\Package Samples\Data Cleaning Sample\DataCleaning\DataCleaning.dtsx"
  4. Press Enter.
For more information about how to run the package by using the dtexec utility, see the topic "dtexec Utility" in SQL Server Books Online.
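On a localized or non-default installation, the command in step 3 needs a different folder prefix. The following is a minimal sketch, not part of the sample, that builds the dtexec argument list from the Program Files location instead of hard-coding it. The function name build_dtexec_command is hypothetical, the relative path is copied from step 3, and reading the ProgramFiles environment variable is an assumption about how you want to locate the folder.

```python
import os
from pathlib import PureWindowsPath

# Relative path copied from step 3 above; adjust it if the samples
# were installed somewhere else.
PACKAGE_RELATIVE = PureWindowsPath(
    r"Microsoft SQL Server\100\Samples\Integration Services"
    r"\Package Samples\Data Cleaning Sample\DataCleaning\DataCleaning.dtsx"
)


def build_dtexec_command(program_files: str) -> list[str]:
    """Return the dtexec argument list for the sample package.

    build_dtexec_command is a hypothetical helper for illustration only.
    """
    package = PureWindowsPath(program_files) / PACKAGE_RELATIVE
    return ["dtexec", "/f", str(package)]


if __name__ == "__main__":
    # Assumption: the ProgramFiles environment variable points at the
    # Program Files folder on this machine; fall back to the English default.
    root = os.environ.get("ProgramFiles", r"C:\Program Files")
    command = build_dtexec_command(root)
    print(" ".join(command))
    # To run the package, pass the list to subprocess.run(command, check=True)
    # on a computer where dtexec is on the PATH.
```

Passing the arguments as a list avoids having to quote the spaces in the package path by hand.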
To run the package in Business Intelligence Development Studio
  1. Open Business Intelligence Development Studio.
  2. On the File menu, point to Open, and then click Project/Solution.
  3. Locate the DataCleaning Sample folder, and then double-click the file named DataCleaning.sln.
  4. In Solution Explorer, right-click DataCleaning.dtsx in the SSIS Packages folder, and then click Execute Package.
Note:
If you open the package in SSIS Designer and view the package properties, you will notice that the DelayValidation property is set to True. Validation of the package must be delayed because some tables used by the Data Cleaning sample package (the input table, CustomerLeads, and the three output tables named ExistingCustomerLeads, NewCustomerLeads, and DuplicateCustomerLeads) are not created until the first time the package runs. If DelayValidation is set to False, a validation error occurs when you open the package in SSIS Designer before running the package.



Components in Sample

The following table lists the tasks, containers, data sources and destinations, and transformations that are used within the sample.

Element Purpose
Execute SQL task The Execute SQL task is named Create Customer Address Reference Table View, Populate NewCustomer Input Table and Create Output Tables. This Execute SQL task runs a stored procedure that creates the input table, CustomerLeads, and also creates the three output tables, ExistingCustomerLeads, NewCustomerLeads, and DuplicateCustomerLeads. For its return code, the stored procedure returns either a value of 0 or 1. A value of 0 indicates that the stored procedure ran successfully; a value of 1 indicates that the stored procedure did not run successfully. The task stores the return code in a package variable, ReturnCode. Between the Execute SQL task and the Data Flow task is a precedence constraint. This precedence constraint allows the Data Flow task to run only if the following conditions are true:
  • The Execute SQL task finishes successfully.
  • The ReturnCode variable contains a value of 0.
Data Flow task The Data Flow task, Fuzzy Lookup Data Flow Task, executes the data flow in the package.
OLE DB source The OLE DB source, OLE DB Source - Customer Leads, reads records from the CustomerLeads table.
Lookup transformation The Lookup transformation, Lookup against Existing Customers, performs an exact lookup to identify existing customers. The Lookup transformation compares each row of customer records in the input table, CustomerLeads, against the entries in the Lookup reference dataset, and then does one of the following actions:
  • If the row has a matching entry in the reference dataset, the transformation directs the row into the ExistingCustomerLeads table.
  • If the row has no matching entry in the reference dataset, the transformation directs the row to the Fuzzy Lookup transformation, Fuzzy Lookup against Existing Customers, for further comparison.
The Lookup transformation runs in Partial cache mode, and caches rows that have no matching entries in the reference dataset.
Derived Column transformation The Derived Column transformation, Derived Column, adds the _Similarity columns to each row and sets the column value to 1.
Fuzzy Lookup transformation The Fuzzy Lookup transformation, Fuzzy Lookup against Existing Customers, performs a fuzzy lookup to identify customer records that are fuzzy matches of existing customer records. The transformation adds a _Similarity column that contains a similarity score to each row. A score of 0.0 means no match was found, whereas a score of 1.0 means an exact match was found. A score between 0.0 and 1.0 is a measure of similarity in which a value closer to 1.0 indicates greater similarity.
Conditional Split transformation The first Conditional Split transformation, ConditionalSplit on Similarity, directs input rows to one of two outputs depending on the value of the similarity score determined by the fuzzy lookup. Rows with a similarity score >= 0.70 are written to the ExistingCustomerLeads table. Rows with a similarity score < 0.70 are probably valid new customer leads, and additional cleaning is done on these rows. The second Conditional Split transformation, Conditional Split on Canonical Record for Group, directs input rows to one of two outputs depending on whether the data row is a duplicate. If the values of the _key_in and _key_out columns are equal, the row is used as the canonical row in the group, and the canonical row is inserted into the NewCustomerLeads table. If the _key_in and _key_out columns are not equal, the row is treated as a fuzzy duplicate and the row is inserted into the DuplicateCustomerLeads table.
Union All transformation The Union All transformation, Union All, merges rows of existing customers—both exact and fuzzy matches—into one dataset.
Fuzzy Grouping transformation The Fuzzy Grouping transformation, Fuzzy Grouping, groups customers who are likely duplicates. The transformation adds three columns, _key_in, _key_out, and _score, to each row. _key_in is a unique identifier assigned to each input row, and _key_out contains the particular _key_in assigned to the row that best represents all the rows in a fuzzy group. All rows in a fuzzy group have the same _key_out value. The _score column is a value between 0.0 and 1.0 that describes the textual similarity between a given input row and the row selected to be the canonical value.
OLE DB destinations The OLE DB destination, OLE DB Destination - Existing Customers, inserts rows into the ExistingCustomerLeads table. The OLE DB destination, OLE DB Destination - Unique Customer Leads, inserts rows into the NewCustomerLeads table. The OLE DB destination, OLE DB Destination - Duplicate Customer Leads, inserts rows into the DuplicateCustomerLeads table.
Execute SQL task The Execute SQL task, Return Row Count for ExistingCustomersLeads table, runs a parameterized SQL statement. The statement returns a count of the rows in the ExistingCustomerLeads table where the _Similarity column has a value of 1. The SQL statement stores this row count as a Single row result set in a package variable, RowCount. The _Similarity column value is stored in a package variable that is mapped to the parameter in the SQL statement. A _Similarity value of 1 means that there is an exact match between the customer record in the input table, CustomerLeads, and the customer record in the Lookup reference dataset.
Script task The Script task, Display Row Count, displays the row count value for the ExistingCustomerLeads table, based on the value stored in the RowCount package variable.
File connection manager The File connection manager, CreateTables.sql, connects to the file that contains the SQL the package uses.
OLE DB connection manager The OLE DB connection manager, (local).AdventureWorks, connects to the AdventureWorks database on the local server.
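To make the key and score columns that Fuzzy Grouping adds more concrete, here is a minimal sketch that assigns comparable values to a short list of names. It is not the algorithm that the transformation uses: the standard-library difflib ratio stands in for the real similarity measure, the first row seen in a group is simply treated as canonical, and the function name, the 0.8 threshold, and the dictionary keys are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_group(names, threshold=0.8):
    """Assign key-in, key-out, and score values to each name.

    Illustrative stand-in only: difflib's ratio replaces the real
    similarity measure, and the first row of a group is the canonical row.
    """
    canonicals = []  # (key_in, name) pairs for rows chosen as canonical
    rows = []
    for key_in, name in enumerate(names, start=1):
        match = None  # best (canonical key, score) at or above the threshold
        for canon_key, canon_name in canonicals:
            score = SequenceMatcher(None, name.lower(), canon_name.lower()).ratio()
            if score >= threshold and (match is None or score > match[1]):
                match = (canon_key, score)
        if match is None:
            # No sufficiently similar canonical row: this row starts a new group.
            canonicals.append((key_in, name))
            match = (key_in, 1.0)
        rows.append(
            {"name": name, "_key_in": key_in, "_key_out": match[0], "_score": match[1]}
        )
    return rows


if __name__ == "__main__":
    for row in fuzzy_group(["Jon Smith", "John Smith", "Maria Lopez"]):
        print(row)
```

In this sketch a row whose key-out value equals its own key-in value is the canonical row of its group, which is the same test that the second Conditional Split transformation applies.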


The following table describes the data in the output tables.

Table Description
ExistingCustomerLeads Contains records that exactly match an existing customer, and records that fuzzily match an existing customer with very high textual similarity.
NewCustomerLeads Contains records for which there was no good match to an existing customer. If the list contained multiple instances of the same name, or a highly similar version of a particular name, only one record will be directed to NewCustomerLeads, and the duplicates will be directed to DuplicateCustomerLeads.
DuplicateCustomerLeads Contains duplicates of new customers.
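The way rows end up in the three output tables can be summarized as a small decision function. This is a sketch of the routing rules described for the two Conditional Split transformations, not code taken from the package: the function name route and its parameter names are hypothetical, and the 0.70 threshold is the one stated in the component table.

```python
SIMILARITY_THRESHOLD = 0.70  # threshold stated for the first Conditional Split


def route(similarity, key_in, key_out):
    """Return the name of the output table for one cleaned row.

    similarity is the fuzzy lookup score; key_in and key_out are the two
    key values that Fuzzy Grouping assigns. All names are illustrative.
    """
    if similarity >= SIMILARITY_THRESHOLD:
        # Exact or close match to a customer already in the database.
        return "ExistingCustomerLeads"
    if key_in == key_out:
        # Canonical row of its fuzzy group: a unique new lead.
        return "NewCustomerLeads"
    # Any other member of the group is a fuzzy duplicate of the canonical row.
    return "DuplicateCustomerLeads"


if __name__ == "__main__":
    print(route(1.0, 1, 1))
    print(route(0.4, 5, 5))
    print(route(0.4, 6, 5))
```

In the package itself the two decisions happen at different points in the data flow, because only rows below the threshold reach Fuzzy Grouping; the single function above collapses them for readability.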


Sample Results

To see the execution results of the Data Cleaning sample package, run the following Transact-SQL query:

SELECT * FROM AdventureWorks.FuzzyLookupExample.ExistingCustomerLeads;
SELECT * FROM AdventureWorks.FuzzyLookupExample.NewCustomerLeads;
SELECT * FROM AdventureWorks.FuzzyLookupExample.DuplicateCustomerLeads;
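The row count that the package displays comes from a parameterized count over the ExistingCustomerLeads table. The sketch below shows that kind of query, assuming a simplified two-column table. An in-memory SQLite database stands in for AdventureWorks so that the example is self-contained; the Name column, the sample rows, and both function names are hypothetical, and the actual statement in the package may differ.

```python
import sqlite3


def make_demo_connection():
    """Create an in-memory stand-in for the ExistingCustomerLeads table.

    The Name column and the rows are invented for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ExistingCustomerLeads (Name TEXT, _Similarity REAL)")
    conn.executemany(
        "INSERT INTO ExistingCustomerLeads VALUES (?, ?)",
        [("Exact match A", 1.0), ("Exact match B", 1.0), ("Fuzzy match C", 0.85)],
    )
    return conn


def count_by_similarity(conn, similarity):
    """Count rows whose _Similarity equals the supplied parameter value."""
    row = conn.execute(
        "SELECT COUNT(*) FROM ExistingCustomerLeads WHERE _Similarity = ?",
        (similarity,),
    ).fetchone()
    return row[0]  # single-row, single-column result


if __name__ == "__main__":
    connection = make_demo_connection()
    print(count_by_similarity(connection, 1.0))
    connection.close()
```

The question mark placeholder plays the role of the parameter that the package maps to a variable, and the single value returned corresponds to what is stored in RowCount.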



© 2008 Microsoft Corporation. All rights reserved.

Last edited Feb 27, 2009 at 12:36 AM by sabottaca, version 13
