So, You Say Your Data’s Clean, Huh?!?

Joseph R. Hudicka

Dulcian, Inc.

I have led a number of data migration efforts for several organizations and each experience has led me to the conclusion that nobody really understands data migration. The complexities of merging heterogeneous data sources are always underestimated. Sufficient time for analyzing data for common patterns is never allocated.

Everyone wants to build the optimal ‘enterprise information system’ - one that is completely normalized, tuned to perfection, and makes all information of value to the organization globally available to every user community. Often overlooked are the keepers of the legacy data, who frequently want nothing to do with data migration. This common situation raises the questions this paper attempts to answer.

Systems are constantly being changed, upgraded, enhanced, redesigned, or reengineered, and nearly every one of these efforts requires some form of data migration. The time has arrived for industry standardization in data migration.

Definitions

Perhaps the best place to start is by carefully defining our terms to be sure that everyone is talking about the same things. Data migration, data cleansing/scrubbing, and data transformation are often used interchangeably in the IT industry. The problem here is that data migration is not solely confined to the simple act of transporting data from one environment to one or more additional environments. Sometimes data must be reformatted. In other instances, data must be combed through, via operations that remove or replace inappropriate characters. There are other scenarios where redundant data must be merged/purged. All of these procedures are part of the larger process we refer to as Data Migration.

The following vocabulary should be used consistently when referring to these items and processes:

Data Migration - the overall, cyclical process of moving data from one or more SOURCE systems to one or more TARGET systems, encompassing all of the activities below.

Data Cleansing (Scrubbing) - identifying and removing or replacing invalid, non-standard, or redundant values in the SOURCE data.

Data Transformation - reformatting and restructuring data so that it satisfies the structures and business rules of the TARGET system.

Merge/Purge - consolidating redundant records from one or more SOURCE systems into a single occurrence in the TARGET.

From these definitions, you can see that Data Migration is not a single act but a cyclical process composed of multiple activities. It is not as simple as ‘moving data’: foreign data sets must be merged, and data must be ‘cleansed’ and transformed. These functions cannot always be fully automated.

"Dirty Data"

The following are categories and examples of data that are problematic in many data migration efforts.

1. Non-standardized value sets

The key to data integrity is defining a set of valid characters that supports the nature of the data for a given data element. Relational databases only support this notion to a certain extent. For example, defining a column of datatype NUMBER prevents non-numeric characters from being entered into that column. However, a VARCHAR2 column does not prevent numeric characters from being entered, because digits are characters too.

Referential Integrity constraints also support standardized value sets, by forcing references to valid values from other data structures. This prevents redundant data entry that effectively results in collisions (described later in this paper).

a) Invalid Characters

Unfortunately, databases do not inherently support atomic, standardized value sets, or the definition of an accepted character set for a specific column. Therefore, non-standard characters can be entered by unknowing users. We could write some PL/SQL code to protect against such entry, but clients often assume that it won’t be necessary since we can ‘teach’ the user community not to use these characters during training. In my experience, no amount of training will prevent the inclusion of non-standard characters in the database. To date, I have never encountered a data migration that didn’t include the characters ‘!@#$%^&*()’ in fields where they simply did not belong. This is problematic since these characters would not be supported by the TARGET system.
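A quick way to find such entries before they reach the TARGET is a query built around TRANSLATE, which strips every accepted character and reports the rows with anything left over. This is only a sketch; the ADDRESS/CITY pair and the accepted character list are assumptions made for illustration:

SELECT addr_id, city
FROM   address
WHERE  TRANSLATE(UPPER(city),
                 'x' || 'ABCDEFGHIJKLMNOPQRSTUVWXYZ -''',
                 'x') IS NOT NULL;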

b) Invalid Character Combinations

Another type of non-standardized value set violation is an invalid character combination. The example below shows a row from an ADDRESS table whose STATE column stores state abbreviations. The letter ‘Z’ would be in the valid character set for the STATE field, because it is a component of the abbreviation for Arizona ‘AZ’.

ADDRESS TABLE

ADDR        CITY           STATE  ZIP
1 WEST AVE  LAWRENCEVILLE  AZ     45938

However, the combination of this character showing twice in the same entry (i.e. ‘ZZ’) is invalid. This error is prevented by implementing a business rule that validates entries against a list of acceptable values. In Oracle, we would create a second table, STATE, and create a foreign key from ADDRESS to STATE, as shown below.

CREATE TABLE STATE
(STATE    VARCHAR2(2)  NOT NULL PRIMARY KEY,
 DESCR_TX VARCHAR2(20) NOT NULL);

CREATE TABLE ADDRESS
(ADDR_ID  NUMBER(10)   NOT NULL PRIMARY KEY,
 ADDR     VARCHAR2(45) NOT NULL,
 CITY     VARCHAR2(45) NOT NULL,
 STATE    VARCHAR2(2)  NOT NULL REFERENCES STATE(STATE),
 ZIP      VARCHAR2(9)  NOT NULL);
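With the STATE lookup and foreign key in place, an invalid combination such as ‘ZZ’ is rejected the moment it is entered. A brief illustration (the error text is abbreviated; the actual message includes the system-assigned constraint name):

INSERT INTO ADDRESS (ADDR_ID, ADDR, CITY, STATE, ZIP)
VALUES (1, '1 WEST AVE', 'LAWRENCEVILLE', 'ZZ', '45938');
-- ORA-02291: integrity constraint violated - parent key not found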

Unfortunately, many migration loads are periodic, so the migration itself cannot prevent additional data violations from accumulating on the legacy systems. To prevent these continuing violations, you must request code enhancements on the legacy system. This suggestion is often met with great resistance by IT managers who cannot understand the need to spend more money enhancing a system that is being replaced. If you can demonstrate that the cost of preventing these data violations is less than the cost of migrating invalid data, or of coping with invalid data on an ongoing basis, it may be possible to make this necessary change to the legacy system.

Collisions


Collisions can occur in any of the three following circumstances: when heterogeneous data sources are merged, when preventative code is missing or incomplete, and when unique identifier values are duplicated across systems.

Each will be discussed separately.

a) Merging Heterogeneous Data Sources

When merging heterogeneous data sources, formatting tends to be a very common problem area. For better or worse, systems design has been more of an art form than a standard. Therefore, different systems often use different techniques to implement very similar concepts. For example, the implementation of a person’s name may show up in any of the formats shown below:

Table #1: Last name, First name, Middle Initial, Surname
Table #2: Last name, First name, Middle Initial
Table #3: Last Name, First Name
Table #4: Name

The problem here is not the structure but the freedom afforded to the user to decide what data is important, and where that data belongs. Using the name ‘Jonathon Q. Allehandro II’ as an example, we will apply this name to the table structures shown above.

Table #1: Last name = Allehandro, First name = Jonathon, Middle Initial = Q, Surname = II
Table #2: Last name = Allehandro II, First name = Jonathon, Middle Initial = Q
Table #3: Last Name = Allehandro II, First Name = Jonathon Q.
Table #4: Name = Jonathon Q. Allehandro II

As you can see, these are just a few of the derivations of the same data that could be entered into a given system. In fact, multiple combinations could be entered into even the best system, and it would not be able to determine that they refer to the same person. The amount of effort you devote to preventative code should reflect your tolerance for integrity violations.
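As a first pass at spotting duplicate people within a single source, a crude merge/purge query can group records on a normalized form of the name. The PERSON_SRC table and its columns are assumptions used only for illustration:

SELECT UPPER(last_name)  AS last_name,
       UPPER(first_name) AS first_name,
       COUNT(*)          AS occurrences
FROM   person_src
GROUP  BY UPPER(last_name), UPPER(first_name)
HAVING COUNT(*) > 1;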

b) Preventative code

The core purpose of preventative code is to support an organization’s business rules by enforcing validity throughout the data entry process. Oracle supports declarative integrity constraints in the database design: primary key and unique constraints ensure uniqueness, foreign key constraints ensure validity against referenced data sets, and check constraints support even more advanced comparisons.
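As a minimal sketch of these declarative rules, the hypothetical EMPLOYEE table below combines all three. It reuses the STATE lookup table defined earlier; the other table, column, and constraint names are assumptions:

CREATE TABLE EMPLOYEE
(EMP_ID  NUMBER(10)  NOT NULL PRIMARY KEY,                  -- uniqueness
 SSN     VARCHAR2(9) NOT NULL UNIQUE,                       -- uniqueness
 STATE   VARCHAR2(2) NOT NULL REFERENCES STATE(STATE),      -- referenced data set
 HIRE_DT DATE        NOT NULL,
 TERM_DT DATE,
 CONSTRAINT EMP_DATES_CK
   CHECK (TERM_DT IS NULL OR TERM_DT >= HIRE_DT));          -- advanced comparison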

Most legacy platforms were not as flexible, leaving this sort of validity check up to the programmers. As a result, many business rules were only partially implemented, if at all. In many instances, I have seen required fields whose values were never validated during data entry. A field such as "STATE", for example, may require population, yet the system never checks what is entered. The result is many rows containing values such as decimal points and other stray characters instead of valid state abbreviations.

c) Duplicate Unique Identifier Values

Collisions on unique identifiers typically arise when merging heterogeneous data sources. When extracting personnel information from two different sources, there is a chance that both sources may generate unique identifiers of the same value, even though they point to entirely different personnel records. Unique identifiers are typically generated automatically by some system function or program. The most common sequencing is a counter that starts at ‘1’ and increments by ‘1’, ad infinitum. With this sequencing, two systems accumulating personnel information for different purposes will almost certainly identify it with overlapping key values.
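A quick way to size the problem before merging is to look for identifier values that appear in both extracts. A sketch, assuming the two sources have been staged in Oracle as PERSON_SRC_A and PERSON_SRC_B, each with a PERSON_ID column:

-- Identifier values generated independently by both source systems
SELECT a.person_id
FROM   person_src_a a, person_src_b b
WHERE  a.person_id = b.person_id;

When collisions do turn up, one common remedy is to re-key one of the sources from a single Oracle sequence during the load, carrying the original identifier along in a cross-reference column.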

 

Common Causes of Dirty Data Problems

So what makes data ‘dirty’? There are actually a number of events that can lead to dirty data. Most often, dirty data is the direct result of systems that perform little or no data integrity validation. By now it should be no surprise that many legacy or SOURCE systems have been less than rigorous in their data integrity validation. In fact, this is one of the primary reasons organizations move to a relational database environment. Unfortunately, the problems do not simply go away by themselves. The process of Data Migration often reveals these weaknesses and may require a massive effort to cleanse the large amount of dirty data and move it to the TARGET system. Some common causes of dirty data are described below.

a) Unsupported Business Rules

A whole new source of dirty data may come from systems where data integrity is left up to users, namely ‘User Creativity’. While most integrity violations were caused by careless data entry, it’s the ‘Creativity Factor’ that causes the most difficulties. The Creativity Factor arises from users attempting to satisfy their own system enhancement requirements by overloading columns with more data than they were intended to support.

A very common example is the case of a ‘Text’ field, intended to capture free-form notes. This ‘Text’ field evolves over time, capturing a great deal of other information. The additional information that was never intended to be included in the text field may begin to take on a common pattern. This would be great, if the pattern itself did not continually evolve. As in the Name example from earlier in the paper, this type of freedom may be perceived as system flexibility in the restrictive legacy environment. However, when the time comes for data migration, these text fields will cause problems.
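Before mapping such a field, it helps to measure how often the overloading actually occurs. A rough sketch, assuming a free-form NOTES_TX column in a hypothetical ORDER_NOTES table that users have taken to loading with purchase order references (‘PO#’):

SELECT COUNT(*)                                           AS total_rows,
       SUM(DECODE(SIGN(INSTR(notes_tx, 'PO#')), 1, 1, 0)) AS rows_with_po_pattern
FROM   order_notes;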

 

b) Assuming Data Is Clean…

Even after you’ve asked all of the right questions, data migration still may not proceed smoothly. The ‘keepers of the data’ may insist that the SOURCE data is perfectly clean and all that remains to be done is to simply move the data to the new system. However, relational databases do not lie. They force integrity, and report the exceptions.

When you assemble and present your Exception Report to the same group, they will most likely say ‘This data has been running our company for years, as is. Why mess with a good thing?’ This statement should not be accepted by those attempting to migrate the data and develop the new system. While it may be true that the company has been able to make the best of a poorly designed system, they need to be shown that the new system will prove to be far more efficient given a solid environment.
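One practical way to assemble that Exception Report is to let Oracle produce it when constraints are enabled against the loaded data. A sketch, assuming the standard EXCEPTIONS table has been created with the utlexcpt.sql script and that the ADDRESS foreign key was named ADDRESS_STATE_FK:

ALTER TABLE ADDRESS
  ENABLE VALIDATE CONSTRAINT ADDRESS_STATE_FK
  EXCEPTIONS INTO EXCEPTIONS;

-- Each violating row is recorded by ROWID, along with the constraint it broke
SELECT * FROM EXCEPTIONS;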

 

c) Choosing the wrong location to transform data

In those organizations where there is strong "Anti-Conversion" sentiment, you will most likely be urged to transform data on the legacy platform, while performing no cleansing activity. There may be some benefits to this approach. It is sometimes the appropriate choice, but most of the time, this is not the case.

Performing the data transformation on the legacy platform allows you to leverage the internal organization’s current skill set to write the transformation routines quickly. These routines can create preformatted text files that can be loaded into Oracle via SQL*Loader. This is the fastest method for moving data into Oracle currently available.
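For reference, the control file such a routine would feed to SQL*Loader can be quite small; the file, table, and column names below are assumptions:

LOAD DATA
INFILE 'emp_extract.dat'
APPEND
INTO TABLE emp
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(EMP_ID, FNAME, LNAME)

This would be invoked with, for example, sqlldr userid=scott/tiger control=emp.ctl, with the direct path option (DIRECT=TRUE) available when maximum load speed is required.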

Unfortunately, when the transformation routines are written in languages such as COBOL, the volume of code tends to require far more programming resources than the equivalent PL/SQL. On one project, for example, 30,000 lines of COBOL code were replaced with fewer than 1,000 lines of PL/SQL, to say nothing of the reduction in overall time spent writing the code.

Another significant drawback to doing the data transformation on the legacy platform is that the process exposes all of the dirty data that exists in the legacy system. This can be a politically sensitive issue for the organization, requiring a fair amount of damage control, particularly if the people who designed and maintained the legacy system are still with the organization. In some cases, the entire data transformation effort must be scrapped in favor of a thorough data cleansing analysis process.

If this situation is likely, why not prevent it by performing the cleansing analysis early in the project’s life cycle? At Dulcian, we strongly advocate using a data migration methodology that requires that data cleansing assessment take place during the Strategy and Analysis phases of the SDLC.

 

Error Violations during Data Migration

Oracle provides an extensive list of error messages under various technical categories, many of which can be raised during data migration when some condition is not satisfied. Unfortunately, most of them are not nearly detailed enough to give you useful information about the problem.

 

Errors of Use

The INSERT statement shown below is intended to populate the EMP table, which is described just below the INSERT statement.

INSERT INTO EMP (FNAME, LNAME)
VALUES ('JOHN', 'SMITH');

EMP
(EMP_ID NUMBER(10) NOT NULL PRIMARY KEY,
 FNAME VARCHAR2(20) NOT NULL,
 LNAME VARCHAR2(20) NOT NULL)

Since the EMP_ID column, which is mandatory, has not been accounted for in the INSERT statement, the following error is generated:

ORA-01400: cannot insert NULL into ("USER"."TABLE"."COLUMN")

This error message not only tells you the nature of the error that occurred, but also tells you which column of which table initiated the error. Ideally, this is how all of the errors should be formatted. Another useful error message formatted this way is shown below.

ORA-01407: cannot update ("USER"."TABLE"."COLUMN") to NULL

The following are examples of errors frequently encountered during data migration that offer very little useful information.

ORA-06502: PL/SQL: numeric or value error

ORA-01401: inserted value too large for column

ORA-00904: invalid column name

ORA-00911: invalid character

ORA-00936: missing expression

ORA-01481: invalid number format model

ORA-01438: value larger than specified precision allows for this column

ORA-01461: can bind a LONG value only for insert into a LONG column

ORA-01462: cannot insert string literals longer than 2000 characters

ORA-01476: divisor is equal to zero

ORA-01858: a non-numeric character was found where a numeric was expected

 

Data Quality Analysis

There are two distinct approaches to data integrity analysis: "Steamroller" and "Bump & Run." With the Steamroller approach, the data is analyzed comprehensively up front, before any migration code is run. In the Bump & Run approach, we build a migration routine, execute it, and watch it uncover each and every integrity violation, one by one. On one project, the Bump & Run approach uncovered 4.5 million data integrity violations.

For those projects where the overall data volume is low, or the quality of the data is truly known to be pristine, a series of helpful queries can be run just to validate what is known. In writing these queries, the areas to focus on are invalid dates and unique identifier collisions, as in the template queries below (the table and column names are placeholders).

-- Profile the values in a date column (substitute the real table and column
-- names; DATE itself is a reserved word and cannot be used as a column name)
SELECT date_col, COUNT(*)
FROM   tab
GROUP  BY date_col;

-- Find unique identifier values that occur more than once
SELECT id, COUNT(*)
FROM   tab
GROUP  BY id
HAVING COUNT(*) > 1;

For those of you who have recognized serious data integrity concerns at the very beginning of a project, a set of automated pre-transformation reports is extremely useful. As a cumulative effort based on every data migration project I have ever participated in, I designed several canned reports that scurry through the data prior to enabling constraints in the TARGET account, identify flaws in the data, and aggregate them. In other words, I automated the search for the distinct set of non-standard values prior to mapping and code execution. This approach has literally saved thousands of person-hours across the projects I have participated in.

An added benefit of using an automated tool is that it can be started and then allowed to run, without manual intervention, for anywhere from a few hours to a few days. When it finishes, the entire data integrity analysis effort is complete and repeatable. The alternative is to run each query against every column manually.

Automating Data Analysis

I have already assembled a number of valuable reports, and I am sure there are many others I have not even thought of yet. For example, I prepared a report that provides, for every SOURCE column, the total row count along with counts and percentages of populated (non-null) and null values.

This report tells me precisely which columns are candidates for migration, because it clearly identifies those data elements that contain no data. If there is no data, there is nothing to migrate. Prior to creating this report, I participated in several debates about the necessity of migrating specific data elements, only to find out later that they contained no data at all. This is a powerful argument for the automated reports described above.
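For a single table/column pair, the underlying arithmetic reduces to a query along these lines (TAB and COL are placeholders rather than real objects):

SELECT COUNT(*)   AS total_rows,
       COUNT(col) AS populated_rows,
       ROUND(COUNT(col) / DECODE(COUNT(*), 0, NULL, COUNT(*)) * 100, 2)
                  AS pct_populated
FROM   tab;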

The code for two of the supporting functions, SEARCH and SORT, is shown below. The SEARCH function parses every entry for a given table/column combination and extracts the distinct character set. It then passes this character set to the SORT function, which arranges it in ascending order.

Search Function

CREATE OR REPLACE FUNCTION SEARCH (PAR_TABLE_NAME VARCHAR2 := NULL,
                                   PAR_COL_NAME   VARCHAR2 := NULL)
  RETURN VARCHAR2
AS
  -- Dynamic SQL handles and work variables
  CURSOR_HANDLE INTEGER;
  EXEC_CURSOR   INTEGER;
  OUT_VALUE     VARCHAR2(2000) := NULL;
  FINAL         VARCHAR2(2000) := NULL;
  TEMP_1        VARCHAR2(1)    := NULL;
BEGIN
  -- Build and execute "SELECT <column> FROM <table>" dynamically
  CURSOR_HANDLE := DBMS_SQL.OPEN_CURSOR;
  DBMS_SQL.PARSE(CURSOR_HANDLE,
                 'SELECT '|| PAR_COL_NAME ||' FROM '|| PAR_TABLE_NAME,
                 DBMS_SQL.V7);
  DBMS_SQL.DEFINE_COLUMN(CURSOR_HANDLE, 1, OUT_VALUE, 2000);
  EXEC_CURSOR := DBMS_SQL.EXECUTE(CURSOR_HANDLE);
  LOOP
    IF DBMS_SQL.FETCH_ROWS(CURSOR_HANDLE) > 0 THEN
      DBMS_SQL.COLUMN_VALUE(CURSOR_HANDLE, 1, OUT_VALUE);
      -- Add each character of this value to FINAL if not already collected
      FOR I IN 1..LENGTH(OUT_VALUE) LOOP
        TEMP_1 := SUBSTR(OUT_VALUE, I, 1);
        IF INSTR(FINAL, TEMP_1) != 0 THEN
          NULL;  -- character already captured
        ELSE
          FINAL := FINAL || TEMP_1;
        END IF;
      END LOOP;
    ELSE
      EXIT;  -- no more rows
    END IF;
  END LOOP;
  -- Return the distinct character set in ascending order
  FINAL := SORT(FINAL);
  DBMS_SQL.CLOSE_CURSOR(CURSOR_HANDLE);
  RETURN NVL(FINAL, 'NULL IN COLUMN');
EXCEPTION
  WHEN OTHERS THEN
    DBMS_SQL.CLOSE_CURSOR(CURSOR_HANDLE);
    RETURN '"'||'NULL IN COLUMN'||'"';
END;
/

 

Sort Function

CREATE OR REPLACE FUNCTION SORT (PAR_STRING VARCHAR2)
  RETURN VARCHAR2
AS
  TYPE SORTED_STRING_TYPE IS TABLE OF VARCHAR2(2000)
    INDEX BY BINARY_INTEGER;
  SORTED_STRING SORTED_STRING_TYPE;
  OVER   VARCHAR2(1)    := 'N';
  TEMP_I VARCHAR2(1)    := NULL;
  FINAL  VARCHAR2(2000) := NULL;
BEGIN
  -- Load each character of the input string into a PL/SQL table
  FOR I IN 1..LENGTH(PAR_STRING) LOOP
    SORTED_STRING(I) := SUBSTR(PAR_STRING, I, 1);
  END LOOP;
  -- Bubble sort the characters into ascending order
  WHILE OVER = 'N' LOOP
    OVER := 'Y';
    FOR I IN 1..LENGTH(PAR_STRING) - 1 LOOP
      IF SORTED_STRING(I) > SORTED_STRING(I+1) THEN
        TEMP_I := SORTED_STRING(I);
        SORTED_STRING(I) := SORTED_STRING(I+1);
        SORTED_STRING(I+1) := TEMP_I;
        OVER := 'N';
      END IF;
    END LOOP;
  END LOOP;
  -- Reassemble the sorted characters into a single string
  FOR I IN 1..SORTED_STRING.LAST LOOP
    FINAL := FINAL || SORTED_STRING(I);
  END LOOP;
  RETURN NVL(FINAL, 'NULL IN COLUMN');
END;
/

Keep in mind that the idea is to automate wherever possible. The functions listed above are extracted from a report with an Oracle Forms front-end, which allows any number of tables to be selected for the report in a single request.
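Outside of the Forms front-end, the SEARCH function can also be pointed at a single table/column pair from SQL*Plus; the ADDRESS/CITY pair below is just an illustration:

SET SERVEROUTPUT ON
BEGIN
  -- Print the distinct, sorted character set found in ADDRESS.CITY
  DBMS_OUTPUT.PUT_LINE(SEARCH('ADDRESS', 'CITY'));
END;
/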

Another very useful report would provide not only the data structure definition of each legacy data element, but also the distinct character value set found within that column. This can then be compared to the standardized character value set as determined in your analysis. If you find that ‘One of these things is not like the other,’ this is useful information to have before rather than after the code has been built.

 

Data Mapping Tools

It makes much more sense to perform data transformation and cleansing on the TARGET system, for a number of reasons. A complete discussion of this topic can be found in the paper entitled ‘The Complete Data Migration Methodology’ (available under Publications at www.dulcian.com). However, it is important to note here that there are a number of additional Data Quality Analysis reports that are worth their weight in gold, although these reports are only available through the use of a data mapping tool.

Data Mapping tools do not necessarily have to generate code. Some specialize in areas such as data quality analysis, or mapping. Others embrace the entire data migration life cycle. Regardless of which you choose, you should definitely have one in your data migration arsenal.

By extracting data mapping rules from code and tracking them within a common repository, you can generate many reports that streamline the mapping effort. For example, you could generate reports that identify unmapped SOURCE and/or TARGET data elements. Reports like this serve as a monitor of progress: the lower the percentage of unmapped SOURCE elements and the higher the percentage of mapped TARGET elements, the closer your project is to completion. You can even take advantage of more comprehensive pre-analysis reports, such as one that compares SOURCE and TARGET data structures. This report can identify data elements whose lengths do not match, a condition that otherwise surfaces at load time as an error such as ‘ORA-01401: inserted value too large for column.’

Unfortunately, like many of the errors listed earlier, this one does not tell you the name of the offending column, or the table it resides in.
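A simple version of that structure comparison can also be run directly against the Oracle data dictionary once the SOURCE extracts have been staged; the schema names LEGACY_STAGE and TARGET_APP are assumptions:

-- Columns whose declared length shrinks between the staged SOURCE and the TARGET
SELECT s.table_name, s.column_name,
       s.data_length AS source_length,
       t.data_length AS target_length
FROM   all_tab_columns s, all_tab_columns t
WHERE  s.owner       = 'LEGACY_STAGE'
AND    t.owner       = 'TARGET_APP'
AND    s.table_name  = t.table_name
AND    s.column_name = t.column_name
AND    s.data_length > t.data_length;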

Conclusions

The main lesson to learn from this paper is, ‘Read your data before you map it.’ You may be very surprised by the information it provides. Performing a thorough analysis of the data first results in rapid identification of data cleansing requirements that translate immediately into mapping rules.

As for the mapping process itself, I do not recommend conducting a data migration effort without a comprehensive data migration management tool suite. A common repository allows the generation of flexible reports that are a far better communication vehicle between developers and the user community than reviewing program code together.

Lastly, plan on conducting many iterations of the data migration process; it will not be completed in one pass. Do not devote more time to mapping than to executing maps. Complete a first cut of the mapping as quickly as you can, and then start executing code. The remaining cleansing rules will be discovered by performing the migration itself, not through further research; some integrity issues will inevitably slip through the cracks of analysis and will not be caught until the code is run.

 

About the Author

Joseph R. Hudicka is the CEO of Dulcian, Inc (www.dulcian.com). He is a member of the Oracle8 Customer Advisory Council providing insight regarding future enhancements to the Oracle8 Server technologies. He is a systems design specialist for Decision Support (OLAP) and Online Transactional Processing (OLTP) systems with specific business knowledge in finance, health care, pharmaceutical, and manufacturing areas. Mr. Hudicka is the chief architect of DataMIG™ (Patent Pending), a complete data migration management tool suite. He delivers presentations regularly at major Oracle conferences and local Oracle user groups. Joe can be contacted at jhudicka@dulcian.com.

© 1999 Dulcian, Inc.