DataMIGä – A Data Migration Management Tool Suite

Joseph R. Hudicka

Dulcian, Inc.

June 9, 1998

Abstract

This paper describes DataMIGä – a sophisticated product capable of handling arbitrarily complex maps, supporting data migration for legacy systems and data warehouses. DataMIGä is a complete Data Migration Management Tool Suite. DataMIGä offers the following key features:

The tool uses a point and click GUI interface built in Developer/2000. The back-end, which actually generates the scripts and triggers, is written in PL/SQL. All maps are stored within a Designer/2000 style repository, capable of storing arbitrarily complex data maps.

Current Environment

Within our company, we have built very sophisticated templates to support the Designer 2000/Developer/2000 product suites. Currently, these products give us a nearly ideal environment for the design and development of applications. They allow is to design and build more quickly than we could have thought possible even a year ago. The coming advances in Designer 2000 2.0, Developer 2000 2.0 and Oracle8 will allow us to achieve even greater productivity.

This increase in productivity enables most portions of the SDLC to be accomplished easily and quickly. In addition, the systems that we build are designed flexibly so that they can be relatively easily modified as business requirements change. The increased productivity in design and development has brought the issue of data migration more clearly into focus.

Prior to the development of DataMIGä , a standard systems development project required spending approximately 50% of the time doing nothing but data migration. Data migration was the single largest component of any development project.

For data warehouses, the problem of data migration is even more acute. In one case, a data warehouse project spent 80% of the total budget on the data migration script. In these types of projects, not only do one-time migrations need to be accommodated, but periodic refreshes of the warehouse are also required. These periodic migration scripts require much more careful construction and tuning. Such scripts must also be maintained over time as the warehouse and legacy systems evolve. A similar problem exists in OLTP systems with data feeds that are external to the organization. These scripts, which must serve as the foundation for EDI applications between organizations, must also be tuned and maintained for periodic data feeds.

Classes of Data Migration

There are two distinct classes of data migrations:

  1. One time loads: With one-time loads, data is moved over only once. There is less need to be concerned about the quality of code needed for this type of migration. These migrations generally happen when a company is migrating from a legacy system to a new production system. All historic data must be moved from the old system to the new.
  2. Periodic loads: This type of migration occurs repeatedly over time. More care must be taken with the code of these migration scripts since they are reused many times. These types of data migrations happen in data warehousing environments where reporting is the primary function of the database and the database is fed by one or more sources, internal or external.

Any data migration requires a variety of mapping strategies. Migrations may include any or all of the following: one-to-one table mappings, aggregated summary tables based upon conditional logic and update/purge logic for periodic loads. All of these formats, and a variety of others, must be supported for a migration to succeed.

DataMIGä Background

We encountered a project with a very complex data migration component from a source outside the organization. We recognized that simply writing the data migration script in PL/SQL would have left the organization with an unmaintainable program, several hundred pages in length. It was therefore essential that we use some type of data migration tool where data map specifications could be stored in the repository and an efficient user interface was present to generate and maintain those data maps.

Data Migration Tools

In looking for a product, we set forth a number of criteria for our search:

  1. The data maps should be stored in a centralized repository with a user-friendly interface, similar in concept to Designer/2000.
  2. The tool had to support easy construction of arbitrarily complex maps.
  3. The tool had to work totally within an Oracle data environment. This criterion was necessary because maps must allow already migrated tables to be used as sources for other maps. Without this ability, maps are restricted to the original source tables, making mapping scripts inordinately complex and grossly inefficient.
  4. The tool should generate SQL, PL/SQL and SQL*Loader scripts that can be supported by standard Oracle developers. We originally hoped that we could use a tool that generates strictly SQL code that could be database independent. However, SQL alone is insufficiently rich to support the required complexity in many data maps. Furthermore, PL/SQL gave us much greater flexibility in generating optimized code.
  5. All of the functionality of designing and specifying data structures is already supported in Designer/2000. Ideally, we wanted a tool that would easily access the information from the Designer/2000 repository where we had already designed our data model for the new system.

Over the past year, several product offerings have come to market that enable analysts to map legacy systems to future production systems, and, in turn, generate the code with which to perform the data migration.

Our vision was to have an integrated repository where the data structure definitions for all of an organization’s systems including the mappings between legacy and future systems can be stored and maintained. This same repository should generate SQL and PL/SQL code to perform data migrations since these languages are native to Oracle.

Needless to say, the tools on the market fell far short of our expectations. Several flaws were apparent:

  1. The available products devoted much of their development effort to supporting multiple source and multiple target architectures and less effort to the entire migration process.
  2. Most of the tools were not capable of supporting particularly complex maps and did not generate adequately optimized code to support large volume data migrations.

Therefore, to satisfy the needs of the project, we decided to build our own tool, which would meet all of our search criteria. DataMIGä was designed to incorporate all of the following aspects of a robust data migration utility:

 

The Structure and Functions of DataMIGä

To support complex data maps, we needed a rich mapping environment. The traditional construct of mapping tables and columns from a legacy system to tables and columns in a new system or data warehouse is not nearly sufficient to support a data migration project. We need to support the following:

  1. Lookup tables in the destination system may have to be user defined since they may not exist anywhere in the legacy system.
  2. Appropriate primary keys may not exist in the legacy system and may have to be sequence generated.
  3. No matter how complex we make our data migration utility, there will always be some map logic that will go beyond any simple structure. Therefore, the system must include PL/SQL functions to support these complex maps.

Each map takes as its source any of these objects or already migrated destination tables. Then, the map writes to a destination table. Finally, maps need to be able to be called as sub-maps. This allows relatively simple maps to be combined into a single, complex map. This structure is illustrated in Figure 1.

Image90.gif (9662 bytes)

Figure 1 – DataMIGä model

The following are examples of the complexity that DataMIGä is capable of supporting.

  1. Two Source Columns to One Destination Column

    The source system has two columns for zip code, ZIP and ZIP4. We want to combine them into one ZIP column in the destination system. We can create the following formula:

    SOURCE.ZIP || ‘-‘ || SOURCE.ZIP4

    You select the source columns from an LOV, which shows all columns for the source tables for the current map.

  2. One Source Column to 2 Destination columns

    The source system has one-name columns with the format LAST NAME, FIRST NAME. The new system has LNAME and FNAME columns. We can perform this by building the following formulas:

    Last Name

    SUBSTR(EMP.NAME,1,INSTR(EMP.NAME,’,’,1,1)-1);

    First Name

    SUBSTR(EMP.NAME,INSTR(EMP,NAME,’,’,1,1)+1);

  3. Data Transformation

    The old system has a status field that uses "A" for active and "I" for inactive. The new system has an ACTIVE_YN field that should be set to Y for active and N for inactive. This can be performed by building the following formula:

    DECODE(SOURCE.STATUS,’A’,’Y’,’N’);

  4. User Created Function

    For an accounting system, in order to perform a complex operation on a value before copying it to the new system, put this computation into a function at the database level. Next, enter the syntax for the function call into the system. Then, use the function wherever you need to.

    We have provided this functionality to support conditions where the standard GUI interface does not support the necessary complexity. In practice, this feature is rarely used. Our experience has been that 95-100% of maps in a data migration project are supported through the basic functionality within DataMIGä .

  5. Using Destination Tables as Sources

    To ensure that the old key is copied into the new table, after the master table has been populated, you can link the master destination table with the source table using the old primary key. You can then store the new primary key in the detail table.

  6. Multiple Source Tables to 1 Destination Table

    For an accounting system, four different tables were holding accounting information for one record. We wanted to combine these into one table in the new system. DataMIG can support any number of joins with any filter clauses necessary.

  7. One source record to multiple target records

    In the old system, we had client name, address, and phone number information on one record in one table. In the new system, we wanted to create one record in the client, address, and phone tables. We created a master map that inserted the client information into the new client table. We then created two child maps, one for the address and one for the phone. To populate the Client ID foreign key for these tables, we used the client sequence number. Since DataMIGä recognized that the name of the sequence number was not the name of the table, CURRVAL instead of NEXTVAL was used for the sequence number.

  8. Filtration of Data using Logical syntax

    We only want to create a record in the address table if the address was filled in. We can create conditional logic such as the following:

    IF source.addr is not null THEN

    CALL_MAP(‘ADDR_CLIENT’);

    END IF;

  9. Lookup table population

Some legacy systems do not have lookup tables. We can populate them in one of two ways:

  1. Periodic Loads via Updates

DataMIGä can build a script that will check to see if the data exists. If it does, update the record, otherwise, insert a new record.

Data Transformation and Cleansing

In order to be useful, all migration tools must handle data transformation and cleansing. Data transformation is the process by which we massage the format of source data into that of the destination data. A good example of transformation would be taking a source column which stores DATE information in a textual format, and changing it to one of Oracle's proprietary data formats.

Cleansing on the other hand is the process of cleaning invalid data during its migration to the destination. This is where we might change those STATE = '.' values to something like "NA" for "NOT AVAILABLE."

Data transformation and cleansing are implemented in DataMIG through functions and conditional logic.

Script Generation Languages

DataMIG supports generation in 2 Oracle native formats.

SQL

SQL is useful for two very common migration needs: one-to-one mappings, and aggregations. One-to-one mappings are the kind where a particular source file, i.e. Departments, is used to populate one, and only one, destination table, i.e. DEPT. In this case, there is no conditional logic to apply, thus eliminating the need for a PL/SQL solution. Aggregates are summary tables that snapshot a data set at a level higher than the atomic level. While parallel querying and mega-processor servers are becoming more and more abundant, aggregates now serve reporting needs in a cost-effective manner. Aggregates rarely depend upon conditional logic, and are easily implemented through a SQL solution.

PL/SQL

Periodic loads require more than simple "INSERT INTO" syntax. "Updates" and, at times, "Deletes" are necessary when refreshing a warehouse instance.

It should be clear, however, that one set of code can and should accomplish both the initial migration and the periodic refreshes. This is where PL/SQL is extremely useful.

PL/SQL's conditional logic provides us with the capability to determine whether or not a particular row already exists in the destination, and conclude whether an Update or Insert is appropriate, row by row.

Parallelism

DataMIG supports a full network structure of tasks. Every time a task is completed, the available task list is searched and every appropriate task is started. The true objective of any data migration is to seamlessly integrate every source element into a new database instance. All too often, however, the focus tends to be on "How quickly can we migrate this data set?" DataMIGä offers two methods for the execution of your migration scripts: Sequential and Parallel

A. Sequential Migration

In a sequential data load, scripts are executed based upon relational dependencies between tables (i.e. lookup tables are populated before tables that refer to them, etc.). Scripts are executed one at a time, until all scripts have been processed.

For example, I have three tables, PRODUCT, DEPT and EMP, where EMP has one foreign key, to DEPT. By the sequential method, we would load these tables in the order of PRODUCT, DEPT and EMP, one after the other, as a result of EMP's dependency upon DEPT (see Figure 2), realizing a total migration time of seventy minutes.

Image86.gif (3712 bytes)

Figure 2 - Sequential Data Migration

 

B. Parallel Migration

In Parallel mode, DataMIGä analyzes the dependencies between tables in further detail, and executes scripts the moment that every dependant table is populated.

The parallel scheduling engine realizes that PRODUCT has no dependencies to any other table, and would therefore execute the PRODUCT map simultaneously with the DEPT and EMP maps, which do have a dependency between them (see Figure 3). The resulting total migration time is now reduced to forty minutes, a savings of 30 minutes overall, or better than 40% of the total migration!

Image87.gif (3360 bytes)

Figure 3 - Parallel Data Migration

As you can see, the Parallel mode provides a clean method to drastically reduce the overall processing time of any data migration.

C. Constraint/Index Management Utility

Constraints and Indexes cause sub-optimal performance of migration scripts. Every record inserted into a table also requires updates to all existing indexes. DataMIGä offers a point and click interface to seamlessly manage constraints and indexes during migration process.

D. Automated Error and Integrity Violation Tracking

The parallel scheduling engine offers a compiler to ensure that mappings are syntactically correct prior to their execution. This step alone eliminates 90% of most migration script failures, prior to running them. From here, the scheduling engine tracks any and all execution errors electronically.

Integrity Violation tracking can be performed during execution or after. The most productive method of tracking integrity violations is after the maps have been executed. This way, every integrity violation will be recorded. If you choose to track these violations during execution, you will only identify the first violation, as it causes an error, forcing the map to stop running.

Comprehensive Reporting

The completeness of migration scripts is very difficult to identify. Migration itself is iterative in nature. DataMIGä offers a series of quality assurance reports that help determine the state of the migration project. Reports are available by map, across maps for impact analysis, impact analysis between Designer/2000 and DataMIGä , and audit reports that identify both SOURCE and TARGET columns that have not been mapped. Another report drills into the nature of your source data by providing counts of total records, total values per column and total unique values per column, highlighting columns that require special transformation rules, or may not require mapping, at all.

Conclusions

Data migration is becoming an increasingly important part of the overall systems development process. DataMIGä can simplify and expedite this process by doing the following:

DataMIGä is a sophisticated product, capable of handling arbitrarily complex maps, supporting data migration for legacy systems and data warehouses. It will also generate triggers to keep any two heterogeneous Oracle databases in synch.