DATA
MIGRATION STRATEGIES FOR DATA WAREHOUSING
Joseph R. Hudicka, Sergey Guberman & Dr. Paul
Dorsey
Dulcian, Inc.
Overview
By
now, we are all aware of the fact that data warehousing is nothing more than a
new, flashy name for an age-old wish - a reporting system which delivers all of
our data, all of the time, to all of our users, in a timely manner. It is a
process of transforming raw data into useful information.
However,
as reality sets in, we find that this is still no small task. There are usually
several data sources which must be integrated into one common repository. The
fact is, many data warehouse projects are really legacy system migrations which
can take advantage of newer, scaled hardware architecture, making migration an
affordable alternative.
This
paper will review our findings based upon two data migration projects. First,
the scope of the project will be defined. Second, we will review how we
accomplished our task. Lastly, we will discuss how we would probably do these
projects today.
Project Description
Our
task was twofold. We were to perform the migration of a legacy reporting
system, while integrating the data structures of this former system with those
of the new production system. Unfortunately, we had no involvement with the
production system, or OLTPS (On-Line Transactional Processing System)
migration.
Imagine
yourself as that little ship in the high-tech 1970s video game, "Space
Invaders," trying to hit the targets scrolling across the screen above. Designing
integrated data structures based on an evolving OLTP system is quite similar.
Unfortunately, however, this is one skill we cannot master from practice, since
every system brings with it its own oddities.
In
this particular case, of course, requirements documents did not exist, making
code specifications nearly impossible to assemble. It became quite clear
that since no significant advances had been made up to that point, we would
have to begin by developing project standards.
It
certainly came as no surprise that the legacy data structures did not map
terribly well to those of the new OLTPS. What was surprising, however, was that
the OLTPS migration team had already built an engine to transform legacy data
into the new system’s format (a bit of information which might have been
awfully useful early on in the project). Figure 1 below depicts the scope of
the two projects side by side.
Fig.
1 - Project Scope
To
complete our task, we were required to write two separate transformation
engines - one to migrate data from the legacy system to the warehouse, and the
other to migrate data from the new OLTPS. What’s wrong with this picture?
Notice the dotted line pointing from the legacy system to the new OLTPS. This
is a transformation which had already been built by another group, which
converted legacy data into the format of the new OLTPS. Instead of forcing all
historic data through this one conversion process, we had to write and maintain
two separate processes, remaining ever aware that the warehouse might receive
the same transactions from both source systems.
One
final quirk was that the users required the ability to query by both the old
and new coding schemes. This would prove to be our performance nemesis.
How Data Migration Was
Accomplished
Our
team was divided equally: three members of our group and three business
analysts from our client. The first deliverable was a series of spreadsheets (See
Table 1) which were used to map the old and new data structures, while
capturing business definitions, data formatting definitions, and data sources.
These spreadsheets were a focal point of the project - they forced people to
take ownership of tasks by documenting their findings, leaving them open for
discussion.
Business Definition                       | NEW TABLE | NEW COLUMN | LEGACY TABLE | LEGACY COLUMN | ORACLE TABLE | ORACLE COLUMN
A person to whom health care is provided  | PERSON    | PERSON_NO  | VK00023S     | SUB_ID        | MBR          | MBR_ID

Table 1: Spreadsheets for Data Migration (Note that the data formatting columns have been omitted for display purposes.)
They
also enabled us to begin designing our third normal form (3NF) database model.
This is an extremely important step in designing a data warehouse. Most people
would never think of using the terms 3NF and data warehouse in the same
sentence, let alone the same project. But keep in mind that we had
little or no idea of what the users’ query criteria would consist of. By
implementing a 3NF model, we assured ourselves that we would supply all of the data
at its most atomic level.
We
maintained data diagrams and structures in Designer/2000 because it supported
the iterative approach we were caught up in. Our next step was to build an
interface between our spreadsheets and Designer/2000. The analysts would update
their spreadsheets, and we would update Designer/2000 by executing a home-grown
utility which would read from the spreadsheet and update the case repository.
As
the data model began to solidify, we began searching for the best way to begin
loading the database.
First,
we transferred all legacy data to the reporting server using FTP. We then
created a separate tablespace and used SQL*Loader to load staging tables. The
staging tables were given the same names as the source data files, as
documented in the spreadsheets. Next, we built a second utility. This utility
would generate scripts of insert statements with the following format:
INSERT INTO [destination table name] ([destination column list])
SELECT [source column list] FROM [source table name]; (See Example 1.)
Example 1

INSERT INTO mbr (mbr_id)
SELECT person_no FROM persons;

Note: The above statement selects all PERSON_NO values from the PERSONS staging
table and inserts them into the MBR table in the data warehouse.
These
scripts were our first cut population scripts. They were an excellent source of
coding specifications for programmers who had little or no knowledge of either
the old or new systems.
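To give a concrete picture of how such a generator might work, here is a minimal sketch, assuming the spreadsheet mappings have been loaded into a hypothetical table named MIG_MAP with columns SRC_TABLE, SRC_COLUMN, DEST_TABLE and DEST_COLUMN (these names are ours, not the project's actual objects):

SET SERVEROUTPUT ON

DECLARE
  CURSOR c_tab IS
    SELECT DISTINCT dest_table, src_table
      FROM mig_map;
  v_dest_cols VARCHAR2(2000);
  v_src_cols  VARCHAR2(2000);
BEGIN
  FOR t IN c_tab LOOP
    v_dest_cols := NULL;
    v_src_cols  := NULL;
    -- Build the destination and source column lists for this table pair.
    FOR c IN (SELECT dest_column, src_column
                FROM mig_map
               WHERE dest_table = t.dest_table
                 AND src_table  = t.src_table) LOOP
      IF v_dest_cols IS NOT NULL THEN
        v_dest_cols := v_dest_cols || ', ';
        v_src_cols  := v_src_cols  || ', ';
      END IF;
      v_dest_cols := v_dest_cols || c.dest_column;
      v_src_cols  := v_src_cols  || c.src_column;
    END LOOP;
    -- Emit one first-cut population statement per staging/warehouse table pair.
    DBMS_OUTPUT.PUT_LINE('INSERT INTO ' || t.dest_table || ' (' || v_dest_cols || ')');
    DBMS_OUTPUT.PUT_LINE('SELECT ' || v_src_cols || ' FROM ' || t.src_table || ';');
  END LOOP;
END;
/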
Our
next issue of great concern dealt with the user’s requirement to query by both
legacy and new lookup values.
We
can see by the descriptions in Table 2 that there are a total of four
different claim types from both the legacy and the new system. But by looking
at the codes, we can see that there is a collision where the value = ‘P’.
We therefore cannot possibly use the claim type codes as the unique identifier.
Legacy Code | Legacy Description | New System Code | New System Description
PD          | PAID               | P               | PAID
RJ          | REJECTED           | R               | REJECTED
P           | PREFERRED          | F               | PREFERRED
NP          | NOPAY              | N               | NOPAY

Table 2: Claim type codes and descriptions from the legacy and new systems
Since
we were committed to maintaining characteristics of both systems, we
implemented our lookup tables as in Table 3.
WHSE_ID | Legacy Code | Legacy Description | New System Code | New System Description
1       | PD          | PAID               | P               | PAID
2       | RJ          | REJECTED           | R               | REJECTED
3       | P           | PREFERRED          | F               | PREFERRED
4       | NP          | NOPAY              | N               | NOPAY

Table 3: A warehouse-generated UID (WHSE_ID) maps the separate source system values together.
Using
the above data structure design, we were able to capture the proper WHSE_ID
value from the lookup table during the load process. A front-end was then built
to allow users to select from either list of source values. Their selections
were used to retrieve the proper WHSE_ID value(s) to satisfy the user’s query.
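As an illustration of that load-time lookup (the table and column names here are invented for the sketch, not the actual warehouse objects), a claim load might capture the WHSE_ID as follows:

INSERT INTO claim (claim_id, claim_type_whse_id)
SELECT lc.claim_id,
       ct.whse_id
  FROM legacy_claim_stg lc,
       claim_type_lkp   ct
 WHERE ct.legacy_value = lc.claim_type_cd;

A second, nearly identical script would join the new OLTPS staging table to the lookup table on the new system value instead.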
Coding Style
All
of our data migration scripts were written in PL/SQL. Each table would require
a minimum of one script per data source. Cursors were used heavily, since we
needed to maintain referential integrity based upon the newly generated
WHSE_IDs. In some cases, functions were found to be useful replacements for
cursor selects. We wrote triggers to populate the WHSE_ID column by selecting
unique values from sequences.
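A minimal sketch of one such trigger, assuming a warehouse table MBR with a WHSE_ID column and a sequence named MBR_WHSE_ID_SEQ (the names are illustrative):

CREATE SEQUENCE mbr_whse_id_seq;

CREATE OR REPLACE TRIGGER mbr_whse_id_trg
  BEFORE INSERT ON mbr
  FOR EACH ROW
BEGIN
  -- Assign the next surrogate key value to each new warehouse row.
  SELECT mbr_whse_id_seq.NEXTVAL
    INTO :NEW.whse_id
    FROM dual;
END;
/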
The
load process was embedded in a series of UNIX scripts. The first script
compiled each PL/SQL script in the database. This step assisted us in catching
bugs prior to launching the load process. Although it was not foolproof, it did
provide us with about 98% assurance. The only error type it did not account for
was one that violated data formats.
The
second script executed each script sequentially, based upon a predetermined
order which satisfied referential integrity. Each step maintained log files to
allow us to monitor the process from beginning to end.
Performance Strategies (Load
Process)
There
are millions of ways to write code to accomplish any one task. The trick is to
discover the right coding techniques for each task. IT ALL COMES DOWN TO
PERFORMANCE!!! A job is not done until it works "in a timely manner."
Believe it or not, we actually benefited from our constantly evolving data
structures because they gave us the opportunity to practice tuning. Virtually
any modification to a script could result in a different execution plan.
Ask
a developer whose job it is to perform tuning and their answer will invariably
be the DBA. Ask the DBA, however, and (you guessed it) they will say it is the
developer’s job. The fact is, tuning is everybody’s job. It does, however,
begin with the developer.
It
is very important to be able to test your code on a data set comparable to that
used for production. Applications that work well on five rows might just react
differently to five million rows. This difference likely has a lot to do with the
optimizer.
Current
versions of the Oracle RDBMS come with two optimizers. The original optimizer
is known as RULE, and, as its name states, it determines the execution
plan for a given SQL statement based upon a fixed set of ranking rules. The second,
cost-based optimizer (COST) gathers statistics on the data within tables and
determines the ideal execution plan based upon those statistics. Our database
was in CHOOSE mode for optimization, although we later learned that the best
choice in this case was COST. CHOOSE lets the database decide between the two
optimizers at run time, while COST forces the cost-based approach. However, hints can be used to
override the cost-based optimizer with rule-based options.
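For reference, the statistics that the cost-based optimizer relies on were gathered in that era with the ANALYZE command; a typical call (table names are ours) looks like this:

ANALYZE TABLE grp COMPUTE STATISTICS;

-- For very large tables, estimating from a sample is much faster:
ANALYZE TABLE claim ESTIMATE STATISTICS SAMPLE 10 PERCENT;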
The
three types of joins used are nested loops, sort merge join, and hash join:
1.
The nested loops
method selects the driving table, scans it using a full table scan or index,
and then finds all rows in the driven table that satisfy the join condition.
This method is useful for OLTP type applications, but not for moving large amounts
of data.
2.
The sort merge
method accesses data by performing a full table scan of each table, sorting the
data by the joining columns, and merging the results of the sort. This method
is good for batch operations. This method will not return any rows until 100%
of the rows are merged.
3.
The hash join
builds partitions of the smaller table in memory and creates a hash table that
maps join column values to the rows in each partition.
Oracle then reads each record from the other table and probes the hash table to
find the matching partition records in memory. It is only used for equi-joins, and is terrific for
retrieving large amounts of data. HASH_AREA_SIZE (an init.ora parameter) should
be set as large as you can afford.
Hints
are suggestions for the optimizer, specifically to instruct the optimizer to
override the COST path. Hints are implemented by placing the hint immediately after the
SELECT keyword of a select statement, enclosed in comment markers with a plus sign,
‘/*+ … */’. They are not for use with group by, order by, set operators, or
distinct clauses.
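For example, a hint forcing the rule-based approach for a single statement, or forcing a full table scan of one table, looks like this (names are illustrative):

SELECT /*+ RULE */ colA
  FROM tableT
 WHERE colB = :variable;

SELECT /*+ FULL(t) */ colA
  FROM tableT t
 WHERE colB = :variable;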
Explain
Plan and TKPROF are two extremely helpful tuning utilities that come with the
RDBMS. Explain Plan allows you to check the execution plan of a SQL statement
while TKPROF allows you to check the execution plan of a PL/SQL script,
embedding Explain Plan results for each select statement in the script.
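A quick, illustrative Explain Plan session (assuming the standard PLAN_TABLE has been created with utlxplan.sql) looks like this:

EXPLAIN PLAN SET STATEMENT_ID = 'demo' FOR
SELECT colA
  FROM tableT
 WHERE colB = :variable;

SELECT LPAD(' ', 2 * (LEVEL - 1)) || operation || ' ' || options
       || ' ' || object_name plan_step
  FROM plan_table
 START WITH id = 0 AND statement_id = 'demo'
CONNECT BY PRIOR id = parent_id AND statement_id = 'demo';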
Application
tuning is a matter of trial and error. Even if you are certain that you are
using the optimal access path, check it - you can never be too sure!
Let’s
go over a few examples where TKPROF helped us.
Example
2
If
you have a query such as:
SELECT colA
FROM tableT
WHERE colB = :variable;
The
general tuning strategy would be to create an index on the column(s) in the
where clause, which are your filter criteria. However, we have found a better
alternative.
Rather
than creating a single column index, we created a composite index:
CREATE INDEX my_index ON tableT (colB, colA);
When
you only index the filtering column(s), the execution of your query will
perform two steps. The first will be a scan using that index to find the key
value. The second step would then be to retrieve that same record, using table
access by ROWID.
However,
when you create a concatenated index, you can eliminate the second step of
retrieval by ROWID, because your query can be satisfied by the index. It is
hard to estimate the savings you will see, because it largely depends on
whether or not your database parameters have been optimized. Our tests have
shown about a 30 – 40% time savings.
Example
3
Consider the following PL/SQL block:
DECLARE
  CURSOR c1 IS
    SELECT grp_id
      FROM vk00052s;
  v_grp_whse_id grp.grp_whse_id%TYPE := NULL;
BEGIN
  FOR c1_rec IN c1 LOOP
    DECLARE
      CURSOR c2 IS
        SELECT grp_whse_id
          FROM grp
         WHERE grp_extrnl_id = c1_rec.grp_id;
    BEGIN
      OPEN c2;
      FETCH c2 INTO v_grp_whse_id;
      CLOSE c2;
      /* here you are doing something else */
    END;
  END LOOP;
END;
In
this example, you read all records from the VK00052S staging table and, based
on a join, look up the matching records in the GRP table.
As
in Example 2, the logical first step is to create an index on the grp_extrnl_id
column. However, there is another, much faster way to write that code. We are
going to show how to use PL/SQL tables to obtain better performance. This is how
we did it:
DECLARE
  CURSOR c1 IS
    SELECT grp_id
      FROM vk00052s;
  CURSOR c2 IS
    SELECT grp_whse_id, grp_extrnl_id
      FROM grp;
  TYPE my_type IS TABLE OF grp.grp_whse_id%TYPE
    INDEX BY BINARY_INTEGER;
  tbl_grp_whse_id my_type;
  v_grp_whse_id   grp.grp_whse_id%TYPE;
BEGIN
  /* Load every GRP lookup value into a PL/SQL table in memory,
     indexed by the external group id. */
  FOR c2_rec IN c2 LOOP
    tbl_grp_whse_id(c2_rec.grp_extrnl_id) := c2_rec.grp_whse_id;
  END LOOP;
  /* Resolve each legacy record against the in-memory table. */
  FOR c1_rec IN c1 LOOP
    BEGIN
      v_grp_whse_id := tbl_grp_whse_id(c1_rec.grp_id);
    EXCEPTION
      WHEN OTHERS THEN
        NULL;  /* no matching GRP record found */
    END;
  END LOOP;
END;
In
this example, we populate a PL/SQL table with all records from the GRP table.
When we are ready, we read back from that PL/SQL table, and since the table
is held in memory, access takes much less time than going against the
disk.
Based
upon our sample data set, we saved about 60% of the time our original code
took.
Example
4
We
had two denormalized tables from the legacy system for which we had to create
complicated logic in order to integrate them into one, normalized structure.
Rather than having one column listing doctors’ identification numbers and
another listing each doctor’s specialty, the table would have multiple columns,
one for each instance. For example, if three occurrences of radiologists were
allowed, the table would have three columns, as follows:
RADIOLOGIST1
RADIOLOGIST2
RADIOLOGIST3
We
wrote dynamic SQL using the DBMS_SQL package, to generate normalized rows from
these legacy data structures. This code certainly did the job, although it took
a significant amount of time to process.
Once
again, we found a better method. Instead of using all of that complicated
dynamic code, we combined the two legacy tables into one using UNION ALL
and filtered out unnecessary records with a WHERE clause.
We spent some additional time creating that temporary table, but after
doing so, we realized that we no longer needed a dynamic SQL procedure. This
example shows that sometimes a less elegant method can give us the largest
performance benefit: the total time to execute our code dropped
dramatically.
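As an illustration only (the table and column names here are invented, not the project's actual objects), the combined temporary table and the static SQL that replaced the dynamic procedure might look like this:

-- Combine the two legacy tables into one temporary table,
-- filtering out the records we do not need (hypothetical filter).
CREATE TABLE legacy_dr_tmp AS
SELECT claim_id, radiologist1, radiologist2, radiologist3
  FROM legacy_dr_file1
 WHERE claim_status <> 'VOID'
UNION ALL
SELECT claim_id, radiologist1, radiologist2, radiologist3
  FROM legacy_dr_file2
 WHERE claim_status <> 'VOID';

-- Static SQL can then unpivot the repeating columns into normalized rows.
INSERT INTO claim_doctor (claim_id, doctor_id, specialty)
SELECT claim_id, radiologist1, 'RADIOLOGIST' FROM legacy_dr_tmp WHERE radiologist1 IS NOT NULL
UNION ALL
SELECT claim_id, radiologist2, 'RADIOLOGIST' FROM legacy_dr_tmp WHERE radiologist2 IS NOT NULL
UNION ALL
SELECT claim_id, radiologist3, 'RADIOLOGIST' FROM legacy_dr_tmp WHERE radiologist3 IS NOT NULL;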
Example
5
If
ever there was one tuning rule I believed in, it was that a FULL TABLE SCAN was
the optimal method for accessing a small table.
We
had a query:
SELECT colA
FROM table T
WHERE colB = :variable
and colC is null;
In
this example, Table T was small (550 records) and colB had only 270 distinct
values. We performed a FULL TABLE SCAN on table T and the CPU time usage was
0.88 seconds. After we created an index on colB and used the INDEX SCAN method
on table T, CPU time usage dropped down to 0.07 seconds. After finding this, we
changed our opinion as to the optimal method for querying small tables.
Just
a suggestion…???
Our
database block size in this example was 4K and unchangeable, in spite of our
pleas. The smaller the block size, the more reads are required to retrieve the
same amount of data, which hurts performance. We recommend using the largest database
block size allowable by your operating system because of the performance
benefits.
Performance Strategies (User
Access)
After
resolving the integration and migration issues, we could concern ourselves with
performance from the user’s perspective. Remember, the 3nf was never intended
as the primary resource for user reporting - rather, it is merely the
foundation. We implemented a series of views, denormalizations, and summary
tables to ensure timely response to user queries.
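For example, a summary table over the 3NF claim data could be built along these lines (names are illustrative):

CREATE TABLE claim_month_summary AS
SELECT mbr_id,
       claim_type_whse_id,
       TRUNC(effective_date, 'MM') claim_month,
       COUNT(*)                    claim_cnt,
       SUM(claim_amt)              total_claim_amt
  FROM claim
 GROUP BY mbr_id, claim_type_whse_id, TRUNC(effective_date, 'MM');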
We
also found that clustering a single table gave us better performance. We had a
large table against which most queries used range predicates
(<, >, BETWEEN) on the EFFECTIVE_DATE column. We created a cluster and
stored our table in that cluster. We also created a cluster index for the
EFFECTIVE_DATE column, and saw retrieval time decrease.
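A sketch of that single-table cluster (object names here are illustrative, not the actual warehouse tables):

CREATE CLUSTER claim_date_cluster (effective_date DATE);

-- The cluster index on EFFECTIVE_DATE is required before rows can be inserted.
CREATE INDEX claim_date_cluster_idx
  ON CLUSTER claim_date_cluster;

CREATE TABLE claim_by_date (
  claim_whse_id   NUMBER,
  effective_date  DATE,
  claim_amt       NUMBER
)
CLUSTER claim_date_cluster (effective_date);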
How this project can be done better
and faster today.
If
we had it to do all over again, today, we would probably take a radically
different approach. Because of the project’s volatility, we had a difficult time turning
the project over to the client’s employees. They did not have the in-house programming
skills required to enhance the load scripts we developed.
Today,
there are several viable data mapping tools on the market which radically
smooth out this transition. They place the coding element behind the scenes by
generating scripts from the mapping definitions which the user defines in the
tool. Some tools, such as the one offered by Platinum, come with an interface
to Designer/2000, while others, such as Smart’s SMARTDB, come with their own
diagramming tools.
All
of these tools process flat files by default, although they may support a
proprietary format that will assist you. One feature to look for is parallel
execution. SMARTDB can simultaneously read two files which share a referential
key and populate two separate data structures, while maintaining referential
integrity.
While
sequential processing can paralyze a load, a scheduling tool can parallelize
the load process, dramatically reducing execution time. Most have the ability
to execute scripts at specified times, or based upon an event such as the
successful completion of a prior script.
Hopefully,
all of the data migration strategies discussed above and lessons learned from
our project example will be useful in making future data migration projects go
faster and more smoothly.