DATA
MIGRATION STRATEGIES FOR DATA WAREHOUSING
Joseph R. Hudicka, Sergey Guberman & Dr. Paul
Dorsey
Dulcian, Inc.
Overview
By
now, we are all aware of the fact that data warehousing is nothing more than a
new, flashy name for an age-old wish - a reporting system which delivers all of
our data, all of the time, to all of our users, in a timely manner. It is a
process of transforming raw data into useful information.
However,
as reality sets in, we find that this is still no small task. There are usually
several data sources which must be integrated into one common repository. The
fact is, many data warehouse projects are really legacy system migrations which
can take advantage of newer, scaled hardware architecture, making migration an
affordable alternative.
This
paper will review our findings based upon two data migration projects. First,
the scope of the project will be defined. Second, we will review how we
accomplished our task. Lastly, we will discuss how we would probably do these
projects today.
Project Description
Our
task was twofold. We were to perform the migration of a legacy reporting
system, while integrating the data structures of this former system with those
of the new production system. Unfortunately, we had no involvement with the
production system, or OLTPS (On-Line Transactional Processing System)
migration.
Imagine
yourself as that little ship in the high-tech 1970s video game, "Space
Invaders," trying to hit the targets scrolling across the screen above. Designing
integrated data structures based on an evolving OLTP system is quite similar.
Unfortunately, however, this is one skill we cannot master from practice, since
every system brings with it its own oddities.
In
this particular case, of course, requirements documents did not exist, making
code specifications nearly impossible to assemble. It became quite clear
that since no significant advances had been made up to that point, we would
have to begin by developing project standards.
It
certainly came as no surprise that the legacy data structures did not map
terribly well to those of the new OLTPS. What was surprising, however, was that
the OLTPS migration team had already built an engine to transform legacy data
into the new system’s format (a bit of information which might have been
awfully useful early on in the project). Figure 1 below depicts the scope of
the two projects side by side.
Fig.
1 - Project Scope
To
complete our task, we were required to write two separate transformation
engines - one to migrate data from the legacy system to the warehouse, and the
other to migrate data from the new OLTPS. What’s wrong with this picture?
Notice the dotted line pointing from the legacy system to the new OLTPS. This
is a transformation which had already been built by another group, which
converted legacy data into the format of the new OLTPS. Instead of forcing all
historic data through this one conversion process, we had to write and maintain
two separate processes, remaining ever aware that the warehouse might receive
the same transactions from both source systems.
One
final quirk was that the users required the ability to query by both the old
and new coding schemes. This would prove to be our performance nemesis.
How Data Migration Was
Accomplished
Our
team was divided equally: three members of our group and three business
analysts from our client. The first deliverable was a series of spreadsheets (See
Table 1) which were used to map the old and new data structures, while
capturing business definitions, data formatting definitions, and data sources.
These spreadsheets were a focal point of the project - they forced people to
take ownership of tasks by documenting their findings, leaving them open for
discussion.
Business Definition                       | NEW TABLE | NEW COLUMN | LEGACY TABLE | LEGACY COLUMN | ORACLE TABLE | ORACLE COLUMN
A person to whom health care is provided  | PERSON    | PERSON_NO  | VK00023S     | SUB_ID        | MBR          | MBR_ID

Table 1: Spreadsheets for Data Migration (Note that the data formatting columns have been omitted for display purposes.)
They
also enabled us to begin designing our third normal form (3NF) database model.
This is an extremely important step in designing a data warehouse. Most people
would never think of using the terms 3NF and data warehouse in the same
sentence, let alone the same project. But keep in mind that we had
little or no idea of what the users’ query criteria would consist of. By
implementing a 3NF model, we assured ourselves that we would supply all of the data
at its most atomic level.
We
maintained data diagrams and structures in Designer/2000 because it supported
the iterative approach we were caught up in. Our next step was to build an
interface between our spreadsheets and Designer/2000. The analysts would update
their spreadsheets, and we would update Designer/2000 by executing a home-grown
utility which would read from the spreadsheet and update the case repository.
As
the data model began to solidify, we began searching for the best way to begin
loading the database.
First,
we transferred all legacy data to the reporting server using FTP. We then
created a separate tablespace and used SQL*Loader to load staging tables. The
staging tables were given the same names as the source data files, as
documented in the spreadsheets. Next, we built a second utility. This utility
would generate scripts of insert statements with the following format:
INSERT INTO [destination table name] ([destination column list])
SELECT [source column list] FROM [source table name]; (See Example 1.)
Example 1

INSERT INTO mbr (mbr_id)
SELECT person_no FROM persons;

Note: The above statement selects all PERSON_NO values from the PERSONS staging
table and inserts them into the MBR table in the data warehouse.
These
scripts were our first cut population scripts. They were an excellent source of
coding specifications for programmers who had little or no knowledge of either
the old or new systems.
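To give a concrete picture of how such a generator might work, here is a minimal sketch, assuming the spreadsheet mappings have been loaded into a hypothetical table named MIG_MAP with columns SRC_TABLE, SRC_COLUMN, DEST_TABLE and DEST_COLUMN (these names are ours, not the project's actual objects):

SET SERVEROUTPUT ON

DECLARE
  CURSOR c_tab IS
    SELECT DISTINCT dest_table, src_table
      FROM mig_map;
  v_dest_cols VARCHAR2(2000);
  v_src_cols  VARCHAR2(2000);
BEGIN
  FOR t IN c_tab LOOP
    v_dest_cols := NULL;
    v_src_cols  := NULL;
    -- Build the destination and source column lists for this table pair.
    FOR c IN (SELECT dest_column, src_column
                FROM mig_map
               WHERE dest_table = t.dest_table
                 AND src_table  = t.src_table) LOOP
      IF v_dest_cols IS NOT NULL THEN
        v_dest_cols := v_dest_cols || ', ';
        v_src_cols  := v_src_cols  || ', ';
      END IF;
      v_dest_cols := v_dest_cols || c.dest_column;
      v_src_cols  := v_src_cols  || c.src_column;
    END LOOP;
    -- Emit one first-cut population statement per staging/warehouse table pair.
    DBMS_OUTPUT.PUT_LINE('INSERT INTO ' || t.dest_table || ' (' || v_dest_cols || ')');
    DBMS_OUTPUT.PUT_LINE('SELECT ' || v_src_cols || ' FROM ' || t.src_table || ';');
  END LOOP;
END;
/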
Our
next issue of great concern dealt with the user’s requirement to query by both
legacy and new lookup values.
We
can see by the descriptions in Table 2 that there are a total of four
different claim types from both the legacy and the new system. But by looking
at the codes, we can see that there is a collision where the value = ‘P’.
We therefore cannot possibly use the claim type codes as the unique identifier.
Legacy Code | Legacy Description | New System Code | New System Description
PD          | PAID               | P               | PAID
RJ          | REJECTED           | R               | REJECTED
P           | PREFERRED          | F               | PREFERRED
NP          | NOPAY              | N               | NOPAY

Table 2: Claim type codes and descriptions from the legacy and new systems
Since
we were committed to maintaining characteristics of both systems, we
implemented our lookup tables as in Table 3.
WHSE_ID | Legacy Code | Legacy Description | New System Code | New System Description
1       | PD          | PAID               | P               | PAID
2       | RJ          | REJECTED           | R               | REJECTED
3       | P           | PREFERRED          | F               | PREFERRED
4       | NP          | NOPAY              | N               | NOPAY

Table 3: A warehouse-generated UID (WHSE_ID) maps the separate source system values together.
Using
the above data structure design, we were able to capture the proper WHSE_ID
value from the lookup table during the load process. A front-end was then built
to allow users to select from either list of source values. Their selections
were used to retrieve the proper WHSE_ID value(s) to satisfy the user’s query.
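As an illustration of that load-time lookup (the table and column names here are invented for the sketch, not the actual warehouse objects), a claim load might capture the WHSE_ID as follows:

INSERT INTO claim (claim_id, claim_type_whse_id)
SELECT lc.claim_id,
       ct.whse_id
  FROM legacy_claim_stg lc,
       claim_type_lkp   ct
 WHERE ct.legacy_value = lc.claim_type_cd;

A second, nearly identical script would join the new OLTPS staging table to the lookup table on the new system value instead.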
Coding Style
All
of our data migration scripts were written in PL/SQL. Each table would require
a minimum of one script per data source. Cursors were used heavily, since we
needed to maintain referential integrity based upon the newly generated
WHSE_IDs. In some cases, functions were found to be useful replacements for
cursor selects. We wrote triggers to populate the WHSE_ID column by selecting
unique values from sequences.
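A minimal sketch of one such trigger, assuming a warehouse table MBR with a WHSE_ID column and a sequence named MBR_WHSE_ID_SEQ (the names are illustrative):

CREATE SEQUENCE mbr_whse_id_seq;

CREATE OR REPLACE TRIGGER mbr_whse_id_trg
  BEFORE INSERT ON mbr
  FOR EACH ROW
BEGIN
  -- Assign the next surrogate key value to each new warehouse row.
  SELECT mbr_whse_id_seq.NEXTVAL
    INTO :NEW.whse_id
    FROM dual;
END;
/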
The
load process was embedded in a series of UNIX scripts. The first script
compiled each PL/SQL script in the database. This step assisted us in catching
bugs prior to launching the load process. Although it was not foolproof, it did
provide us with about 98% assurance. The only error type it did not account for
was one that violated data formats.
The
second script executed each script sequentially, based upon a predetermined
order which satisfied referential integrity. Each step maintained log files to
allow us to monitor the process from beginning to end.
Performance Strategies (Load
Process)
There
are millions of ways to write code to accomplish any one task. The trick is to
discover the right coding techniques for each task. IT ALL COMES DOWN TO
PERFORMANCE!!! A job is not done until it works "in a timely manner."
Believe it or not, we actually benefited from our constantly evolving data
structures because they gave us the opportunity to practice tuning. Virtually
any modification to a script could result in a different execution plan.
Ask
a developer whose job it is to perform tuning and their answer will invariably
be the DBA. Ask the DBA, however, and (you guessed it) they will say it is the
developer’s job. The fact is, tuning is everybody’s job. It does, however,
begin with the developer.
It
is very important to be able to test your code on a data set comparable to that
used for production. Applications that work well on five rows might just react
differently to five million rows. This difference likely has a lot to do with the
optimizer.
Current
versions of the Oracle RDBMS come with two optimizers. The original optimizer
is known as RULE, and, as its name states, it determines the execution
plan for a given SQL statement based upon a fixed set of ranking rules. The second,
cost-based optimizer (COST) gathers statistics on the data within tables and
determines the ideal execution plan based upon those statistics. Our database
was in CHOOSE mode for optimization, although we later learned that the best
choice in this case was COST. CHOOSE lets the database decide between the two
optimizers at run time, while COST forces the cost-based approach. However, hints can be used to
override the cost-based optimizer with rule-based options.
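For reference, the statistics that the cost-based optimizer relies on were gathered in that era with the ANALYZE command; a typical call (table names are ours) looks like this:

ANALYZE TABLE grp COMPUTE STATISTICS;

-- For very large tables, estimating from a sample is much faster:
ANALYZE TABLE claim ESTIMATE STATISTICS SAMPLE 10 PERCENT;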
The
three types of joins used are nested loops, sort merge join, and hash join:
1.
The nested loops
method selects the driving table, scans it using a full table scan or index,
and then finds all rows in the driven table that satisfy the join condition.
This method is useful for OLTP type applications, but not for moving large amounts
of data.
2.
The sort merge
method accesses data by performing a full table scan of each table, sorting the
data by the joining columns, and merging the results of the sort. This method
is good for batch operations. This method will not return any rows until 100%
of the rows are merged.
3.
The hash join
builds partitions of the smaller table in memory and creates a hash table that
maps join column values to the rows in each partition.
Oracle then reads each record from the other table and probes the hash table to
find the matching partition records in memory. It is only used for equi-joins, and is terrific for
retrieving large amounts of data. HASH_AREA_SIZE (an init.ora parameter) should
be set as large as you can afford.
Hints
are suggestions for the optimizer, specifically to instruct the optimizer to
override the COST path. Hints are implemented by placing the hint immediately after the
SELECT keyword of a select statement, enclosed in comment markers with a plus sign,
‘/*+ … */’. They are not for use with group by, order by, set operators, or
distinct clauses.
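For example, a hint forcing the rule-based approach for a single statement, or forcing a full table scan of one table, looks like this (names are illustrative):

SELECT /*+ RULE */ colA
  FROM tableT
 WHERE colB = :variable;

SELECT /*+ FULL(t) */ colA
  FROM tableT t
 WHERE colB = :variable;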
Explain
Plan and TKPROF are two extremely helpful tuning utilities that come with the
RDBMS. Explain Plan allows you to check the execution plan of a SQL statement
while TKPROF allows you to check the execution plan of a PL/SQL script,
embedding Explain Plan results for each select statement in the script.
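A quick, illustrative Explain Plan session (assuming the standard PLAN_TABLE has been created with utlxplan.sql) looks like this:

EXPLAIN PLAN SET STATEMENT_ID = 'demo' FOR
SELECT colA
  FROM tableT
 WHERE colB = :variable;

SELECT LPAD(' ', 2 * (LEVEL - 1)) || operation || ' ' || options
       || ' ' || object_name plan_step
  FROM plan_table
 START WITH id = 0 AND statement_id = 'demo'
CONNECT BY PRIOR id = parent_id AND statement_id = 'demo';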
Application
tuning is a matter of trial and error. Even if you are certain that you are
using the optimal access path, check it - you can never be too sure!
Let’s
go over a few examples where TKPROF helped us.
Example
2
If
you have a query such as:
SELECT colA
FROM tableT
WHERE colB = :variable;
The
general tuning strategy would be to create an index on the column(s) in the
where clause, which are your filter criteria. However, we have found a better
alternative.
Rather
than creating a single column index, we created a composite index:
CREATE INDEX my_index ON tableT (colB, colA);
When
you only index the filtering column(s), the execution of your query will
perform two steps. The first will be a scan using that index to find the key
value. The second step would then be to retrieve that same record, using table
access by ROWID.
However,
when you create a concatenated index, you can eliminate the second step of
retrieval by ROWID, because your query can be satisfied by the index. It is
hard to estimate the savings you will see, because it largely depends on
whether or not your database parameters have been optimized. Our tests have
shown about a 30 – 40% time savings.
Example
3
Consider the following PL/SQL block:
DECLARE
  CURSOR c1 IS
    SELECT grp_id
      FROM vk00052s;
  v_grp_whse_id grp.grp_whse_id%TYPE := NULL;
BEGIN
  FOR c1_rec IN c1 LOOP
    DECLARE
      CURSOR c2 IS
        SELECT grp_whse_id
          FROM grp
         WHERE grp_extrnl_id = c1_rec.grp_id;
    BEGIN
      OPEN c2;
      FETCH c2 INTO v_grp_whse_id;
      CLOSE c2;
      /* here you are doing something else */
    END;
  END LOOP;
END;
In
this example, you read all records from the VK00052S staging table and, based
on a join, look up the matching records in the GRP table.
As
in Example 2, the logical first step is to create an index on the grp_extrnl_id
column. However, there is another, much faster way to write that code. We are
going to show how to use PL/SQL tables to obtain better performance. This is how
we did it:
DECLARE
  CURSOR c1 IS
    SELECT grp_id
      FROM vk00052s;
  CURSOR c2 IS
    SELECT grp_whse_id, grp_extrnl_id
      FROM grp;
  TYPE my_type IS TABLE OF grp.grp_whse_id%TYPE
    INDEX BY BINARY_INTEGER;
  tbl_grp_whse_id my_type;
  v_grp_whse_id   grp.grp_whse_id%TYPE;
BEGIN
  /* Load every GRP lookup value into a PL/SQL table in memory,
     indexed by the external group id. */
  FOR c2_rec IN c2 LOOP
    tbl_grp_whse_id(c2_rec.grp_extrnl_id) := c2_rec.grp_whse_id;
  END LOOP;
  /* Resolve each legacy record against the in-memory table. */
  FOR c1_rec IN c1 LOOP
    BEGIN
      v_grp_whse_id := tbl_grp_whse_id(c1_rec.grp_id);
    EXCEPTION
      WHEN OTHERS THEN
        NULL;  /* no matching GRP record found */
    END;
  END LOOP;
END;
In
this example, we populate a PL/SQL table with all records from the GRP table.
When we are ready, we read back from that PL/SQL table, and since the table
is held in memory, access takes much less time than going against the
disk.
Based
upon our sample data set, we saved about 60% of the time our original code
took.
Example
4
We
had two denormalized tables from the legacy system for which we had to create
complicated logic in order to integrate them into one, normalized structure.
Rather than having one column listing doctors’ identification numbers and
another listing each doctor’s specialty, the table would have multiple columns,
one for each instance. For example, if three occurrences of radiologists were
allowed, the table would have three columns, as follows:
RADIOLOGIST1
RADIOLOGIST2
RADIOLOGIST3
We
wrote dynamic SQL using the DBMS_SQL package, to generate normalized rows from
these legacy data structures. This code certainly did the job, although it took
a significant amount of time to process.
Once
again, we found a better method. Instead of using all of that complicated
dynamic code, we combined the two legacy tables into one using UNION ALL
and filtered out unnecessary records with a WHERE clause.
We spent some additional time creating that temporary table, but after
doing so, we realized that we no longer needed a dynamic SQL procedure. This
example shows that sometimes a less elegant method can give us the largest
performance benefit: the total time to execute our code dropped
dramatically.
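As an illustration only (the table and column names here are invented, not the project's actual objects), the combined temporary table and the static SQL that replaced the dynamic procedure might look like this:

-- Combine the two legacy tables into one temporary table,
-- filtering out the records we do not need (hypothetical filter).
CREATE TABLE legacy_dr_tmp AS
SELECT claim_id, radiologist1, radiologist2, radiologist3
  FROM legacy_dr_file1
 WHERE claim_status <> 'VOID'
UNION ALL
SELECT claim_id, radiologist1, radiologist2, radiologist3
  FROM legacy_dr_file2
 WHERE claim_status <> 'VOID';

-- Static SQL can then unpivot the repeating columns into normalized rows.
INSERT INTO claim_doctor (claim_id, doctor_id, specialty)
SELECT claim_id, radiologist1, 'RADIOLOGIST' FROM legacy_dr_tmp WHERE radiologist1 IS NOT NULL
UNION ALL
SELECT claim_id, radiologist2, 'RADIOLOGIST' FROM legacy_dr_tmp WHERE radiologist2 IS NOT NULL
UNION ALL
SELECT claim_id, radiologist3, 'RADIOLOGIST' FROM legacy_dr_tmp WHERE radiologist3 IS NOT NULL;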
Example
5
If
ever there was one tuning rule I believed in, it was that a FULL TABLE SCAN was
the optimal method for accessing a small table.
We
had a query:
SELECT colA
FROM table T
WHERE colB = :variable
and colC is null;
In
this example, Table T was small (550 records) and colB had only 270 distinct
values. We performed a FULL TABLE SCAN on table T and the CPU time usage was
0.88 seconds. After we created an index on colB and used the INDEX SCAN method
on table T, CPU time usage dropped down to 0.07 seconds. After finding this, we
changed our opinion as to the optimal method for querying small tables.
Just
a suggestion…???
Our
database block size in this example was 4K and unchangeable, in spite of our
pleas. The smaller the block size, the more reads are required to retrieve the
same amount of data, which hurts performance. We recommend using the largest database
block size allowable by your operating system because of the performance
benefits.
Performance Strategies (User
Access)
After
resolving the integration and migration issues, we could concern ourselves with
performance from the user’s perspective. Remember, the 3nf was never intended
as the primary resource for user reporting - rather, it is merely the
foundation. We implemented a series of views, denormalizations, and summary
tables to ensure timely response to user queries.
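For example, a summary table over the 3NF claim data could be built along these lines (names are illustrative):

CREATE TABLE claim_month_summary AS
SELECT mbr_id,
       claim_type_whse_id,
       TRUNC(effective_date, 'MM') claim_month,
       COUNT(*)                    claim_cnt,
       SUM(claim_amt)              total_claim_amt
  FROM claim
 GROUP BY mbr_id, claim_type_whse_id, TRUNC(effective_date, 'MM');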
We
also found that clustering a single table gave us better performance. We had a
large table against which most queries used range predicates
(<, >, BETWEEN) on the EFFECTIVE_DATE column. We created a cluster and
stored our table in that cluster. We also created a cluster index for the
EFFECTIVE_DATE column, and saw retrieval time decrease.
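A sketch of that single-table cluster (object names here are illustrative, not the actual warehouse tables):

CREATE CLUSTER claim_date_cluster (effective_date DATE);

-- The cluster index on EFFECTIVE_DATE is required before rows can be inserted.
CREATE INDEX claim_date_cluster_idx
  ON CLUSTER claim_date_cluster;

CREATE TABLE claim_by_date (
  claim_whse_id   NUMBER,
  effective_date  DATE,
  claim_amt       NUMBER
)
CLUSTER claim_date_cluster (effective_date);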
How this project can be done better
and faster today.
If
we had it to do all over again, today, we would probably take a radically
different approach. Because of the project’s volatility, we had a difficult time turning
the project over to the client’s employees. They did not have the in-house programming
skills required to enhance the load scripts we developed.
Today,
there are several viable data mapping tools on the market which radically
smooth out this transition. They place the coding element behind the scenes by
generating scripts from the mapping definitions which the user defines in the
tool. Some tools, such as the one offered by Platinum, come with an interface
to Designer/2000, while others, such as Smart’s SMARTDB, come with their own
diagramming tools.
All
of these tools process flat files by default, although they may support a
proprietary format that will assist you. One feature to look for is parallel
execution. SMARTDB can simultaneously read two files which share a referential
key and populate two separate data structures, while maintaining referential
integrity.
While
sequential processing can paralyze a load, a scheduling tool can parallelize
the load process, dramatically reducing execution time. Most have the ability
to execute scripts at specified times, or based upon an event such as the
successful completion of a prior script.
Hopefully,
all of the data migration strategies discussed above and lessons learned from
our project example will be useful in making future data migration projects go
faster and more smoothly.