Logical Design of a Data Warehouse to Support Reporting, Ad Hoc Query, Executive Information Systems, and Decision Support Systems

 

Dr. Paul Dorsey

Dulcian, Inc.

 

Abstract

Building a data warehouse for use with an ad hoc query tool, Executive Information System or Decision Support System, is very different from the logical design of a production system. This paper discusses some data warehouse design issues and how different design decisions impact how easily various products can be integrated. An outline of a design methodology for the Data Warehouse will be presented along with an example data warehouse project proposal..

 

What is an ad hoc tool?

An ad hoc tool allows users to directly access data. When used correctly, ad hoc tools can fundamentally change the way that users interact with the data base. Through iterative querying, users can improve the quality of their decisions. However, even when correctly used, ad hoc tools will bring a system to its knees.

To minimize the system impact of ad hoc tools, most users resort to a data warehouse.

What is a data warehouse?

A data warehouse is a term that is used to mean many different things. I believe that the most natural definition for a data warehouse is a dedicated machine running its own DBMS with its own database, usually (but not exclusively) for use with an EIS. The key idea is that this a stand alone system, with minimal impact on the transaction processing system. This allows us to run EIS, DSS and ad hoc query systems without having to worry about their impact on the production database.

Why do we need a Warehouse?

The reason for the popularity of data warehouses with ad hoc tools is that ad hoc querying is unpredictable. Unrestricted ad hoc queries will bring any system crashing down unless the data base is smaller than most production systems. Even small data warehouses (2-5 gig) are large enough that most users find the need to off load them to their own servers.

The next reason is tuning. An ad hoc data base is query-only and non-volatile. The structure may be somewhat volatile but that is not a problem because of the flexibility of the best ad hoc tools. In general, an ad hoc database is tuned for query speed, exclusively. You can index as many fields as you want without fear of slowing down the transaction system. The only constraint is that there cannot be so many indexes that the time required for periodic updates is unmanageable.

 

The final reason is that the data structure of the ad hoc data base is usually very different from the production system. There is no need to worry about normalization, preventing redundant data, or any of the other things we worry about when designing production systems. In essence, you maintain a data dictionary completely independent of the production data dictionary.

 

There is really no way around the second data dictionary. We learned early on that tools such as Oracle’s Data Browser 1.0 (that required the user to query directly against the database) were too hard for the standard user. Production data dictionaries are too complex for anyone but the IS specialist. As a result, tools such as Business Objects (that assumed an extensive meta data layer) became the market leaders.

 

Perhaps the real question is that now that most users are running their ad hoc tools against a dedicated warehouse, do we still need a data dictionary within the ad hoc tool? We already have one (or more) data dictionary in the production system, and there is the second dictionary designed for the user in the data warehouse. Do we need yet another dictionary to build and maintain for our ad hoc tool? Certainly there is some value to being able to customize the dictionary for each user class, but is it worth the effort? This is an important question that I do not yet have an answer for.

 

Alternatives

One need not resort to a dedicated warehouse. It is possible to use a less drastic solution if there are adequate machine resources on the production system.

Production System

It is possible to query directly against the production system. All of the ad hoc tools are written to assume this configuration though it is already quite rare for users to run ad hoc tools against their production systems.

Dedicated Warehouse on Production Machine

Early in the ad hoc design process, it is possible to leave the ad hoc data base on the production machine in another Oracle instance. This way the ad hoc database can be tuned for querying but it is not necessary to buy another system. If the production system can handle the load, then a company could stay with this configuration.

Hybrid

An approach that is currently quite rare but will probably become the model for the future is to allow the ad hoc tool access to both the warehouse and the production system. Some queries require up to the second accuracy, while most queries work just fine with data up to a week (or even month) old.

 

Picking the right ad hoc tool

Matching the correct ad hoc tool to the users is a very difficult task. Fortunately, there are now a number of good products to choose from. At the time of this article, SQL*Assist from Software Interfaces, Business Objects, IQ, Cognos, and Andyne seem to be the products that most of their users seem to like. Oracle’s Data Query, is now only in beta, but it looks like it will also be contender.

 

There are two main considerations in choosing an ad hoc tool 1) the technical level of the users and 2) if you want to do more training for the users or if you want to make the tool easy enough to use that little training is necessary.

Know your users

Understanding the user population is essential to the success of an ad hoc project. The trick is to determine the kinds of queries to expect. If you survey the users, you will get very good data on what to expect. However, such a survey will also mislead you. Users will overestimate how many queries they will perform by a factor of 2, and the will underestimate the complexity of those queries. Almost half of the users that will clamor for ad hoc capability will never use the tool. The other half will push the tool to its limits. Count on being able to support a relatively small number of power users.

 

I have a user survey that I have made available to the ad hoc SIG. Anyone interested can contact the SIG for a copy.

 

Training or System Administration

There is no way that an ad hoc tool will decrease the amount of work that an IS department does. However, the IS department can choose whether they want to devote most of their ad hoc resources to training or to systems administration. You can either train your users to be experts in your database, or you can modify your database to be easy for users to use.

The experience of most users seems to indicate that the best solution is to put your effort into systems administration. Most of the really happy users have simplified their data warehouse to an almost absurd level. Some (very happy) organizations have gone so far as to only give their users access to single table views

 

How to build a warehouse

Strategy

The strategy phase of a warehouse design is pretty well the same as any other project. Since ad hoc query seems like a little project to most they tend to jump in without a strategy step. This can be a grave mistake. An ad hoc query project tends to be a large project with a good cost estimate being 20% of the original production system.

It is important to scope the project and to know the overall size of the effort before jumping in. Many ad hoc efforts died because it was assumed that they were relatively small projects.

 

Analysis

The analysis phase is different from most systems. Here we need to carefully assess the needs of the users. We need to know what queries each user will make. One way to do this is to have user keep a log of what queries they would have made if they had an ad hoc tool. These queries can then be analyzed. This analysis acts as the foundation for the ad hoc database design.

 

The logical design is just as difficult as the logical design of a production system. Unfortunately, it requires a very different skill set from production system design. The art behind ad hoc design is the decisions of what is the extent of the queries that the users will make and what is the best set of tables to support those queries. The way to do this is to break the queries into functional groups and to build relatively small ad hoc databases that each support a specific class of query. This approach greatly simplifies the design effort at the cost of some disk space. If the amount of duplicated information is prohibitive, the number of tables can be reevaluated in the design phase.

 

Design

The design phase has several tasks that are different from production systems. These differences are discussed in turn.

How many views of the data will be supported? The answer to this question is a balance between a simpler overall structure with views that are more complex and simpler views that can greatly increase the amount of system administration that needs to be done on the warehouse.

If the ad hoc tool supports meta data, how much will it be used? Now that most users are using dedicated data warehouses there may be little need for the tools data dictionary. In general, everything possible should be put into the warehouse dictionary and leave the tool dictionary as simple as possible. Frequently, a default 1-1 mapping of the warehouse dictionary is sufficient.

Try to use the same script for data conversion and periodic updates. There will need to be a script to populate the data warehouse initially. If that script is tuned and the warehouse is not too big then periodic updates of the warehouse can be done by destroying the warehouse and rebuilding it every time there is an update. This eliminates the worry about having an out of synch warehouse.

 

Implementation

Probably the best advice with respect to implementation is to roll out the warehouse slowly to your most advanced users. This will give you an unrealistic impression of how hard your users will be to deal with but it will give you a chance to warm the process up with users that will not be as demanding as your less technical users.

 

One strategy is to roll out the ad hoc product internally first. The disadvantage to this approach is that you may select a tool that less technical users will hate. Technical users tend to like tools that give them a lot of control over the SQL code and don’t really care about presentation quality. Non-technical users don’t care about SQL code and really want good graphing and charting capability.

 

Make sure a help desk is in place and training classes are prepared when you roll out.

 

Sample Data Warehouse Project Proposal

The following is an example of a proposal to build a Data Warehouse system for an organization which will support reporting, ad hoc query, Executive Information Systems (EIS) and Decision Support Systems (DSS). It is written in the format of a document suitable for presentation to management of an organization in the process of making decisions about modifications to an older production system.

 

I) Executive summary

The existing production systems were designed many years ago. Modern development tools and databases are roughly 10 times as efficient for application development as the current systems environment. Moreover, the current systems do not allow the organization to take advantage of emerging technologies (such as Internet applications), develop new applications (such as decision support systems), or allow end user direct information access (such as ad hoc query).

 

Existing systems will eventually have to be replaced with new systems built using modern relational databases. Unfortunately, such an effort will take several years and cost millions of dollars. It is preferable to find a way to get most of the benefits of modern systems without having to rewrite the existing production systems.

 

The best way to achieve this goal is through a Data Warehouse. A Data Warehouse is a copy of the production data stored in a separate database that is specially tuned to support reporting and other types of data access. This can be done for a fraction of the cost (in a fraction of the time) of a system redesign while providing all of the reporting benefits of a new system.

 

The proposed Data Warehouse Ad Hoc Reporting System will achieve the following benefits:

 

The new reporting system can be developed in stages so that advanced reporting capabilities can be delivered to different areas of the organization every few months. It will not be necessary to wait until the end of the project before any benefits are derived. Critical areas to support can be identified, and the new system delivered to these areas first.

 

A prototype of the system has already been done along with delivering limited functionality on the financial data subsystem. This validated the development process and guided the selection of the data access tools to be delivered to the users. As a result, an accurate estimate of what it will take to complete the project can be given.

 

Executive Information System (EIS), Decision Support System DSS, and Data Mining schedules will be determined as Phase 1 is built.

 

This will not be a temporary system. It will be used for all reporting even after current systems are redesigned and replaced. The new data warehouse system will be designed to provide long-term, flexible reporting capabilities to users.

 

 

II) Introduction

The current computer systems are designed to support the primary operations of the organization. They were built using older technology and are tuned to support both transactional processing and reporting. These systems are ill-equipped to handle applications that require advanced information retrieval capabilities.

 

Recent advances in computer technology present an opportunity to make the reporting process faster, more efficient and more responsive to the users’ needs. What is being proposed is to make a copy of the major portion of the organization’s information and store it on a separate computer. This is called a data warehouse.

 

The way that the new Data Warehouse Ad Hoc Reporting System will be designed and built will not just be by copying over the existing production database. Instead, a careful review will be done of the kinds of reports and decisions the organization makes. From this analysis, the best structure to support these specific reports and decisions will be created. This user-driven approach to designing the new data warehouse will give our system a life beyond the existing production system. Eventually the existing production system will be converted to modern technology as well. The new warehouse will not be dependent upon the production system but instead be based upon the reporting and decision making requirements of the organization.

 

Currently, with the organization’s existing systems, it requires an average of two man weeks to produce a report. This dictates the way decisions can be made. At this point in time, the decision making process involves initiating the decision issue, planning a report, gathering information, and, finally, making the decision. If more information is needed at any step in the process, this may require more analysis, testing and rework. Thus, a decision making process cycle can turn into a month long project.

 

The new system will not simply improve the efficiency of the current process. The proposed system will supply a technology that will fundamentally change the way that the organization makes decisions. It will enable the organization to develop and supply new reports overnight. In a limited way, it will allow users to create their own reports with no support from IS personnel.

 

Using the new system, decisions can be based on several successive reports pulled from the database rather than upon a single set of reports. This iterative approach to information retrieval can greatly improve managerial decision making.

 

III) Current System Inadequacies

In a rapidly changing environment, the organization’s current reporting system cannot continue to respond with adequate speed to requests for information or efficiently perform its own internal research.

 

Development Limitations -

In the current environment, it takes an average of 2 weeks to write a new report. If the shift to a Data Warehouse is made, that development time will drop to a few days per report.

 

New systems are very difficult to write on the existing systems, frequently requiring months or years of development time. Without a Data Warehouse, it will not be possible to supply an Executive Information System or a Decision Support System within a reasonable time frame.

Growing Amounts of Data -

The organization currently stores ## GB of data. Based on historical growth rates, the size of the database is expected to double in the next 2??? years. Either more hardware will have to be purchased or the organization must be satisfied with performance that is half as good as that of the current system.

 

As the amount of data increases, more time will have to be spent maintaining the software. Programs will have to be retuned to accommodate the larger data volumes. As the amount of data increases, the existing system reports will only be able to be supplied overnight since running reports during the day will slow down the system enough that day to day operations would be affected.

IV) Benefits to be Gained from the New System

The new Data Warehouse will serve to improve the functionality of the current system in the following ways:

 

V) Development Process and Costs

The key to the success of the new system is a thorough understanding of the user requirements. This will require serious involvement of the user community. When a production system is designed, it is only necessary to understand how work is done. To effectively design a data warehouse, it is necessary to understand the way that decisions are made. This is a much harder task. Access to a sampling of users from each area and level of the organization will required. Approximately 5-10 hours of time with each user is needed.

 

The greatest amount of time in designing the new system will be spent in the analysis of the user requirements. This involves extensive interviewing of personnel at all levels of the organization to find out what they need the new system to do. Making sure that all user requirements are gathered thoroughly and accurately will save a great deal of time and effort at later stages in the process.

 

However, this is not to say that nothing can be delivered quickly. If it looks like the requirements analysis phase will take too long, the new system can be developed using only the current reporting requirements. The system would then evolve as more is learned about the requirements of the various users.

 

Even if the system development is delayed until all user requirements have been collected, the system project will still be broken down into several smaller projects. One subject area at a time will be converted.

 

The remainder of the system design is similar to a traditional development process. The main exception is that closer attention must be paid to the problem of data migration since data will have to be constantly refreshed from the main production system.

 

The overall cost of the project is estimated to be between $### and $###.

 

VI) Political Environment

The success of this project requires involvement and participation at all levels. Operations people, decision makers and top management all have relevant information which needs to be brought into the warehouse design process. The organization’s personnel will need to be available to the system developers for interviews and at other key points in the system development process (as described above).

 

Without adequate input and cooperation, the system will not meet all of the necessary user requirements; and its usefulness and efficiency will be severely handicapped. All phases of the project must be endorsed at the highest levels of the organization.

 

VII) Work Plan

An important strategy in building a warehouse system is the quick delivery of some functionality, followed by time spent allowing the system to evolve as more information is gathered. Warehouse structures are inherently volatile.

 

It makes sense to have a reasonable quantity of requirements gathered before beginning the design of the new system. With a large number of user requirements, the overall structure of the warehouse may be different from one that is built from a smaller set of requirements. Therefore, enough requirements must be gathered in order to have a basic idea of the best structure before beginning to build.

 

Steps in the Work Plan:

  1. Analysis of user requirements - Careful analysis needs to be done prior to beginning warehouse development.

  2. Design of warehouse structures - Given the user requirements, a set of structures to support these requirements will be designed.

  3. Data Migration - The old data from the existing production system must be brought into the new warehouse system.

  4. Validation of the Migration Script - The procedure for moving the data from the old to the new system must be tested. The data being moved from the production system to the warehouse must be done correctly. Separate programs will be written to do this in addition to running exhaustive checks. This process will not only be able to validate the new data warehouse but will also catch errors in the existing production system.

  5. Summary and Aggregation Tables - In order to optimize performance, additional data structures will be built for the information that is retrieved most frequently.

  6. Migrating the Production Reporting System - The production reports will be run on the new system. Using modern reporting tools, it will be easy to deliver greater functionality than the existing production system. With the new system, users will be able to ask for complex variations of basic reports. For example, in a report that shows companies reporting to the organization, the user will be able to specify a report listing only those companies for a particular date range, from a particular region, within a specific industry or any combination of these categories. This can all be done by the user without any interaction with the Information Systems Department.

  7.  

  8. Ad Hoc Query Support - Users desiring the ability to generate their own queries will have direct access to an appropriate subset of the data warehouse. They can design their own reports and bring back the necessary information in real time.

VIII) Conclusion (of the sample proposal)

The proposed project will greatly enhance the organization’s reporting capabilities. The new system will provide state-of-the-art database technology in the form of database design and software. The new system will reflect a myriad of user reporting requirements at all levels. Errors and problems with the current production system will be identified and corrected. The new system will provide faster, more accurate and more flexible reports to meet the organization’s most demanding reporting requirements.

 

Conclusion

Data warehouse design is very different from standard OLTP system development. Unlike OLTP, the physical design of the warehouse is influenced by the front-end tools. The tools to be used to support ad hoc querying, EIS and DSS all influence the warehouse design.