DATA WAREHOUSES, AD HOC QUERY TOOLS AND OTHER WAYS TO DESTROY YOUR COMPANY

 

Dr. Paul Dorsey

Dulcian, Inc.

 

Introduction

Data warehouses are threatening to become the most overrated hype in the history of the computer industry. Not since the failed promises of the CASE tools of the 80’s have we seen so many products and so many experts leap into a field with so few skills, so little training and so little expertise.

 

Every consulting firm in the business claims to have been involved in dozens of successful data warehouse projects. They have all been creating data warehouses since before the term was invented. If this is the case and there is so much expertise, why are most organizations that undertake warehouse projects dismally disappointed with the results?

 

The fact is that data warehouse projects are much harder to complete successfully than traditional systems development projects. There are several reasons why this is the case:

  1. Designing a data warehouse is fundamentally different from designing an OLTP structure.
  2. The data warehouse tool environment is several orders of magnitude more complex than the traditional tool environment. Not only are there many tools available but many categories of tools to select from.
  3. Data warehouse projects are not core business systems. Therefore, they are much more sensitive to the political environment within an organization. Without complete user support, data warehouse projects are doomed to failure.
  4. The analysis process, including requirements analysis is fundamentally different from a traditional project.
  5. The problem of keeping the warehouse in synch with the production system is one that very few traditional developers have ever encountered.

 

For all of these reasons, data warehouse projects fail much more frequently than traditional systems development projects. Most presentations, papers and books on data warehousing are overly optimistic. They tend to paint a rosy picture of how, by following their method, you can build successful data warehouses. This paper will describe the less flattering reality of data warehousing. It is a world of unmet user expectations, wasted time and money, and ruined careers. Some of the more dramatic reasons for failure will be outlined. The paper will conclude with the critical success factors associated with data warehouse projects in the hope that your data warehouse project will not be among the failed projects discussed here.

 


Data Warehouse Development

The traditional impetus for a data warehouse project is often when a manager goes to a conference and sees an ad hoc query tool that looks exciting. Next, the organization sets some criteria (typically technical) and buys a tool. Notice that the tool has been purchased without gathering any user requirements. If the organization is reasonably savvy, they will undertake a small pilot project. Subsequently, they will design and build a major portion of the enterprise warehouse, usually allowing the DBA’s to drive the project. The resulting effort is then rolled out to the user community and management is left wondering why it cost so much and no one is using it.

 

Of course, one problem is that the project was completed without doing careful requirements analysis. All of the collective wisdom of the last thirty years about building traditional systems by doing careful requirements analysis is somehow deemed unnecessary when doing a data warehouse project.

 

What is a Data Warehouse?

Part of the root of the problem of failed data warehouse projects is that by calling it a data warehouse, we have already decided on the solution to a problem that hasn’t even been well defined. Therefore, I propose redefining what is meant by a data warehouse.

Traditionally, data warehousing has meant buying a data warehouse machine, installing a data warehouse database on it (usually in a star or snowflake architecture) and then accessing that information with an ad hoc query tool.

Instead, we should define a data warehouse as being "end user access to information." Conceptually, a data warehouse is simply the user view of the data. Physically, a data warehouse can be implemented in any number of ways:

  1. As views on an existing structure
  2. As a traditional data warehouse and ad hoc query tool
  3. As a flexible reporting system against a production database

Until you have carefully collected user requirements, you can’t intelligently decide which solution is best.

 

Why Do We Want a Data Warehouse?

What does the concept of the user view of data give us? What it buys us is built around the notion of iterative querying. Traditional querying involves a user faced with a decision who determines the information necessary to make the decision and requests or crafts a report to retrieve that information. With a data warehouse, we give users the ability to ask for information from the database. Then, based upon what is returned, more information is requested. This process can be repeated as many times as necessary until the user has gathered adequate information to act intelligently on behalf of the organization. This means that users really can make better decisions in a carefully thought-out data warehousing environment than in a traditional systems environment. However, for us as database designers and developers to deliver this incredible functionality, we need to be able to understand the types of decisions users make and how they make them.

 

Tied to this is the idea that, for the first time, users have the ability to feel as if they have ownership of the data. This can help users participate, not only in the way that applications are developed, but also in the way that information is structured and gathered within the organization. Although subtle, this concept can have a huge impact on the quality of our databases. When users began taking ownership of applications, user participation in application development as well as the quality of the applications created rose dramatically. Imagine the improvement in database design if users, along with developers and designers, can be thoughtfully engaged in the development process.

 

What Can a Data Warehouse Do?

There are four different aspects of a data warehouse project, any or all of which can be included:

  1. End user access to data: Standard ad hoc query and reporting tools are delivered to users.
  2. Depending upon the chosen physical implementation, you now have a dedicated database to make reporting much faster and easier to develop
  3. You can use a data warehouse project to support a very high level strategic Executive Information System (EIS) to provide access for higher level management
  4. Decision Support Systems (DSS) for business processes such as product pricing and manufacturing can also be supported by a data warehouse.

 

These four aspects are what should drive the definition of the data warehouse. A data warehouse project should not be defined by its implementation solution. It should be defined by what is trying to be accomplished.

 

Why Do Ad Hoc Query Tools Fail?

Ad hoc query tools are often regarded as a panacea for satisfying user reporting requirements. We create objects, give users access to those objects and still experience project failure. This is due to the fact that users need to be very well educated in the productive use of an ad hoc query tool. Users need to have some understanding of the following factors in order to successfully utilize an ad hoc query tool:

 

A) Users need to understand Boolean logic.

A computer’s perception of "and/or" is fundamentally different from human perception. One of my favorite examples is to try to ask a managerial user to craft a correct logical statement to support the information request: "Return a list of all employees who are neither from New York or Chicago." Users are asked to determine which of these statements correctly identifies that employee list:

 

  1. Location ¹ ‘NY’ & Location ¹Chicago
  2. Not (Location = ‘NY’ and Location = ‘Chicago’)
  3. Location ¹ ‘NY’ or Location ¹Chicago
  4. Not (Location = NY or Location = ‘Chicago’)

 

Having performed this experiment numerous times with executives at all levels, the best results I have had are indistinguishable from chance. The first time I ran the experiment with nine managers. They were asked to select one answer. The results were that only one of nine selected one of the correct answers. It took him ten minutes to figure it out and that person had been a systems professional for eight years. If an experienced systems professional needs ten minutes to figure out a WHERE clause for a simple query (and we have evidence that few managers will be able to do this), why do we expect that the reports that managers are creating using an ad hoc query tool are the reports they think they are creating?

 

B) Users need to know basic SQL and relational theory.

With very few exceptions, these tools are all obviously front-ends with SQL. If users do not understand the concept of a table and a foreign key, they cannot be expected to intelligently use tools with "group by" options on their main screen.

 

C) Users need to understand the ad hoc query tool itself.

We have done a fairly good job of training users to push the right buttons. I am, however, reminded of experiments with monkeys in space and piano-playing chickens.

 

D) Users need to understand the structure and nuances of the database being queried.

For example, in one data warehouse, the sales table had three years of information in it. The user wanted to find out which department had the best performance for the last year. The user was not aware that this table needed to be filtered to get the relevant information. The user assumed that the system would automatically choose only the current year’s information because that makes sense.

 

E) Users need to understand iterative querying.

The true impact of a data warehouse on a user is evident if iterative querying is used. Users don’t do this naturally. They must be trained to think in those terms.

 

Meta-Layer Set-Up

The meta-layer is what the user interacts with. In general, the user does not directly access the tables even in a data warehouse. Either views are created; or, if an ad hoc query tool is used, the complexities of the data structure are hidden from the user in the tool itself.

Ad Hoc Query Tool Types

The world of ad hoc query tools is an order of magnitude more complex than is generally perceived. Not only are there many tools, but many types of tools. Normally, a warehouse team will only be familiar with one or two ad hoc query tools, let alone one or two types. A full discussion of each of these types is beyond the scope of this paper. Each type of tool has its own appropriate use. Usually, in order to support all user requirements requires more than one tool from more than one category.

 

CATEGORY

TOOLS

Ad Hoc Query

Oracle Data Query

Cognos Impromptu

Business Objects

End User Reporting

Crystal Reports

R & R

 

Power User Reporting

IQ

ProReports

 

OLAP – Server side

Holos

Oracle Express

Micro

Strategies

OLAP – Client side

Cognos PowerPlay

Business Objects

 

Production Reporting

Developer/2000

SQR

 

EIS

Forest & Trees

 

 

DSS

SAS

 

 

 

Implementing a Data Warehouse Project

The back-end of a data warehouse project can be implemented various ways:

  • Views on the production system
  • Views on the production system and some redundant columns and tables
  • 3NF copy of the production system
  • Aggregated warehouse – OLAP
  • Data Marts

 

There is a spectrum of cost associated with these implementations from the lowest which involves simply placing views on the production system to the most expensive involving creating data marts. Paradoxically, the most frequently selected alternatives are the most expensive. This leads one to wonder whether what most projects are trying to maximize are development dollars.

 

The front-end can be implemented in any one of four ways:

  • Flexible reporting system
  • Canned queries
  • Single table ad-hoc reporting
  • Full ad hoc query tools

 

These front-ends follow a spectrum of both risk of failure and from the safest and cheapest flexible reporting system to the riskiest and most expensive using a full ad hoc query tool. It is now possible to build a flexible reporting system quickly and easily. For more detailed information on this topic, see Joseph P. Strano’s paper on Flexible Reporting Systems in these proceedings. Such a flexible reporting system can often replace the need for an expensive ad hoc query tool while providing users with a safe and friendly environment to generate their own reports where they are protected from runaway illogical and mis-specified reports.

 

Data Migration: The Hidden Nightmare

Getting information from a production system is always a complex task. Unfortunately, the way this is usually done is by using the production system software (usually COBOL) to build flat file images of the warehouse tables. Even most existing data migration utilities are built on this model. Performing data migration this way can be disastrous. In one case I encountered, the population utility for a 10-table pilot project required 2,000 pages of COBOL code. Often, a far better solution is to create all of the complex migration scripts within Oracle on the warehouse side rather than the legacy system side. Very simple legacy tables can be extracted into Oracle where more complex migration is easier to accomplish. See Figure 1 below.

 

 

 


Image12.gif (10497 bytes)

Figure 1: Data Migration Strategy

 

The 2,000 page COBOL script mentioned above done in a similar Oracle system required approximately 100 pages of code. In addition, we wrote a migration script generator using Oracle Forms. This generated over 95% of the PL/SQL migration script automatically.

 

Top Ten List of Reasons Why Data Warehouse Projects Fail

All of the reasons and situations mentioned are real-world examples.

  1. The most common reason for failure is when the data warehouse is built before user requirements are gathered. What we all recognize as illogical in the OLTP environment is commonplace in the data warehousing environment.
  2. Choosing a front-end tool based mainly upon cost considerations rather than selecting a tool that meets user needs can lead to project failure. Data warehouse projects are all so complex and expensive and the benefits so great that the cost of the front-end tool selected is irrelevant.
  3. Expecting users to understand Boolean logic as discussed above can cause data warehouse projects to fail.
  4. Giving users too many tables in an environment that is too complex often results in failure.
  5. COBOL migration, as mentioned above, can consume up to 80% of the total project budget.
  6. Building end-user reports using COBOL because in-house developers already know COBOL rather than learning the new reporting system can lead to project failure. Modern reporting tools are much more efficient for development.
  7. Using a data warehouse project to get started with Oracle systems can lead to failure. Data warehouses are quite complex. If an organization is contemplating switching their systems platform to Oracle, building a data warehouse is not a recommended first step.
  8. Not bringing over enough data into the warehouse can lead to failure. When a data warehouse is constructed, designers lean toward simplified star-schema or snowflake structures. Data warehouse design is different from OLTP. A full discussion of these architectures is beyond the scope of this paper. Briefly, unless you want the warehouse to also perform this function, data validating business rules need not be included. This lack of a need for rules to enforce database integrity allows database structures to be greatly simplified. The star and snowflake schemas are archetypal examples of this simplified structure. A common mistake with such structures is to not bring over enough data to meet user needs. Usually such structures aggregate at too high a level, are too incomplete and are inadequate to support user needs almost immediately after the system goes into production. This is yet another ramification of incomplete requirements gathering.
  9. Making an organization’s first data warehouse project enterprise-wide can often result in failure. This first project should be done for a subset of the organization before attempting a larger project.
  10. Even though the pilot project failed, it is assumed that enough is understood about the problems causing the initial failure to build a production system rather than doing a second pilot project. If the pilot project fails, another one should be undertaken.

 

Critical Success Factors for Building a Data Warehouse

The following are of primary importance in assuring that a data warehouse project is successful:

  1. The project leader must be experienced. The project leader should have completed other successful data warehouse projects and be aware of the different types of end-user tools including flexible reporting, ad hoc query and OLAP alternatives.
  2. Careful collection and analysis of user requirements including legacy reports is crucial. It is important to verify which reports are actually being used. Simply gathering the names of reports is not enough. All ad hoc reports gathered over time should be examined. Of particular interest are reports given to users as ASCII files which users are inserting into Excel or SAS to generate their own reports. It is essential to know specifically what users are doing with their reports.
  3. Get a sense of how users will use a new ad hoc tool. Have users keep a query log of the things they would want from a new system. Be sure to track not only reports that users would like to generate but also the decisions that they are trying to support. This will allow for assessment of the relative importance of different types of information to guide decisions about what to include in the warehouse.
  4. Do a pilot project. Choose the most enthusiastic users to do serious requirements analysis. Try several different tools for migration, back-end and front-end implementations. It is worth spending lots of money on the pilot project rather than plunging into spending even more money on a large project that fails. Expect that the first pilot project will fail. If this is the case, do another one.
  5. Make the users happy. Warehouse projects are far more politically sensitive than any other type of project. The resulting system must be easy to use, producing accurate results. All it takes is one user making one bad decision based upon the system to have all users lose confidence.

 


Conclusions

  • Warehouse projects are different from traditional relational design. Don’t assume that a skilled traditional developer will be able to lead a warehouse project.
  • With several classes of ad hoc query tools, the optimal tool selection will probably involve more than one tool from more than one class.
  • A flexible reporting system should be considered as an alternative to ad hoc query access. It is a much cheaper and safer alternative which can probably supply 90% as much as any ad hoc query tool.
  • Data migration can sink the project. Not only are migration scripts large and complex, they must be maintainable because they have to keep the warehouse in synch with the production system when the structure of either changes. This is not like a legacy migration script that is used once and discarded. Because it must be run periodically, the script must be tuned to run efficiently and maintained easily.
  • Gather user requirements carefully and completely before building the data warehouse.

 

Keeping all of these factors in mind, the likelihood of creating a successful data warehouse will be greatly increased.