Implementing Materialized Views on a Generic Data Structure

Adrian Popescu, Dulcian, Inc.

Dr. Paul Dorsey, Dulcian. Inc.

 

 

Introduction

Most generic modeling involves the splitting up of core data structures where logical objects are stored in multiple physical tables. The object header stored is stored in one physical table and object values are stored in tall, skinny value tables. Such structures provide power and flexibility but frequently baffle junior developers accustomed to relational tables.

It is possible to have the best of both worlds by physically storing data in abstract structures and automatically generating materialized views with INSTEAD-OF triggers to enable novice developers to build traditional applications against the flexible data structures.

This paper will describe the architecture used to generate these materialized views, the technical challenges in making these views actually work and the pros and cons of using such structures.

 

What does Oracle 8i have to offer for materialized views?

The concept behind materialized views is not new to Oracle databases. It existed long before Oracle 8i, but was documented in the chapter about replication. In those days, it was called a snapshot and was used for data sharing between two usually remote databases.

 

The introduction of data warehouse features starting with Oracle8 has created a new usage for this mechanism. Queries that were collecting data from large volumes of data and involved complex statement executions can now be executed much faster by combining the theory behind regular views that hide the complexity of a data model, with the mechanisms of snapshots that instantiate data away from the OLTP environment. Such queries can now include aggregation functions and/or table joins. If the query rewrite option is used, the optimizer may even decide to use the physically instantiated data instead of re-building the statement execution plan and performing all of those costly joins.

 

Here is an example based on the BDEMOBLD.SQL script provided with the Oracle 8.1.7 installation kit:

 

CREATE VIEW SALES

AS SELECT                                 -- the underlying query

       SALESPERSON_ID, SO.CUSTOMER_ID, CU.NAME CUSTOMER,

       PR.PRODUCT_ID, DESCRIPTION PRODUCT, SUM(IT.TOTAL) AMOUNT

   FROM

 SALES_ORDER SO, ITEM IT, CUSTOMER CU, PRODUCT PR

   WHERE

 SO.ORDER_ID = IT.ORDER_ID AND SO.CUSTOMER_ID = CU.CUSTOMER_ID

   AND IT.PRODUCT_ID = PR.PRODUCT_ID

   GROUP BY

 SALESPERSON_ID, SO.CUSTOMER_ID, CU.NAME, PR.PRODUCT_ID, DESCRIPTION

 

A materialized version for this view can be:

 

SQL> CREATE MATERIALIZED VIEW SALES_MV

TABLESPACE USERS                          -- storage options

PCTFREE 10 INITRANS 1 MAXTRANS 255

STORAGE (INITIAL 131072 NEXT 131072)

BUILD IMMEDIATE                           -- build options

REFRESH FAST ON DEMAND                    -- refresh options

AS SELECT                                 -- the same underlying query

       SALESPERSON_ID, SO.CUSTOMER_ID, CU.NAME CUSTOMER,

       PR.PRODUCT_ID, DESCRIPTION PRODUCT, SUM(IT.TOTAL) AMOUNT

   FROM

 SALES_ORDER SO, ITEM IT, CUSTOMER CU, PRODUCT PR

   WHERE

 SO.ORDER_ID = IT.ORDER_ID AND SO.CUSTOMER_ID = CU.CUSTOMER_ID

   AND IT.PRODUCT_ID = PR.PRODUCT_ID

   GROUP BY

 SALESPERSON_ID, SO.CUSTOMER_ID, CU.NAME, PR.PRODUCT_ID, DESCRIPTION

 

Using the storage options can bring further performance gains by using the techniques available for regular tables, including parallel option, partitioning or nologging for the table that holds the materialized view data. 

 

The refresh options are the key point in maintaining data consistency between the source tables and the materialized view structure. The refresh options are also the limiting factor in the materialized view versatility. When they are used in a data-warehouse type environment, these limitations are not as oppressive as in an OLTP environment because the synchronization can be done on demand or on a regular schedule. The available options are:

·         refresh mode

·         FAST: specifies the refresh mode

·         COMPLETE: indicates a requery of the data source(s)

·         FORCE: a compromise, i.e. attempting a FAST whenever possible, and a COMPLETE otherwise

·         refresh triggering event:

·         ON COMMIT: ensures full-time consistency between data source(s) and the materialized view replica; it is the most expensive in terms of machine resources required

·         ON DEMAND:

·         START WITH <date>; NEXT <date>: it’s a revolving clause that is useful in non-time critical applications e.g. data warehouses

 

The differences in the execution plan for a SELECT * FROM the standard view and the materialized view are not surprising and are shown below and in Table 1.

 

SQL> select * from sales;

-------------------------------------------------------

0      SELECT STATEMENT Optimizer=CHOOSE

1    0   VIEW OF 'SALES'

2    1     SORT (GROUP BY)

3    2       HASH JOIN

4    3         TABLE ACCESS (FULL) OF 'CUSTOMER'

5    3         HASH JOIN

6    5           TABLE ACCESS (FULL) OF 'SALES_ORDER'

7    5           HASH JOIN

8    7             TABLE ACCESS (FULL) OF 'PRODUCT'

9    7             TABLE ACCESS (FULL) OF 'ITEM'

 

SQL> select * from sales_mv;

-------------------------------------------------------

0      SELECT STATEMENT Optimizer=CHOOSE

1    0   TABLE ACCESS (FULL) OF 'SALES_MV'

 

 

Joins and aggregate functions

Materialized view for the same query

recursive calls

db block gets

consistent gets

physical reads

redo size

sorts (memory)

sorts (disk)

rows processed

1438

17

317

0

0

28

0

163

229

15

53

0

0

4

0

163

Table 1: Execution plan data

If full-time synchronization is required between the data source table(s) and the materialized view, another structure is required in order to keep track of changes in the source(s), just like with the replication snapshots. This is a materialized view log, and one should be created for each data source involved in the query. Hence, for our materialized view, we would have to issue the following DDL commands:

 

SQL> CREATE MATERIALIZED VIEW LOG ON CUSTOMER WITH ROWID;

SQL> CREATE MATERIALIZED VIEW LOG ON SALES_ORDER WITH ROWID;

SQL> CREATE MATERIALIZED VIEW LOG ON PRODUCT WITH ROWID;

SQL> CREATE MATERIALIZED VIEW LOG ON ITEM WITH ROWID;

 

Moreover, the materialized view definition should include the row IDs from all the source table(s) to allow data synchronization:

 

 

SQL> CREATE MATERIALIZED VIEW SALES_MV

REFRESH FAST ON COMMIT

AS SELECT

       SO.ROWID SO_ROWID, IT.ROWID IT_ROWID, CU.ROWID CU_ROWID, PR.ROWID PR_ROWID,

       SALESPERSON_ID, SO.CUSTOMER_ID, CU.NAME CUSTOMER,

       PR.PRODUCT_ID, DESCRIPTION PRODUCT

   FROM

 SALES_ORDER SO, ITEM IT, CUSTOMER CU, PRODUCT PR

   WHERE

 SO.ORDER_ID = IT.ORDER_ID AND SO.CUSTOMER_ID = CU.CUSTOMER_ID

   AND IT.PRODUCT_ID = PR.PRODUCT_ID

 

By including the row IDs, we had to drop the aggregation function because the row IDs were not part of the GOUP BY clause. This actually raises the questions:

·         How versatile are materialized views?

·         Do they work with any type of query?

 

The restrictions come with the type of refresh modes related to the underlying query, as in our example above

If everything looks so great, does that mean that we can now jump and convert all our views into materialized views? The answer is the usual “it depends”, and the decision should be dealt with on a case-by-case basis.

Another obvious downside of it is that the data is stored at least twice. If the materialized views are used for replication purposes, this shouldn’t be an issue. However, if they are used in data-warehouse environments, then the performance gain resulting from instantiating the same information twice should outweigh the storage “loss” and the supplementary database management.

 

The generic structure

When the relational theory reached its maturity, the ERDs appeared too weak when it came to specification of the business rules. Extensions were created to complete the picture with what ERDs and relational representations could not accomplish, i.e. the dynamic side of a process.

 

At the same time, another way of thinking became apparent, namely the object-oriented approach., where UML emerged as a de facto standard in systems development. Why is UML better than ERDs? Because it is extensible when it comes to data models and/or processes, and it allows enforcement of more business rules than any ERD with its embedded grammar. This led us to the development of a business rule repository

 

The advantages of this implementation are beyond the scope of this paper. This solution created some interesting technical challenges.

 

Having a generic structure that is resilient to time and user requirements updates involves a higher degree of complexity in application development and a different database load. Instead of the usual OLTP environment with a fairly large number of tables, we ended up with a small number of tables that store the rule definitions and one “narrow” but very “tall” table that holds the data itself as shown in Figure 1. In such a configuration the Oracle optimizer is somehow perplexed, and provides very little support with regard to performance improvement. Something had to be done about the data holding table.

 

 

Figure 1: Generic structure ERD

 

Table partitioning was of little help because changes in the model would involve changes in the partitioning structure. Besides, regardless of the partitioning method, the performance gain was too small (below one order of magnitude).

 

A second option was the use of regular views based upon a set of packages that would hide the UML-engine mechanisms. This would be a relief to the junior developers who were less familiar with the UML thinking and felt more comfortable with a good old ERD. However, this would be a performance killer because not only we would have to traverse a single but large data structure, but we would also have to execute a high number of PL/SQL units.

 

Finally, we considered using the materialized views for the classes that are defined in the repository. This would give us back the options to tune the performance in a traditional manner by feeding the both the developers and the Oracle optimizer with a familiar structure.

 

Shortcomings and Solutions

We took a step back and looked at the requirements for our system, and compared those with the restrictions for the materialized views when aiming for full-time synchronization between the UML repository and the materialized view (i.e. FAST refresh ON COMMIT):

 

Requirements for a generic structure

Restrictions for materialized views

Allow dynamic WHERE clause building

Allow data filtering based on various criteria

Allow subqueries and views in the FROM clause

Allow set operators for the source data

All join predicates must be connected with AND operator

Cannot have predicates on individual tables

All objects in the FROM clause must be tables

Cannot have set operators

 

The future seemed less bright than we thought it would be.

 

However, on second look, we could borrow the idea of having the data instantiated in a relational manner, but create our own mechanism to generate these relational structures and maintain consistency between the two worlds. This is how we arrived at the structure shown in Figure 2.

 

 

Figure 2: Generic structure UML diagram

 

During the UML repository upload with class and business rule definitions, one module of the UML engine would build the table and relational view definitions for each class in the repository. It would also build the required triggers to populate the relational structures and to maintain synchronization between the two worlds. These DDL scripts can be run seamlessly against the database using EXECUTE IMMEDIATE, providing total transparency to the developers.

 

All user application data requests will be directed against the relational side of the repository. The SELECT statements will fetch the data from the relational views. On the other hand, all DML operations will be diverted by the INSTEAD OF triggers to the UML engine, which, in turn, will update the relational tables. The main purpose of the relational views is to provide the interface to the application developers and shield the relational tables from unwanted DML.

 

Conclusions

Functional limitations will keep the materialized views suitable for the distributed and data warehouse level only. Applications with tight constraints in term of consistency still have to create workarounds to overcome these shortcomings.

 

References

Yalim Gerger, Storing Business Rules Using A UML Repository, Virginia Oracle Users Group Conference (Oct. 2000)

 

About the Authors

Adrian Popescu has recently joined Dulcian, Inc. as a Senior Developer. He has worked on Oracle Designer and Developer projects in Turkey, Switzerland and Romania. Adrian has a degree in Medical Informatics from Charles University - Prague, The Czech Republic, as well as one in MSc - Software Engineering from the Polytechnic Institute - Bucharest, Romania.

 

Dr. Paul Dorsey is the founder and President of Dulcian, Inc. (www.dulcian.com), an Oracle consulting firm specializing in client/server and web custom application development. Paul is co-author with Peter Koletzke of The JDeveloper3 Handbook (2001), Oracle Developer: Advanced Forms & Reports (2000), and The Oracle Designer Handbook (1999), and with Joseph R. Hudicka of Oracle8 Design Using UML Object Modeling (1999), all from Oracle Press. Paul is the Executive Editor of SELECT Magazine. He is the President of the New York Oracle Users’ Group.