Data Modeling Using the UML

Dr. Paul Dorsey, Dulcian, Inc.

Many people question whether any part of the Unified Modeling Language (UML) can be used for data modeling. Some have suggested creating a new tool to explicitly support data modeling. However, with some extensions, the UML can be used very effectively to design databases.

With the advent of increasingly complex systems, a clear and concise way of representing them visually became increasingly important. The Unified Modeling Language (UML) was developed by Grady Booch, Jim Rumbaugh, and Ivar Jacobson as a response to that need. In order to create a single system for modeling and documenting information systems and business processes, UML was created with an underlying object-oriented analysis and design philosophy. To build successful systems, a sound model is essential. It communicates the overall system plan to the entire development team. As stated in the UML Summary Document (UML Summary, version 1.1, 1 September 1997, Rational Software et al.), the primary goals in designing UML were the following:

· Provide the users a ready-to-use visual modeling language so they can develop and exchange meaningful models.

· Provide extensibility and specialization mechanisms to extend the core concepts.

· Be independent of particular programming languages and development processes.

· Provide a formal basis for understanding the modeling language.

· Encourage the growth of the object-oriented tools market.

· Support higher-level development concepts such as collaboration, frameworks, patterns, and components.

· Integrate best practices.

For the last several years, I have been investigating the use of UML class diagrams to design databases and I now use this mechanism almost exclusively to design data models. Much to my surprise, the UML has proven itself to be a superior tool to ERDs. UML is now the standard environment for object-oriented design and development. The most commonly recognized parts of the UML modeling environment are class diagrams, which somewhat resemble ERDs. There is some debate within both the object-oriented and relational communities concerning the applicability of UML class diagrams for representing structural data business rules such as those traditionally articulated in ERDs.

Six years ago, I attempted to resolve this issue in conjunction with the writing of Oracle8 Design Using UML Object Modeling (Dorsey & Hudicka, Oracle Press, 1999). In the course of writing that book, I concluded that UML class diagrams had not been originally intended for designing data models but were suited to the task and, in some cases, superior to ERDs. UML class diagrams with the appropriate extensions represent a significant step forward in data modeling. Structural business rules can be represented more easily and completely using an extended UML syntax than was ever possible with ERDs. This paper will show the important extensions to UML and demonstrate the advantages of using UML for creating data models that represent the structural business rules of any system.

For all of the reasons just stated, UML should be the language of choice for building object-oriented systems.

I. Overview of UML class diagrams

UML is not just a replacement for entity relationship diagramming. UML encompasses several parts that together provide a complete object-oriented development environment. The part of UML that deals with data modeling is the class diagram. This paper will discuss the UML class diagram exclusively. We will only briefly mention any of the other parts of UML. It should be noted that UML covers the entire system design environment, not just data modeling. A complete discussion of UML can be found in any of the books written on UML available in any computer book outlet.

The first version of UML included the following diagram types:

1. Class diagram: This is the data modeling diagramming language. It is similar in scope to ER modeling.

2. Object diagram: This is a class diagram for only one set of objects. Think of it as a data model where you show example data rather than the whole data model. This is very useful for explaining complex diagrams.

3. Use case diagram: A use case is similar to the idea of a “function” in Oracle’s CASE method. A use case diagram shows the interaction among actors (for example, customers and employees) and use cases. There is no analogue to this diagram in Oracle’s methodology.

4. Sequence diagram: A sequence diagram shows an interaction of objects arranged in a time sequence. This is similar to the process flow diagrammer in Oracle Designer.

5. Collaboration diagram (also called a communication diagram): A collaboration diagram shows the objects and messages that are passed between those objects in order to perform some function. There is no analogue to this diagram in Oracle’s methodology.

6. Statechart diagram: Statechart diagrams are standard state transition diagrams. They show what states an object can be in and what causes the object to change states. There is no analogue to this diagram in Oracle’s methodology.

7. Activity diagram: An activity diagram is a type of flowchart. It represents operation and decision points. This is similar to the data flowchart in Oracle Designer.

8. Component diagram: A component diagram shows dependencies and organization among components.

9. Deployment diagram (also called an implementation diagram): A deployment diagram includes the run-time processing node configuration.

UML 2.0 added four new types:

10. Interaction overview diagrams: A variation of activity diagrams that includes an overview of the system process flows.

11. Package diagrams: A subset of class diagrams used to organize elements into related groups.

12. Composite structure diagrams: These diagrams show the internal structure of items such as classes or use cases, including their interaction with other parts of the system.

13. Timing diagrams: These diagrams are used to show changes in states over time.

The three most commonly used diagrams in data modeling are Use Case, Class, and Activity diagrams. JDeveloper provides support for only these three diagram types.

A. Classes

Those familiar with Entity Relationship Diagrams and relational modeling can think of a class as equivalent to an entity. Classes represent things of interest in a system or represent abstractions of things of interest. In an ERD, entities and instances of entities translate into tables and rows. In UML, classes and objects within the class are similar to the associated elements in ERDs, although they may translate into different types of components. Just like entities, classes also have attributes. Also, it is very common to use generalization (similar to ERD super/sub-typing) in class diagrams, so it is perhaps more precise to say that a class represents a thing of interest or an abstraction of a thing of interest.

B. Attributes

Attributes are given very little attention in the UML. Most UML books barely mention them. This is unfortunate. Attributes in an OO design are a much richer construct than in ERDs. Not only do you have the normal attributes familiar to data modelers, you also have things like derived attributes. However, the biggest complication arises from generalization. Attributes may be inherited from abstract or concrete classes and the class you are in may itself be abstract or concrete. Each type of attribute in each case is handled differently. It is beyond the scope of this paper to fully explore this topic.
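The distinction between stored, derived, and inherited attributes can be illustrated with a minimal Python sketch. The class and attribute names (OrderLine, DiscountedOrderLine, line_total, and so on) are hypothetical and chosen only for illustration; they do not come from any particular model.

```python
from dataclasses import dataclass


@dataclass
class OrderLine:
    # Stored attributes: in a data model these would be ordinary columns.
    quantity: int
    unit_price: float

    @property
    def line_total(self) -> float:
        # Derived attribute: computed from stored attributes, never stored itself.
        return self.quantity * self.unit_price


@dataclass
class DiscountedOrderLine(OrderLine):
    # Inherits quantity, unit_price, and the derived line_total from OrderLine.
    discount: float = 0.0

    @property
    def discounted_total(self) -> float:
        # A derived attribute that builds on an inherited derived attribute.
        return self.line_total * (1 - self.discount)
```

Even in this small case, the subclass has three kinds of attributes (its own, inherited stored, and inherited derived), which hints at why the handling of attributes becomes complicated once generalization is involved.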

B. Associations/Close Associations

An association indicates that one object has a link to another object. It does not tell anything about the nature of that link. Sometimes it is useful to model a tighter association between objects. You might want to say that one object is part of another, or that an object is partially defined by its associations to other objects. Such associations between classes are called close associations to distinguish them from regular associations.

From a modeling perspective, the association between a purchase order and a purchase order detail is quite different from the association between a project and an employee who is acting as project manager. In the purchase order/purchase order detail case, it does not make sense to have a purchase order detail without a purchase order. The detail is part of the parent object. A purchase is, to some extent, defined by its details. The details indicate what was purchased. In addition, details about purchase orders never move from one purchase order to another.

The association between a project and its project manager is quite different. Projects are independent objects. They are of interest to the organization regardless of who manages them. A project is not defined by who the manager is; and managers can easily be replaced on projects. Similarly, project managers are simply employees who, as one of their roles, can act as a manager of a project. An employee can also have other roles. An employee need not even be associated with any project and can manage several projects at once.

The relationship between a purchase order and its purchase order details is an example of a close association, because the child cannot exist without the parent. That is, they are closely related. From an implementation perspective, close associations are interesting because items that are close associations may have different requirements. Objects built from closely associated classes should be retrieved together, so they should be stored in such a way that makes retrieval of those objects efficient. Once a class is closely associated, you may want to prevent the changing of that link or to create, update, and delete closely associated objects together, so having them stored as some kind of grouped object makes sense.

Close Associations

Using the UML, there are two different types of close associations between object classes:

· Aggregation (also called “weak aggregation”)

· Composition (also called “strong aggregation”)

Composition and aggregation are new concepts for ER modelers and will require careful explanation.

Aggregation

In the UML aggregation association, objects from one class collectively define the objects in the aggregation class. Class A is said to be an aggregation of class B if an object in class A is defined as a collection of objects from class B. Objects from class B need not be attached to any object from class A. The classic example of this kind of association is the one between a committee and a person, in that a committee is made up of the people on the committee. A committee can be defined as a collection of people.

Aggregation does not correspond to any concept in entity relationship modeling. This is a new concept of a relationship that is much weaker than the dependency relationship. Aggregation means that the two classes are more strongly related than a simple association, but they can still exist independently.

In the relational dependent relationship, the child object cannot be thought of outside the context of its parent. In aggregation, the parent usually cannot be thought of outside the context of its children. The aggregation is represented by an unfilled diamond in UML. Some classic examples of aggregation relationships in UML are shown here:

The other aspect of an aggregation association is that the details (Person and Team in these examples) have relevance outside of the context of their masters (Committee and League).
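As a rough illustration of the committee example, the following Python sketch models aggregation as a collection of references. The class and field names are assumptions made for this sketch only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Person:
    name: str


@dataclass
class Committee:
    name: str
    # Aggregation: the committee holds references to people; it does not own them.
    members: List[Person] = field(default_factory=list)

    def add(self, person: Person) -> None:
        self.members.append(person)


alice = Person("Alice")
bob = Person("Bob")          # exists without belonging to any committee

budget = Committee("Budget")
hiring = Committee("Hiring")
budget.add(alice)
hiring.add(alice)            # the same person may serve on several committees

del budget                   # dissolving a committee does not remove its people
```

The point of the sketch is that the details (people) remain meaningful and reachable whether or not any master (committee) refers to them.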

Sometimes, aggregation is used because of a unique workflow. One system encountered by one of the authors included some government contract change requests. These requests came in individually over an extended period and were eventually bundled together into a contract modification, as shown in Figure 1.

Figure 1: Aggregation example in ERD and UML formats

If you show an aggregation in your class model, what does this mean for the generated structure? The child object is closely associated with the parent object, but may also exist independently. The only impact is that if any part of the object is being modified, it should lock all related records in the aggregation.

If you are generating your user interface, aggregation could also be used to indicate that the child objects can optionally be viewed as parts of the parent object.

Composition

Composition (also called “strong aggregation”) is similar to aggregation. Class A is said to be a composition of class B if each object of class B is a part of an object of class A. Objects of class B may not exist unless they are part of a specific object from class A. Class B objects may not exist independently. An object from class B may not be a composition child of more than one object at a time, whether it is from class A or another class. In an aggregation association, the master is composed of its details, but the details can be independent of the master. In a composition association, the master is still composed of the details, but these details cannot be thought of outside the context of the master. The dependency examples of PO and PO Detail can be used to illustrate this, as shown in the ERD and UML diagrams in Figure 3.

The formal rule in the UML is that the detail can exist independently of the master until it is attached to a master. However, from that point forward, the detail must always be associated with some master. The distinction suggested here between aggregation and composition is a more restrictive condition than required by the formal UML syntax, but is more logically clean and consistent with the way in which databases interact with these constructs.

Figure 3: Simple composition association

The definition of composition is similar to the dependency relationship in ERDs, but a bit more restrictive than ERD dependency.

Composition is only used to indicate that objects in the detail object class always belong to one and only one master and have no independent meaning apart from that master. Therefore, PO/PO Detail association is a good example of composition.
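One plausible relational rendering of the PO/PO Detail composition is sketched below using Python's sqlite3 module. The table and column names are hypothetical, and the mapping shown (a mandatory foreign key that is part of the detail's primary key, with cascading delete) is only one way to express the rule, not a prescribed algorithm.

```python
import sqlite3

po_db = sqlite3.connect(":memory:")
po_db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
po_db.executescript("""
    CREATE TABLE purchase_order (
        po_id INTEGER PRIMARY KEY
    );
    CREATE TABLE po_detail (
        -- NOT NULL: a detail cannot exist without its master.
        -- ON DELETE CASCADE: details are destroyed with the master.
        po_id   INTEGER NOT NULL
                REFERENCES purchase_order(po_id) ON DELETE CASCADE,
        line_no INTEGER NOT NULL,
        item    TEXT    NOT NULL,
        PRIMARY KEY (po_id, line_no)
    );
""")
po_db.execute("INSERT INTO purchase_order (po_id) VALUES (1)")
po_db.execute("INSERT INTO po_detail VALUES (1, 1, 'widget')")
po_db.execute("INSERT INTO po_detail VALUES (1, 2, 'gadget')")


def orphan_rejected() -> bool:
    """Return True if a detail without an existing master is refused."""
    try:
        po_db.execute("INSERT INTO po_detail VALUES (99, 1, 'orphan')")
    except sqlite3.IntegrityError:
        return True
    return False
```

This sketch does not prevent a detail from being moved to another purchase order; enforcing that part of the rule would need an additional mechanism such as a trigger.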

Composition in UML is slightly more restrictive than dependency in an ERD. For example, in an ERD you might want to say that a Course at a university is dependent upon the Department where it is offered. Furthermore, specific Offerings of this course are dependent upon the Course, as shown in Figure 4. However, this would not be a composition in the UML.

Figure 4: Composition (UML) Dependency (ERD) comparison

Notice how the UML in Figure 4 uses simple association. Actually, using composition in this case would not violate the composition definition in UML, nor would it violate our more restrictive definition. However, in practice, object-oriented designers only use composition when the composition detail objects are created and destroyed at the same time as the parent object. Because Courses and Course Offerings are created completely independently from their parents, composition should not be used in this situation. Thus, from an implementation perspective, you might want to use composition even though UML tradition would argue against it.

The implementation of composition is similar to aggregation. Modification of any one of the related objects should lock the whole group.

If you are generating your user interface, composition could also be used to indicate that the child objects can only be viewed or modified as parts of the parent object.

C. XOR

In ERDs, you can specify that a particular instance of an entity can be associated with either an instance of one entity or another but not both. This is shown by a line that connects the two relationships. Of course, the same construct exists in UML. However, this structure is used far less frequently. The stronger generalization model in UML means that modelers will usually create an abstract generalization class attached to the association, thus eliminating the need for the XOR relationship.
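At the database level, an XOR between two associations is commonly approximated with two optional foreign keys and a check constraint. The following sqlite3 sketch assumes a hypothetical Account that is owned by either a Person or a Company; all names are illustrative.

```python
import sqlite3

xor_db = sqlite3.connect(":memory:")
xor_db.executescript("""
    CREATE TABLE person  (person_id  INTEGER PRIMARY KEY);
    CREATE TABLE company (company_id INTEGER PRIMARY KEY);
    CREATE TABLE account (
        account_id INTEGER PRIMARY KEY,
        person_id  INTEGER REFERENCES person(person_id),
        company_id INTEGER REFERENCES company(company_id),
        -- XOR: exactly one of the two associations must be populated.
        CHECK ((person_id IS NOT NULL AND company_id IS NULL)
            OR (person_id IS NULL AND company_id IS NOT NULL))
    );
    INSERT INTO person  VALUES (1);
    INSERT INTO company VALUES (1);
""")


def try_insert(person_id, company_id) -> bool:
    """Attempt to create an account; return False if the XOR rule rejects it."""
    try:
        xor_db.execute(
            "INSERT INTO account (person_id, company_id) VALUES (?, ?)",
            (person_id, company_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With an abstract generalization class (for example, a hypothetical Party with Person and Company as subclasses), the account would need only a single association, which is why the XOR construct is needed less often in UML.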

In UML, you need not restrict yourself to XOR as the only relationship among associations. Other interactions among associations are possible, but are unusual and beyond the scope of this paper.

D. Generalization

The generalization association is a concept similar to that of a supertype/subtype relationship in an ERD. For example, an Employee can be either hourly or salaried as shown in this ERD:

In UML, the same concept can be represented as shown here:

This UML diagram indicates that you have an Employee class. The {abstract} constraint indicates that the class cannot have any independent objects. If the abstract constraint were omitted, this would indicate that it is possible to have an employee who is neither hourly nor salaried. Since the employee class cannot have any objects, what is its purpose? If there are attributes defined for the Employee class, they are inherited by the Hourly and Salaried classes. For example, a First Name and Last Name attribute defined for the Employee class would automatically be inherited by the Hourly and Salaried classes.

Associations to the Employee class are also inherited. An association between Employee and Department as shown in diagram A of Figure 5 also means that the salaried and hourly classes inherit the association to the department class just as if it had been drawn as shown in diagram B of Figure 5.

Figure 5: Association Inheritance

Methods are also inherited. For example, defined methods such as “Hire,” “Fire,” or “Give Raise” in the Employee class would automatically be inherited by the Hourly and Salaried classes.
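The Employee example can be sketched in Python as follows. The abstract base class plays the role of the {abstract} constraint. The weekly_pay method and the pay-related attributes are hypothetical additions used only to give the subclasses something to specialize.

```python
from abc import ABC, abstractmethod


class Employee(ABC):
    """Abstract class: it cannot have objects of its own."""

    def __init__(self, first_name: str, last_name: str):
        # Attributes defined here are inherited by every subclass.
        self.first_name = first_name
        self.last_name = last_name

    def full_name(self) -> str:
        # A method defined once and inherited by Hourly and Salaried.
        return f"{self.first_name} {self.last_name}"

    @abstractmethod
    def weekly_pay(self) -> float:
        """Each concrete subclass supplies its own calculation."""


class Hourly(Employee):
    def __init__(self, first_name, last_name, hourly_rate, hours):
        super().__init__(first_name, last_name)
        self.hourly_rate = hourly_rate
        self.hours = hours

    def weekly_pay(self) -> float:
        return self.hourly_rate * self.hours


class Salaried(Employee):
    def __init__(self, first_name, last_name, annual_salary):
        super().__init__(first_name, last_name)
        self.annual_salary = annual_salary

    def weekly_pay(self) -> float:
        return self.annual_salary / 52
```

Attempting to create a plain Employee raises a TypeError, which corresponds to the rule that an abstract class has no independent objects.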

II. Translation of Class Diagrams to a Relational Database

Translation from a class diagram to a relational database is not obvious. Of course, classes more or less map to tables and attributes map to columns. But the situation is more complex.

Support for generalization is particularly problematic. The traditional approach is to generate a table for each class. Either the inherited attributes are copied into each subclass table, resulting in denormalized tables, or they are left in the superclass table, requiring a multi-table join. Neither of these situations is viable. This leaves the modeler with two alternatives:

1. Don’t use generalization.

2. Only use generalization for analysis and remove it for the implementation model.
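The two table-per-class outcomes described above can be made concrete with a small sqlite3 sketch. The table and column names are assumptions based on the Employee example and are not the output of any particular generator.

```python
import sqlite3

gen_db = sqlite3.connect(":memory:")
gen_db.executescript("""
    -- Outcome 1: inherited attributes repeated in every subclass table.
    CREATE TABLE hourly_flat (
        emp_id INTEGER PRIMARY KEY,
        first_name TEXT, last_name TEXT,      -- repeated from Employee
        hourly_rate REAL
    );
    CREATE TABLE salaried_flat (
        emp_id INTEGER PRIMARY KEY,
        first_name TEXT, last_name TEXT,      -- repeated from Employee
        annual_salary REAL
    );

    -- Outcome 2: inherited attributes kept only in the superclass table.
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        first_name TEXT, last_name TEXT
    );
    CREATE TABLE hourly (
        emp_id INTEGER PRIMARY KEY REFERENCES employee(emp_id),
        hourly_rate REAL
    );

    INSERT INTO employee VALUES (1, 'Ann', 'Lee');
    INSERT INTO hourly   VALUES (1, 20.0);
""")

# With outcome 2, reading one complete hourly employee always needs a join.
hourly_row = gen_db.execute("""
    SELECT e.first_name, e.last_name, h.hourly_rate
    FROM employee e
    JOIN hourly h ON h.emp_id = e.emp_id
""").fetchone()
```

With deeper generalization hierarchies, the first outcome repeats more columns and the second adds one join per level.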

Derived attributes pose a similar problem. If designers use them, you end up with redundant columns in the database, usually resulting in 3NF violations. If a generator is going to translate classes into tables and attributes into columns, then either the modeler must avoid many class diagram elements or the resultant database will not be in 3NF, will not be usable, or both.

The alternative to direct translation of classes to tables is to generate both the table and an interface object (a view, ADF BC entity object, EJB or TopLink element). Then the database can be created as a 3NF structure with redundant elements residing only in the interface object. Of the three products reviewed here, only BRIM® supports this alternative generation mechanism.
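The table-plus-interface-object idea can be sketched with a view as the interface object. This is a simplified illustration using sqlite3 and hypothetical names; it is not the output of BRIM or of any other product.

```python
import sqlite3

view_db = sqlite3.connect(":memory:")
view_db.executescript("""
    -- 3NF table: only non-derived attributes are stored.
    CREATE TABLE order_line (
        line_id    INTEGER PRIMARY KEY,
        quantity   INTEGER NOT NULL,
        unit_price REAL    NOT NULL
    );

    -- Interface object: exposes the derived attribute without storing it.
    CREATE VIEW order_line_v AS
        SELECT line_id, quantity, unit_price,
               quantity * unit_price AS line_total
        FROM order_line;

    INSERT INTO order_line VALUES (1, 3, 2.5);
""")


def line_total(line_id: int) -> float:
    """Read the derived attribute through the view."""
    return view_db.execute(
        "SELECT line_total FROM order_line_v WHERE line_id = ?", (line_id,)
    ).fetchone()[0]


total_before = line_total(1)
view_db.execute("UPDATE order_line SET quantity = 4 WHERE line_id = 1")
total_after = line_total(1)   # reflects the update; nothing can drift out of step
```

Applications read from the view as if the class had a line_total attribute, while the underlying table carries no redundant column.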

III. Products

Several products that use UML for data modeling are worth discussing.

A. Oracle’s JDeveloper 10g

Oracle’s JDeveloper 10g product has two mechanisms for generating tables from class diagrams. The first is from the Entity Object Modeler in the Application Development Framework (ADF) Business Components (BC) portion of the product. Oracle has architected its own middle tier component originally called Business Components for Java (BC4J) and now marketed as the business component portion of its Application Development Framework.

The normal usage of the business component framework is to start out with a fully-formed relational database and build middle tier components, generating most of the structure from the database. Developers can then modify this structure, adding significant business logic in the middle tier.

To accommodate requests from users wanting to model within the same tool, Oracle added the capability of first building the middle tier components and then using these components directly to generate the database. The problem with this approach is that the mapping from business component elements to the database is relatively simple-minded:

· Business component entity objects are directly translated to relational tables.

· Entities become tables.

· Attributes become columns.

Using this approach, foreign key attributes must be manually specified prior to generation. During generation, tables are dropped and recreated so that even the simple addition of an attribute cannot be done if there is already data in the tables.
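The consequence of a drop-and-recreate strategy can be seen in a toy generator. The generate_ddl function below is a hypothetical stand-in written for this sketch; it is not JDeveloper code and makes no claim about how the product is implemented internally.

```python
import sqlite3


def generate_ddl(entity: str, attributes: dict) -> str:
    """Naive mapping: one entity becomes one table, each attribute one column."""
    columns = ", ".join(f"{name} {sql_type}" for name, sql_type in attributes.items())
    # Regeneration always drops the existing table first.
    return f"DROP TABLE IF EXISTS {entity}; CREATE TABLE {entity} ({columns});"


naive_db = sqlite3.connect(":memory:")

# First generation, followed by some real data.
naive_db.executescript(
    generate_ddl("employee", {"emp_id": "INTEGER", "last_name": "TEXT"})
)
naive_db.execute("INSERT INTO employee VALUES (1, 'Lee')")
naive_db.commit()
rows_before = naive_db.execute("SELECT COUNT(*) FROM employee").fetchone()[0]

# Adding a single attribute and regenerating discards the existing rows.
naive_db.executescript(
    generate_ddl(
        "employee",
        {"emp_id": "INTEGER", "last_name": "TEXT", "first_name": "TEXT"},
    )
)
rows_after = naive_db.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
```

A generator intended for use against populated tables would need to emit incremental changes (for example, ALTER TABLE ... ADD COLUMN) instead of recreating the table.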

The second JDeveloper mechanism for data modeling is an explicit class diagram where classes can be stereotyped as tables. This is a relatively straightforward database modeler where users define tables, columns, foreign keys, check constraints, etc. as in any other modeling tool. There is no notion of generalization or derived attributes included.

In both of the JDeveloper modeling mechanisms, the metadata repository is stored in XML files where the information is readable and, to some extent, editable. However, users should stick to the IDE or underlying APIs to interact with these XML files.

Oracle’s support for generating the database from JDeveloper 10g does not approach the vision of this paper where a designer might create a logical class diagram and have the product generate the appropriate database.

Pros: Tight integration with database, part of JDeveloper

Developers want to be able to create models within JDeveloper. JDeveloper 10g introduces a physical database modeler that allows users to specify tables, columns, and foreign key relationships using UML class diagrams.

JDeveloper is second to none as a Java IDE. As data modeling evolves, it will be a valuable part of the tool.

OO-centric designers who are satisfied with the generation algorithm will be happy with the product.

Cons: Somewhat limited in scope

Full UML-based modeling including inheritance, aggregation, and composition capabilities would be a welcome addition to JDeveloper at some point. The 10g release includes the beginnings of a solid data modeling tool using a limited subset of the UML.

The Database Modeler cannot be used yet for complete logical and physical database design. Oracle Designer should still be used for that purpose. JDeveloper users without access to a full-featured database design tool such as Oracle Designer may find JDeveloper's modeling capabilities adequate for simpler applications.

B. IBM’s Rational Rose Data Modeler

Paradoxically, IBM’s Rational Rose Data Modeler seems to be marginally farther along than JDeveloper in the generation of Oracle databases. Using this tool, it is possible to specify limited class diagrams which forward generate into data model class diagrams. Primary keys are automatically specified as foreign keys on relationships. However, fundamentally, this tool has many of the same limitations as JDeveloper.

The repository for Rational Rose is in its own proprietary document format, which is accessible and editable, although making changes is not very easy. Where this tool excels is in the development of the software management process. Many resources have been devoted to the software development architecture, but resources devoted to database design are lacking. It would be useful to find out how the Rational team views database design from reading white papers on the IBM/Rational website. However, the overview articles do not even mention database design as being a part of the process.

In both Rational Rose and JDeveloper, database design virtually seems like an afterthought or a reflection of the relatively minor role played by database design in many OO-centric development teams.

Rational Rose Data Modeler is a visual modeling tool that makes it possible for database designers, analysts, architects, developers and anyone else on your development team to work together, capturing and sharing business requirements, and tracking them as they change throughout the process. It provides the realization of the ER methodology using UML notation to bring database designers together with the software development team. With UML, the database designer can capture information such as constraints, triggers and indexes directly on the diagram rather than representing them with hidden properties behind the scenes. Rational Rose Data Modeler gives you the freedom to transfer between object and data models and take advantage of basic transformation types such as many-to-many relationships. This tool provides an intuitive way to visualize the architecture of the database and how it ties into the application.

Pros: Industry-standard, Java-friendly

IBM’s Rational Rose is the industry standard for OO design and development. The product is mature and well written. It also provides good support for the industry-standard way that OO people want to design databases, namely that “a database is nothing more than a place to store persistent copies of our classes.”

It is the best tool for generating tables that look like classes.

Cons: Odd generation algorithm

For all of its strengths, either you will end up with a poor database design or you will not use much of the richness of class diagrams. Classes and attributes will get directly translated into tables and columns.

C. Dulcian’s BRIM®[1]

Dulcian, Inc.’s offering for data modeling using UML employs a business rules approach to fully generate systems using an “executable UML” approach. BRIM works exclusively in the Oracle environment and is not portable to other database structures.

Within BRIM, a class diagram is specified including inheritance, derived attributes, etc. Views and relational database tables to support the class diagrams are simultaneously generated. Using this approach enables BRIM to include a much richer specification within the class diagram than is available in other tools. Derived attributes and generalization are also explicitly supported. The BRIM repository is stored in an Oracle database which can be queried or updated through APIs.

One disadvantage of the additional functionality that BRIM provides is some loss of developer control and flexibility regarding how things are generated. In comparison, JDeveloper and Rational Rose include less functionality than BRIM but do not force a specific generation algorithm as BRIM does.

BRIM chose to use object IDs (OIDs) as the physical primary key for all tables. It still stores and enforces the logical primary key, but uses OIDs to keep the implementation simpler.
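The general pattern of a surrogate physical key combined with an enforced logical key can be sketched as follows. The schema is hypothetical and written for sqlite3; it illustrates the pattern only and is not BRIM's generated DDL.

```python
import sqlite3

oid_db = sqlite3.connect(":memory:")
oid_db.executescript("""
    CREATE TABLE department (
        oid       INTEGER PRIMARY KEY,      -- physical key: system-assigned object ID
        dept_code TEXT NOT NULL UNIQUE,     -- logical key: still stored and enforced
        name      TEXT NOT NULL
    );
    CREATE TABLE employee (
        oid      INTEGER PRIMARY KEY,
        -- Foreign keys always reference the single-column OID,
        -- regardless of how complex the logical key is.
        dept_oid INTEGER NOT NULL REFERENCES department(oid),
        emp_no   TEXT NOT NULL UNIQUE
    );
    INSERT INTO department (dept_code, name) VALUES ('FIN', 'Finance');
""")


def duplicate_logical_key_rejected() -> bool:
    """Return True if a second department with the same logical key is refused."""
    try:
        oid_db.execute(
            "INSERT INTO department (dept_code, name) VALUES ('FIN', 'Finance 2')"
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

Because every foreign key is a single OID column, joins and generated code take the same shape for every table, while the unique constraint continues to protect the business meaning of the logical key.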

Generalization sometimes causes redundant columns to be generated in underlying tables. The views that interface with the tables keep the data from getting out of synch.

Pros: Repository-based, rich functionality

BRIM generates both tables and views so that translating from a class diagram to a database results in both a good database design to make DBAs happy as well as a set of structures to make an OO development team happy.

Once the business rules have been placed in the BRIM repository, the system is generated. This is one of the real strengths of the BRIM environment. It is not necessary to wait for the system to be complete before generating a first version. BRIM developers should get into the habit of generating a system as soon as enough of the system has been entered to test. Additional system pieces can be quickly generated, supporting a RAD environment with virtually no cycle time.

The BRIM repository is a set of Oracle tables. Population of the repository need not be done exclusively through the Repository Manager. A complete set of APIs exists (some used by the Repository Manager), any or all of which can be used to manipulate the repository.

Cons: Oracle only; Highly proprietary solution

BRIM only works in the Oracle environment and takes a very strong stand on the “right” way to generate a database (and indeed the whole system). If you buy into the philosophy of the product, you will be very happy, but if you are looking for many options in the generation algorithms, BRIM will not support these options.

Conclusions

UML class diagrams are a great way to do data modeling. Unfortunately, the tools to support this approach are still evolving. OO-centric tools (like IBM’s Rational Rose) cater to OO designers who have little interest in good database design. Mainstream relational vendors such as Oracle have yet to figure out how to use a class diagram to generate a well-designed database. Fringe products like Dulcian’s BRIM may show promise but lack the credibility of products from larger companies.

About the Author

Dr. Paul Dorsey is the founder and president of Dulcian, Inc., an Oracle consulting firm specializing in business rules and web-based application development. He is the chief architect of Dulcian's Business Rules Information Manager (BRIM®) tool. Paul is the co-author of seven Oracle Press books on Designer, Database Design, Developer, and JDeveloper, which have been translated into nine languages. He is President of the New York Oracle Users Group and a Contributing Editor of IOUG's SELECT Journal. In 2003, Dr. Dorsey was honored by ODTUG as volunteer of the year, in 2001 by IOUG as volunteer of the year and by Oracle as one of the six initial honorary Oracle 9i Certified Masters. Paul is also the founder and Chairperson of the ODTUG Business Rules Symposium (now called Best Practices Symposium), currently in its sixth year, and the J2EE SIG (www.odtug.com/2005_J2EE.htm).

[1] In the interest of disclosure, BRIM® is the tool architected by the author of this paper.