Enterprise Data Architecture – Hunting the Woolly Mammoth

Dr. Paul Dorsey

Dulcian, Inc.

1. Introduction

One of the current hot topics in IT is managing the often large and complex enterprise-wide data of an organization.  This is perhaps the most important area of enterprise architecture.  It also touches on the issues of the semantic meaning of data as well as data quality.

 

Enterprise Data Architecture (EDA) is an organization’s logical and physical data environment.  It encompasses what information exists, how it is stored, and, most importantly, what it means. An organization’s EDA indicates how information is structured and how reliable that information is.

 

This paper will provide a framework for representing an enterprise data architecture.  The framework requires merging several modeling ideas and using them together to “bring down the woolly mammoth” of enterprise architecture.  

 

In most organizations, there are unconnected systems with overlapping information.  There is little, if any, documentation about the semantic meaning of the objects within those systems and their attributes. Even worse, over time the semantic meaning of the attributes (and sometimes even the objects) has changed making reports that span significant time periods meaningless and reports on older data misleading to users unfamiliar with the changing semantics.

 

There are also implicit rules about data quality that are rarely tracked, and even more rarely enforced.  Databases routinely enforce referential integrity and attribute domain values, but rules as simple as startDate < endDate are usually not enforced by the system.  Rules that are not enforced are inevitably broken, reducing the quality of the information in the database.

 

There is a compelling need to be able to access the information in the various structures.  To this end, many organizations resort to building data warehouses.  However, data warehouses are typically incapable of supporting all of the data architecture.  They are usually only focused on obtaining the information necessary to support a particular set of reporting functions. 

 

Tools by vendors like Informatica or Oracle’s Warehouse Builder are focused on the moving of data from one place to another, but do not provide information about the semantic meaning or quality of that information.

 

This paper will attempt to briefly describe the ideal architecture required for solving the enterprise data architecture problem.

 

2. The Architectural Issues

There are several questions to be answered:

  1. What are the existing systems and what is the semantic meaning of the information stored in them? 
  2. What is the overall enterprise logical data structure for the organization?
  3. How do the existing systems relate (map) to the logical data structure?
  4. What quality rules are associated with the data objects?

 

Once these questions have been answered, it will be possible to create effective reporting systems, build accurate data warehouses, identify and correct data quality issues, and better manage the data environment in general. Each of the questions listed above will be discussed in turn.

2.1 What are the existing systems and what information is stored in them? 

This is effectively a data modeling problem.  Data models can be created to describe the existing systems and the semantic meaning of the objects stored in the tables and the attributes.  Some potential problems with these data models include:

 

If the problems are particularly egregious, it may be necessary to provide multiple data models to describe the system: one or more for the physical database at a point in time and one for the logical objects that it purports to represent.  It would then be necessary to describe how the structures relate (this issue will be described in section 2.3).

 

Keep in mind, that even though this is mainly a data modeling issue, what is needed here is much more than a traditional data model.  It is necessary to describe the semantic meaning of the information in the database.  As a minimum, the tables and columns would need logical names and descriptions.    

 

There needs to be a separate model for each data structure in the organization.  There also need to be separate, smaller data models for each interface to another system.

 

One can argue about the correct way to describe a data structure.  Is the “right” syntax some form of ERD or are UML class diagrams better?  Though some may still disagree, UML class diagrams are the preferred method for this task.  They are now the industry standard for modeling and are widely taught in universities and industrial training.

 

More important than the modeling syntax is the way in which the system information is stored.  It should be placed in an open repository that can be easily queried and manipulated.  Products storing the diagram(s) in some proprietary format should be avoided.  The ideal repository is still a relational database.  Even though the industry trend is towards XML-based repositories, relational repositories are significantly easier and faster to manipulate.  XML repositories should be used only by organizations with little database expertise.

 

2.2 What is the overall enterprise logical data structure for the organization?

This question can also be viewed as a data modeling problem.  It is necessary to describe all of the logical objects of an organization and how they relate to each other.  This is the classic logical data model.  It can be partitioned into areas, but logically, it must be a single data model.  The model must be a logical union of all separate data stores in the organization.  This should be a complete representation of the organization’s data environment.

 

The model may represent any desired level of abstraction.  It is not necessary to create a class for each and every logically different type of object.  This model is intended to provide the correct level of logical abstraction to support the way in which users think, and not necessarily to maximize the number of classes in order to distinguish between all subtle differences between object types.  For example, there is nothing wrong with having a single Journal Entry class rather than one for each type of journal entry.  Similarly, it is possible to model a single journal entry detail class rather than a separate one for debits and credits.  However, it is not advisable to go to such a level of abstraction that users are unable to relate to the model. For example, having separate classes for things such as “Sale” and “Purchase” probably makes more sense than having a single class called “Merchandise Event.”   The correct level of abstraction is driven by what communicates most effectively to the user population.

2.3. How do the existing systems relate (map) to the logical data structure?

The task here is to articulate disparate database systems as well as the overall data architecture and describe how those systems relate.  The problem is that it is not possible to perfectly complete this task.  As you go from one system to another, there is usually some amount of data mismatch.  Anyone who has ever had to support a data migration is well aware of this problem. 

 

To illustrate this point, consider the example of people’s names, which are themselves complex objects.  Name elements include first, last, any number of middle names, maiden names, aliases, nicknames, former names, prefixes (Ms., Mr. ,Dr. and potentially the older Miss and Mrs.), suffixes (Jr, III, Ph.D.) some of which can be used together. In any system, designers make some choices about how to represent names.  Rarely in two different systems (even within the same organization) are these name objects represented in the same way.  Obviously, if one system captures prefix information and the other system does not, there is no way to cleanly map the data equivalence. 

 

This mapping is very difficult when the same information is gathered by two systems but the two systems represent names differently.  For example, one system might only have first and last name fields, placing prefixes and suffixes that have not been validated into the two fields.  Another system might use dedicated validated prefix and suffix fields.  Mapping those two sets of data to each other can be very challenging. In general, it is impossible to cleanly map objects from one system to another 100% of the time, even if they are storing the same information about the same objects.  Sometimes object types are stored at different levels of granularity.

 

The best solution to the mapping problem is to describe (as completely as possible) how objects from one system map to the other.  Usually, the most complete solution requires describing two maps:

By using both maps, you can explicitly represent the best way to move data between the systems. In general, maps are not completely “invertable,” meaning that knowing how to map data from system A into system B does not necessarily mean that you know how to map from system B to system A.

 

It is beyond the scope of this paper to discuss mapping algorithms.  Products supporting ETL mapping for data warehouse construction usually use repositories that adequately describe such maps.  If these repositories are open, they can be used to support EDA. There are numerous products that support mapping from one data source to another. Depending on the disparity between the two data structures, any number of those products may prove suitable for this phase of the EDA. 

 

Figure 1 graphically represents a partial EDA for an organization with four systems.  Each of those systems has been logically mapped to and from the enterprise data architecture.

 

 

Figure 1: Several systems mapped to/from logical EDA

 

 

2.4. What are the quality rules associated with the objects logically described by the data objects?

 

The final pieces necessary to complete the EDA are the data quality rules.  Quality can be categorized into several types:

1)      Simple rules supported by standard data modeling tools including:

a.       Required columns

b.      Referential integrity

c.       Simple data validation

2)      The attribute interaction rules within the same object that are easily supported using database triggers as in these examples:

a.       StartDate < EndDate

b.      For a box, Length + Width + Depth < 200 inches.

3)      Rules that span objects may be difficult to support with database triggers.  Some examples of this type are as follows:

a.       StartDate for a residence must be after the person’s birth date.

b.      A person must not have any gaps in their employment history.

4)      Contingent rules that are dependent on some condition. Any of the above rules that are only enforced when a particular condition are true are contingent. Examples include:

a.       A social security number is required prior to an employment start date being set.

b.      If the marital status of the person is “married”, then there must be a spouse record.

 

In a large system there can be thousands of these rules.  Even if simple rules are not included, there can easily be a thousand or more rules in a system. 

 

Usually these rules are not even written down anywhere. If they are known, they are enforced through application code.  Frequently, the only enforcement is through user training.

 

Anyone who has ever done a data migration from a system that did not explicitly enforce some quality rules to a system where those rules were enforced is well aware of how often rules that are not explicitly enforced are violated. A good rule of thumb in data migration is that any rule not enforced will be violated at least 1% of the time. This may mean millions of rule violations in a system.  Although Oracle provides the capability of “enable no validate” to try to solve this problem, this capability has probably caused more problems than it solved by allowing bad data to be migrated into a system.

 

What is needed is a validation rule grammar where rules can be easily articulated and represented in a repository.  Then those rules either be accessed at runtime or used to generate code. Few products exist that are capable of solving this problem. 

 

Validation rules should be described on the enterprise level, not the individual system level.  This way a rule need only be described once and can then be enforced in any system that gathers and manipulates that data. 

 

3. Now that we have the EDA, what do we do with it?

The EDA is a set of business rules that semantically describe an organization’s data and the validation rules associated with the data objects. The EDA can be used for a number of purposes:

  1. Data warehouse construction: The EDA can be used to design and construct a data warehouse.  The warehouse can either be an implementation of the logical enterprise data model, or it can be a separate system mapped from the logical enterprise data model. Since data usually only flows into a data warehouse, there is usually no need to describe the map from the warehouse back to the logical enterprise data model.
  2. Data warehouse ETL: The mappings from the various data sources to the logical enterprise data model along with the mapping from the logical enterprise data model to the data warehouse provide sufficient information to generate the ETL code needed to populate the data warehouse from the data sources. 
  3. Data quality enforcement: Using the validation rules, you can generate code that will validate existing data. The generated code can also be used to prevent incorrect data entry. 
  4. Managing system integration initiatives: You can specify a new data source and represent the map from the logical enterprise data model.  Then you can generate the ETL code that will populate the integrated data source.
  5. Building cross system views and reports: The logical mapping information can be used to generate views or otherwise support cross data source reporting.