Modeling Lookup Tables
One topic that caught my eye on the Foxite forum recently was a discussion of how to implement Lookup Tables. Like many such questions it provoked strong opinions and definitive statements of the "Thou shalt ...." type (i.e. the type of "dogmatic logic" based on opinion, instead of reasoned argument). The question centers on whether it is better to use a single table for all lookup data (the so-called "One True Lookup Table [OTLT] approach) or to have separate tables for each different set of lookup data that is required. As with many such data The usually quoted conventional wisdom is that:
Developers like the idea of a single lookup table, but Database Administrators do not
and the reason why developers like them is that they simplify the creation of maintenance screens and allow generic code to be written to handle the population of controls in the user interface. Conversely the reasons why DBAs do not like them are that (according to an entry in "Systems Engineering and RDBMS" at http://decipherinfosys.wordpress.com/2007/02/01/otlt-one-true-lookup-table/ )
This is an OO design pushed into the database tier and poses many problems including :
-
Not being able to force foreign keys (so your data integrity is compromised right there),
-
Not being able to use the right data-types since everything has to be a string in order to accomodate all data-type values (this itself can give rise to bad data issues since a numeric(4,2) constraint cannot be enforced –> thus domain integrity is out of the door,
-
Because of (b), the length for the string column will be huge,
-
One has to then write complex SQL queries in order to retrieve the data and also take into account the type conversions that occur when you join this table with the other tables…implicit conversion is not allowed in all RDBMS and is not a good thing anyways.
However, as all too often with such didactic statements, this is based on a couple of assumptions; first, that all lookup data must be contained within a single table and, second, that all "lookup data" is the same and must all be handled the same way. Now, if those assumptions are really correct, then there is no question that using a single table for all lookup data is a bad idea (for all the reasons given). However, are they really valid?
The key to understanding this issue lies in one of the most basic principles of good database design - which is that "a database must model the business process that it supports". When viewed from this perspective it turns out that the assumptions on which the argument is based are incorrect. What we are taking about here is what I defined in my article "Designing A Database – Part 2" (see November 2008 on this web log) as Support Data. In that article I said the following:
Although support data cannot be derived from other data and therefore requires permanent storage, it differs from core data in that it is does not form part of the body of data upon which business applications operate directly. This includes all data that is used define how the system behaves (i.e. metadata and preference data), or how the data actually stored in the system is interpreted (i.e. look-up and reference data). The support data is a key element in the database in that it is primarily concerned with the translation of data into information – and this applies equally to metadata as it does to look-up or reference data.
Notice the highlighted section. The rules of data normalization require that we use Foreign Keys to reference values that would otherwise be repeated in a table and such values are usually referred to as "lookups" leading to the idea that they are all equivalent in some fashion. In reality that is not the case and there, as indicated, TWO types of data that have to be dealt with - "lookup" and "reference" data. In the previous article I did not expand on the distinction because it was not really relevant in that context. Let me do so now, because this distinction is, in my opinion, the key to the whole OTLT issue.
Lookup Data should, as the name implies, consist only of things that can be "looked up" or, to put it another way, consist of only one piece of significant information. So for a lookup, the foreign key directly substitutes for a denormalized value. Conversely, "Reference Data" may comprise many pieces of related information and a the foreign key is used to retrieve the set of information that is applicable. You may be thinking, about now, that I am making an artificial distinction, but believe me, I am not. Table 1 shows a few examples of "lookup keys" and how they map to the different categories.
Notice that for all keys that fall into the lookup category there is always just one piece of information – a description, and nothing else. While in the reference data category there is always more than one piece of information associated with the key. I can't speak for anyone else but myself, but to me it is obvious that there is no way that storing "Customer Credit Terms" in the same data structure as "Units of Measure" would make any sense.
So the assumption that all lookup data is somehow the same is false to start with. For anything that falls into the Reference Data category, an appropriately designed table should always be used and I absolutely concur with the premise that doing anything else is poor design.
But what about the Lookup Data category? The fact is that all of this data is actually 'enumeration' data and that is probably a better term to use than "lookup" for this type of data. Reduced to basics it comes down to a Key/Value pair. Even so, if we were to use a single table for all such data we do run into the issues raised by opponents of the OTLT approach. Specifically, keys become difficult to define and either we have to enforce uniqueness on the "lookup key" or we have to introduce a 'Type' column to create compound key (i.e. Type + Value). Neither option is attractive and it would indeed complicate the SQL required if we had to adopt such a solution.
But hang on! We are talking about relational databases here (we wouldn't have foreign keys if we weren't)! So why are we constrained to use a SINGLE table? The short answer is that we are not and, for enumerations (i.e. that consisting only of code and description) we can use a PAIR of tables. The structure that I like to use is shown in Table 2.
Note that all of the issues disappear if we adopt this approach. Each table has its own (surrogate) primary key, and the relation between Type (header) and Enumeration (detail) is maintained by the foreign key in the detail table. There is no need to artificially restrict the key values in the detail table and a simple SQL inner join returns the appropriate set of enumerations. If there are specific sets of data that are used often by applications, we can define views for them and create either stored procedures, or functions, to handle specific requirements. The added columns on the detail table are for the convenience of the UI developer. The "Active" flag is used to indicate whether the value should be made available in the UI, and the "Default" flag indicates which value (if any) in the set should be used as the default in the User Interface. They do not have any significance, or impact, at the database level.
So my position is that reference data requires its own tables, but enumerations can be handled using a header-detail pair of tables. This does, however, raise another question. Should enumerations even be stored in the database? After all this data can be handled directly in code if all it is doing is providing expansion of key values to their associated description. The short answer is that they most definitely should be, and there are several reasons.
-
If enumerations are defined only in application code, then adding a new item to the selection always means recompiling, and re-distributing, the application. Admittedly if the new item requires special handling in some way, then its addition would still require code changes, but the addition of, say, a new contact type would probably not require any code change at all and can be handled by simply adding a row to a table and letting the code that accesses the enumeration tables return the new type.
-
Unless the enumeration is stored in the database, there is no way to decode the enumeration key values for reports or when exporting data to another application (e.g. an Excel spreadsheet for reporting or analysis) or to a data warehouse.
-
When multiple applications access the same database, there is not only the issue of ensuring that enumerations are available to each but also that of ensuring that they are synchronized and consistent. The scenario where one application interprets an enumeration key of 1234 as "Home Phone Number" while another defines it as "President" does not bear thinking about!
So whether we are talking about an On Line Transaction Processing, a Management Information System, or a Data Warehouse, we always have to cater for two distinct types of "lookup data". The enumerations that are used solely to normalize values that would otherwise have to be repeated and the reference data that is used to manage various aspects of the application. They are totally different entities and should be handled in different ways.
So is there a case for the "One True Lookup Table"? Not in the way that it is usually defined. The trick is not to confuse reference data with enumerations and not try and shoe-horn reference data that has different attributes into a single common structure. However, a pair of tables can certainly be a much more efficient way to handle enumerations than having lots of separate, identical tables, each of which contains only a few static rows.
For example a CRM application I worked on recently has 40 rows in its lookup header table (i.e. we would have required 40 tables if each type were handled separately) and a total of 303 entries in the detail table (there are 50 "States" and 42 "Providers", so the average number of 'rows' per type is only about 7). Can forty, identical, seven row tables really be more efficient than the header-detail pair? Hardly!