Welcome to Foxite.COM Community Weblog Sign in | Join | Help

Designing a Database - Part 5

So far I been concentrating on the logical design for a database – the system scope, types of tables, relational patterns and rule management. The next stage is to consider the physical design which is, after all, what an application will have to deal with. In this article I am making a couple of assumptions that I am not going to discuss further here. First, that you are familiar with the concept of data normalization (see my blog "What is Data Normalization" - http://weblogs.foxite.com/andykramek/archive/2008/09/07/6681.aspx ) if you are not. Second, that you use surrogate  primary keys for all tables (if in doubt, see my blog "What is a Key" - http://weblogs.foxite.com/andykramek/archive/2008/09/03/6656.aspx ).

Defining the Data Needs

The first thing that is required to create the physical design for a database is the list of data items that must be captured and managed. Usually this is the result of some form of analysis and typically is presented in the context of either some existing data capture mechanism (e.g. printed forms, screens in an existing application, data transmitted by Email or downloaded from external source) or in terms of some functional need (e.g. Marketing, Sales Order Processing or Customer Management). However the list of data items is derived, you can be certain of one fact; it will not be structured for entry into a relational database.

One of the commonest mistakes that I see made when people are designing databases is that there is a strong tendency to create tables that model the application components, and to provide a separate column in those tables for each element required by the component. You may recall that in the first article in this series I said that

"One of the biggest causes of problems during application development (and maintenance) is the failure of the database to support business processes"

Often this arises because of the way in which the data elements were derived. For example, given an existing application it is 'obvious' how the data should be stored – the problem is that this will usually result in a database that models an application rather than a business process. As the application changes (as it will) the database becomes less and less relevant. So how to avoid that trap?

As described in Part 2 of this little series, there are actually three distinct types of data (Core, Process and Support) and each type must be stored appropriately by the database. Of course, in order to do that, the first thing to do is to identify what type of data is being dealt with. My personal preference is to simply list all of the data elements out, together with their source and any associated information and classify them. I usually create a spreadsheet for this task so that I can easily sort and manipulate the information. This is an example of what my completed data mapping worksheet looks like for a "contact" data set at the end of this article.

Of course this begs the question of whether the Address data should be held in the same table or not. There is no single 'right' answer to this question, it depends on the application. In the example above the "contact" is actually a person who is defined in the context of a specific location. In this particular case, therefore, the address information is directly related to the person and a person, by definition, can only be at one place at any time. Furthermore we really have no interest in historic data (it doesn't matter where Fred used to be, all that matters is where he is right now). Other scenarios will have different requirements and will, therefore, result in different designs.

What should be noted from the mapping is that the majority of columns in this table are for core data – in other words for data that must be specifically entered. This will normally be the case for most of your relational data tables – their primary purpose is, after all, to capture data. In fact, apart from a couple of system generated columns (PK and last update) the only non-core data is a couple of de-normalized columns that are required only  to simplify presentation of information in selection grids.

Defining Indexes

Having decided on the basic structure of a table the next task is to define the indexes that will be needed. I am sometimes asked if the simplest thing is not just to create one index for every column in the table. This is referred to as "inverting" a table and, while this would almost certainly speed up data retrieval (and is, therefore, often done in Data Warehousing implementations) adding or updating records is slowed down by the necessity to maintain all the indexes. In a transactional system (one where we are basically concerned with adding and updating data) it is important to strike a balance between the speed at which data can be retrieved, and the speed at which a new, or updated record, can be saved.

To do this we need to go back to the first article in this series where I advocated the construction of an Entity Relationship Diagram (ERD) that could be "tested" by formulating queries. This now comes into play again when we begin to consider indexes.  

The basic rule of thumb that I use  for creating indexes is that I always create an index for columns that are used in joins (i.e. the Primary and any Foreign keys). Then I look for columns that will routinely be used to filter result sets and add indexes for them too. Of course, that means that I need some way of knowing which columns will be 'routinely used' and that is where my paper prototype comes in. Since I formulated typical queries before I even had data tables I already know what my main filters are going to be and, in the case illustrated above, they would probably be the Last Name, City and (possibly) State.

Now you may be wondering about now that this is all very well for running SQL Queries but what about other things, like displaying data in sortable grids, or using different sort orders in reports. The answer is the same in all cases; you should not rely on indexes to define display orders (in back end servers you cannot do this anyway) instead use SQL to generate result sets for display and include an ORDER BY clause to handle the specific sort order. This way you keep your indexes to the minimum needed to optimize your queries and reduce the overhead when adding or updating records.

Persistent Relationships and Constraints

Different databases implement persistent relationships (Referential Integrity) in different fashions but when reduced to basics they all rely on the Primary/Foreign key relationships and hence on the indexes created for these columns. As with so much in database design, there is no absolute right answer to when you should define persistent relationships and when you should not but, again, your ERD, coupled with your understanding of relational patterns, will help.

Generally you will need a persistent relationship between tables that implement a One-To-Many relationship. This makes it possible to use the database's own mechanisms to handle cascaded deletes (when you delete the parent, all associated child records are also deleted) and to restrict the inadvertent deletions parent records that have child records.

In addition to the persistent relationships you will need Foreign Key Constraints between tables that implement a Many-To-One relationship. These relationships, you will recall, arise in two different ways. Either as lookup values based on a foreign key that may be NULL (as in the Title_FK column in my contact example above) or as the result of the introduction of an allocation table to break a Many-To-Many relationship. In either case it is imperative that, if there is a value in such a column, it only ever refers to a value that exists in the referenced key definition table.

Triggers and Rules

I discussed rules in the last article but want to reiterate, here, that before implementing any rule in the database you should ask yourself a citically important question.

If this rule is broken will it affect the ability of the database to either store, or retrieve, data?

Basically if the answer is "no" then the rule does not really belong in the database and should be implemented in either the business tier or the presentation tier.

That leaves us with the question of triggers. I noted above that to implement referential integrity my preference is always to use built in mechanisms for defining persistent relationships (also known as "Declarative RI"); not only is it simpler to set up it avoids the necessity to write custom code to handle RI and possibly introducing bugs into this critical operation. So what should we use triggers for? Again, there is no simple answer but I find that I use triggers in a couple of different ways.

First, to implement audit logging. There are many possible reasons to log changes to data; satisfying statutory requirements for monitoring the accuracy of data, to track workflow and throughput or simply to record who did what, and when, in case of errors. Triggers are ideal for this purpose because they only fire when an attempt to implement their associated action is made – whether that is Insert, Update or Delete – and they always fire whether that action is initiated by an application or through a direct edit. Of course there are some limitations and, once again, different database implement triggers in different ways but generally logging is an activity that is ideally suited to a trigger.

The second case where I find myself using triggers is when some action must trigger some unrelated action. For example, in one application I was involved with recently the update of a customer record status from "Prospect" to "Conversion" required that various tasks be generated and assigned to specific individuals in the organization. This sort of thing is best handled in a trigger for much the same reasons as logging, it doesn't matter whether the update is made through the application or via a direct update the trigger will always fire and always do its work.

Conclusion

I hope that this little series has been useful to people and, if not actually providing specific answers has at least given some guidance about the sort of things to consider when designing and implementing a database. As I have said several times, there is no single 'right' answer to any of this, but there are lots of ways to get it wrong! In my experience understanding and implementing the guidelines I have describe generally gets me into the right ballpark, if not actually all the way to home base and ensure that, at the very least avoids the most egregious errors that plague so many databases that are in use today.

 

Published Sunday, December 14, 2008 3:20 PM by andykr
Filed Under:

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

# re: Designing a Database - Part 5

Monday, January 12, 2009 3:14 PM by Stefan S.

Greetings Andy,

I've been reading your Blog since a few days now. You're doing great work there, everything is stated very clearly so that, if you have some common knowledge about developing, there's no problem to apply the topics that you mention in your blog on your own projects. Thanks for sharing your knowlege.

Best regards,
Stefan

Thank you Stefan, I hope that I can continue to keep things interesting here.
Regards Andy

What do you think?

(required) 
required 
(required)