Welcome to Foxite.COM Community Weblog Sign in | Join | Help

What is Data Normalization?

Data normalization is a technique that is applied when designing relational database tables in order to minimize the duplication of information and, by so doing, safeguard the database against logical or structural anomalies that could compromise the integrity of the data. By reducing the amount of data duplication, or 'redundancy' normalization achieves four things:

·         Minimize the amount of space required to hold the database   By removing the need to store the same information in more than one place, the physical size of the data is restricted to a minimum
·         Remove the risk of inconsistencies    Since each data element occurs only once, there can be no inconsistency between 'versions' of the same information
·         Minimize possible Update and Delete anomalies   Since each element occurs only once, there is no need to repeat operations in multiple locations when there is the risk that one may succeed while another fails. Additionally by ensuring that all columns in a table are dependent on the primary key, referential integrity is easily enforced
·         Maximize the stability of the data structure   One thing that is certain about any computer application is that the requirements will change. A stable data structure can handle changes to processing requirements without requiring direct modification of the data structure

The technique defines a series of distinct stages, or "Forms", which represent successively greater degrees of sophistication and that are achieved by the application of ever more rigorous rules. The process begins by reducing the data to First Normal Form (1NF) and then continues to the Fifth Normal Form, however, when using surrogate primary keys there is no need to go beyond the Boyce Codd Variation of Third Normal Form (higher levels only apply when the primary key is composed of more than a single column).

First Normal Form

Reduce tables to first normal form (1NF) by removing repeating or multi-valued fields to another, child table.

The obvious question here is what constitutes "repeating or multi-valued" fields. The data requirements for a "Customer Contact List" might include the following information:

·         Account Number
·         Customer Name
·         Primary Contact Name and Telephone Number
·         Alternate Contact Name and Telephone Number

In a flat-file based system we would create a spreadsheet with these items as columns as illustrated:

Table 1: Flat-File Database

Clearly the "contact" and "phone" fields are actually multi-valued fields. The data in each is identical (a name and phone number) and the only difference is in the way in which they are interpreted (i.e. as either the primary or alternate contact). To reduce this data to first normal form we must, according to the rule, remove these columns into their own table and link it to their owning record with a foreign key. An additional key can be used to handle the "interpretation" by linking each contact record to a lookup table that defines the possible uses for the information as illustrated:

Note: The 'crows foot' on the linking line indicates that there are many records in the table for each occurrence of the related key. Thus for each Customer there can be many records in the Contact table.

Figure 1: The contact list after first pass at reduction to 1NF

The benefit of this structure is that we can now add as many contacts as we wish to our customer – we are no longer limited to just 2. However we are not yet finished.

Look closely at the "Contact Table". Although we have now removed the multi-valued fields into their own table, we still have potentially repeating fields. Each record in this table would require a repetition of the contact's name whenever a contact had more than one associated item of information (e.g. Phone, Cell and Fax numbers). So, in accordance with the rule we should now remove contact detail into a separate table (and we can again create a lookup for the definition of the information). The result is the structure shown in Figure 2

The benefits of this structure are clear. We can now handle an unlimited number of contacts per customer – by simply adding new records to the 'Contact' table and linking them to the appropriate customer record. We can also add new types of contact (by adding records to the Contact Type table). Moreover it is now possible to define an unlimited amount of detail per contact.

Remember that one of the key goals of normalization was to create a stable structure that does not need to change when requirements change. So, if we decided we now need to capture mobile phone numbers, or our contact's birthday to send them a card, we could do so by adding a new detail type record, and adding the appropriate information to a new record in the Contact_Detail table. No structural changes to the database are needed!

Figure 2: The contact list in First Normal Form

Notice also that each element of data appears once, and only once, in only one table (Figure 3). Thus the only data in the Customer table is that which relates directly to the customer entity (in this case it happens to be just a name and an account number). Similarly the only data in the Contact table is that which relates directly to the contact entity - in this case it is just a name.

To reconstruct the original 'flat' data we simply join the tables using their related keys. However the most significant benefit is that we can now also access the data in different ways. For example if we wanted a list of all companies for which we hold a fax number we can do it by selecting only those Customer records which have related contacts with a detail record that includes Type = 2 (defined as a Fax Number).

By separating the data into its component parts in this way we also avoid the possibility of error when updating information because we only ever have to update a single record. So, with reference to Figure 3 below, if Alan Anders (Contact_PK = 1) changes his phone number the only record that has to change is the detail table record where contact_fk = 1 and dettype_fk = 1 (phone). 

The same thing applies when deleting information. If Colin Charles (whose Contact_PK  = 5) were to leave "Canny Castings" we simply  delete his record in the "Contact" table, and any records in the "Contact_Detail" table where the "contact_fk = 5" and any application that displays contact information will immediately reflect the change.

Figure 3: The data from the spreadsheet in relational tables

Second Normal Form

Reduce 1NF tables to second normal form (2NF) by removing fields that are not dependent on the whole primary key.

This deals primarily with cases where a surrogate key is not being used, and the primary key consists of more than a single field. Consider the following partial definition for an invoice table:

·         Invoice Number
·         Line Number
·         Customer ID
·         Product ID

In the absence of a surrogate key, the minimum unique identifier for this table is the combination of Invoice and Line Numbers. Now the question is, do each of the other data elements depend on this key? Clearly the Product ID does - since the product ID is defined by each line on a specific invoice. But the Customer ID is not. It depends only on the invoice number – the line number is irrelevant in the context of the customer.

The solution is, therefore, to remove the Customer ID and Invoice ID to a new table, which is the "Invoice Header" and keep the line item information as the "Invoice Detail", as illustrated in Figure 4, where we have also added a surrogate primary key to the new table to avoid using the Invoice Number directly as the primary key.

Figure 4: 2NF resolution of the invoice data

As illustrated this approach immediately solves another, related, issue. How to differentiate and save the information that relates to the invoice as a whole (e.g. order date, due date, total price, delivery and shipping information) from the information related to individual line items (e.g. product, quantity, price, discount, margin).

Third Normal Form

Reduce 2NF tables to third normal form (#NF) by removing fields, other than candidate keys, that depend on other non-key fields

Essentially this rule is saying that all fields in a table must be dependent on the primary key, and only the primary. This is often referred to as meaning 'the key, the whole key and nothing but the key'. Consider the following partial specification for an ORDERS table:

Table 2: Order Table specification

This table actually conforms to 2NF (the Vendor ID, and the Location of the Vendor are both dependent on the order)  but fails 3NF because the Vendor Location is also dependent on the VendorID and therefore should not be included in this table.

Looking at the invoice header table in Figure 4, we can see that it is, in fact, in 3NF. All of the fields identified (except the foreign key to the customer table) depend directly upon the invoice number – which is itself a candidate key since it must be unique within the table.

Boyce-Codd Normal Form

Reduce 3NF table to Boyce-Codd normal form (BCNF) by ensuring that all tables are in 3NF for any feasible choice of candidate key as the primary key

This variant of the 3NF rule is specifically applicable when using surrogate primary keys. Since the surrogate key is, by definition unique within the table, we need to ensure that the table is in fact in 3NF for any other field that could, potentially, be used as a primary key. The Invoice header table in Figure 4 above contains such a candidate key (the invoice number) and the table, as shown, does conform to BCNF since all columns are also dependent directly on the invoice number.

Fourth and Fifth Normal Forms

There are two additional normal forms in database theory but they are concerned only with cases where the primary key for a table is comprised of multiple columns and so, when using a surrogate key, are not applicable. This, in itself, is another reason for using surrogate keys when designing relational tables.

Summary

Normalization is designed to improve the efficiency with which data is stored in a database by reducing data redundancy and eliminating the need for complex mechanisms to manage changes to data. A properly normalized database will be fast and easy to manage and maintain, while a poorly normalized structure will be sluggish and difficult to manage and maintain.

 

Published Sunday, September 07, 2008 2:13 AM by andykr
Filed Under:

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

# re: What is Data Normalization?

Monday, September 08, 2008 1:59 PM by Anders Altberg

Neat work on normalization.

Would there be an invoice with no reference to an Order?

Anders

Hi Anders

Thanks for the comment. As to your question, I specifically said, in the text, 'consider the partial definition for an invoice table'...because that is really a business issue, not a data design one and the point of the article was not to propose a universal model for invoices, but to illustrate the point of normalization.  -- Andy

# What Fields should be in the CONTACT_DETAIL Table

Sunday, April 26, 2009 4:15 AM by Dan Baker

Great briefing here, Andy.

I have a question about how you implement this data normalized database.

In the CONTACT_DETAIL table example, I'm not sure how you can limit that table to only 4 fields.

Yes, having a generic DETAILS field to park details like phone numbers and emails works.  But what about MEMO details or NUMERIC details such as "number of years at present company"?  And birthdates would require a DATE field, wouldn't they?

I'm curious how would you design the table to handle that diversity.

First, the point of this article is Database Design, not table design. Second, I am not trying to prescribe the contents of a table - see my articles on normalization for these issues but, in short it will depend on the data to be captured and the purpose of the application. Yes, if you need to capture data in a memo format you will need a memo column, similarly for dates, integers and any other data type.
However, wouldn't you agree that the purpose of a Contact Detail table is probably to store details of how to contact some entity (i.e. a person or an organization)? If so you have a limited number of options: i.e. Either a contact number for Phone/Fax (various types of course - Home, Office, Cell, Pager) or EMail. Not really many other options that I can think of.
The key to normalizing data is to ensure that only those columns that depend directly on the key of the table are stored there. Thus Date of birth is dependent on the Person to whom it relates, not the method for contacting them. Similarly the "Number of years at company" depends on the combination of Person and Position - i.e. Employee data - so again it is nothing to do with Contact Details.
So the only answer I can give you is that Addresses would be in the Address Table, Date of Birth in the Person table and Length of Employment in the Employee table.None of these has anything at all to do with contact details and I would never have all these diverse fields in one table - that is the whole point of normalization!

In addition, am I correct in assuming that your user interface controls such as TextBox will not be able to capture data using the CONTROL SOURCE because you need to buffer the data?

I am sorry, this is completely meaningless to me. What has buffering got to do with UI controls? And what have UI controls got to do with database design?  -- Andy


What do you think?

(required) 
required 
(required)