What is Data Normalization?
Data normalization is a technique that is applied when designing relational database tables in order to minimize the duplication of information and, by so doing, safeguard the database against logical or structural anomalies that could compromise the integrity of the data. By reducing the amount of data duplication, or 'redundancy' normalization achieves four things:
· Minimize the amount of space required to hold the database By removing the need to store the same information in more than one place, the physical size of the data is restricted to a minimum
· Remove the risk of inconsistencies Since each data element occurs only once, there can be no inconsistency between 'versions' of the same information
· Minimize possible Update and Delete anomalies Since each element occurs only once, there is no need to repeat operations in multiple locations when there is the risk that one may succeed while another fails. Additionally by ensuring that all columns in a table are dependent on the primary key, referential integrity is easily enforced
· Maximize the stability of the data structure One thing that is certain about any computer application is that the requirements will change. A stable data structure can handle changes to processing requirements without requiring direct modification of the data structure
The technique defines a series of distinct stages, or "Forms", which represent successively greater degrees of sophistication and that are achieved by the application of ever more rigorous rules. The process begins by reducing the data to First Normal Form (1NF) and then continues to the Fifth Normal Form, however, when using surrogate primary keys there is no need to go beyond the Boyce Codd Variation of Third Normal Form (higher levels only apply when the primary key is composed of more than a single column).
First Normal Form
Reduce tables to first normal form (1NF) by removing repeating or multi-valued fields to another, child table.
The obvious question here is what constitutes "repeating or multi-valued" fields. The data requirements for a "Customer Contact List" might include the following information:
· Account Number
· Customer Name
· Primary Contact Name and Telephone Number
· Alternate Contact Name and Telephone Number
In a flat-file based system we would create a spreadsheet with these items as columns as illustrated:
Table 1: Flat-File Database

Clearly the "contact" and "phone" fields are actually multi-valued fields. The data in each is identical (a name and phone number) and the only difference is in the way in which they are interpreted (i.e. as either the primary or alternate contact). To reduce this data to first normal form we must, according to the rule, remove these columns into their own table and link it to their owning record with a foreign key. An additional key can be used to handle the "interpretation" by linking each contact record to a lookup table that defines the possible uses for the information as illustrated:
Note: The 'crows foot' on the linking line indicates that there are many records in the table for each occurrence of the related key. Thus for each Customer there can be many records in the Contact table.

Figure 1: The contact list after first pass at reduction to 1NF
The benefit of this structure is that we can now add as many contacts as we wish to our customer – we are no longer limited to just 2. However we are not yet finished.
Look closely at the "Contact Table". Although we have now removed the multi-valued fields into their own table, we still have potentially repeating fields. Each record in this table would require a repetition of the contact's name whenever a contact had more than one associated item of information (e.g. Phone, Cell and Fax numbers). So, in accordance with the rule we should now remove contact detail into a separate table (and we can again create a lookup for the definition of the information). The result is the structure shown in Figure 2
The benefits of this structure are clear. We can now handle an unlimited number of contacts per customer – by simply adding new records to the 'Contact' table and linking them to the appropriate customer record. We can also add new types of contact (by adding records to the Contact Type table). Moreover it is now possible to define an unlimited amount of detail per contact.
Remember that one of the key goals of normalization was to create a stable structure that does not need to change when requirements change. So, if we decided we now need to capture mobile phone numbers, or our contact's birthday to send them a card, we could do so by adding a new detail type record, and adding the appropriate information to a new record in the Contact_Detail table. No structural changes to the database are needed!

Figure 2: The contact list in First Normal Form
Notice also that each element of data appears once, and only once, in only one table (Figure 3). Thus the only data in the Customer table is that which relates directly to the customer entity (in this case it happens to be just a name and an account number). Similarly the only data in the Contact table is that which relates directly to the contact entity - in this case it is just a name.
To reconstruct the original 'flat' data we simply join the tables using their related keys. However the most significant benefit is that we can now also access the data in different ways. For example if we wanted a list of all companies for which we hold a fax number we can do it by selecting only those Customer records which have related contacts with a detail record that includes Type = 2 (defined as a Fax Number).
By separating the data into its component parts in this way we also avoid the possibility of error when updating information because we only ever have to update a single record. So, with reference to Figure 3 below, if Alan Anders (Contact_PK = 1) changes his phone number the only record that has to change is the detail table record where contact_fk = 1 and dettype_fk = 1 (phone).
The same thing applies when deleting information. If Colin Charles (whose Contact_PK = 5) were to leave "Canny Castings" we simply delete his record in the "Contact" table, and any records in the "Contact_Detail" table where the "contact_fk = 5" and any application that displays contact information will immediately reflect the change.

Figure 3: The data from the spreadsheet in relational tables
Second Normal Form
Reduce 1NF tables to second normal form (2NF) by removing fields that are not dependent on the whole primary key.
This deals primarily with cases where a surrogate key is not being used, and the primary key consists of more than a single field. Consider the following partial definition for an invoice table:
· Invoice Number
· Line Number
· Customer ID
· Product ID
In the absence of a surrogate key, the minimum unique identifier for this table is the combination of Invoice and Line Numbers. Now the question is, do each of the other data elements depend on this key? Clearly the Product ID does - since the product ID is defined by each line on a specific invoice. But the Customer ID is not. It depends only on the invoice number – the line number is irrelevant in the context of the customer.
The solution is, therefore, to remove the Customer ID and Invoice ID to a new table, which is the "Invoice Header" and keep the line item information as the "Invoice Detail", as illustrated in Figure 4, where we have also added a surrogate primary key to the new table to avoid using the Invoice Number directly as the primary key.

Figure 4: 2NF resolution of the invoice data
As illustrated this approach immediately solves another, related, issue. How to differentiate and save the information that relates to the invoice as a whole (e.g. order date, due date, total price, delivery and shipping information) from the information related to individual line items (e.g. product, quantity, price, discount, margin).
Third Normal Form
Reduce 2NF tables to third normal form (#NF) by removing fields, other than candidate keys, that depend on other non-key fields
Essentially this rule is saying that all fields in a table must be dependent on the primary key, and only the primary. This is often referred to as meaning 'the key, the whole key and nothing but the key'. Consider the following partial specification for an ORDERS table:
Table 2: Order Table specification

This table actually conforms to 2NF (the Vendor ID, and the Location of the Vendor are both dependent on the order) but fails 3NF because the Vendor Location is also dependent on the VendorID and therefore should not be included in this table.
Looking at the invoice header table in Figure 4, we can see that it is, in fact, in 3NF. All of the fields identified (except the foreign key to the customer table) depend directly upon the invoice number – which is itself a candidate key since it must be unique within the table.
Boyce-Codd Normal Form
Reduce 3NF table to Boyce-Codd normal form (BCNF) by ensuring that all tables are in 3NF for any feasible choice of candidate key as the primary key
This variant of the 3NF rule is specifically applicable when using surrogate primary keys. Since the surrogate key is, by definition unique within the table, we need to ensure that the table is in fact in 3NF for any other field that could, potentially, be used as a primary key. The Invoice header table in Figure 4 above contains such a candidate key (the invoice number) and the table, as shown, does conform to BCNF since all columns are also dependent directly on the invoice number.
Fourth and Fifth Normal Forms
There are two additional normal forms in database theory but they are concerned only with cases where the primary key for a table is comprised of multiple columns and so, when using a surrogate key, are not applicable. This, in itself, is another reason for using surrogate keys when designing relational tables.
Summary
Normalization is designed to improve the efficiency with which data is stored in a database by reducing data redundancy and eliminating the need for complex mechanisms to manage changes to data. A properly normalized database will be fast and easy to manage and maintain, while a poorly normalized structure will be sluggish and difficult to manage and maintain.