The word "key" has a very specific meaning in relational database design. It refers to a field whose value is shared between tables. The purpose of such data duplication is to enable a record in one table to be associated with a record in another, thereby allowing the extraction of information that is distributed between the tables. Such an association is referred to as a 'relationship' and is what differentiates relational databases from flat-file databases.
There are two types of relational key:
· PRIMARY A primary key is the column, or combination of columns, whose value uniquely identifies an individual record within a table. Neither the primary key, nor any part of it, may have a value of NULL
· FOREIGN A foreign key is the column, or combination of columns in one table whose value identifies an individual (i.e. related) record in another table. A foreign key may, under some situations, have a value of NULL indicating that there is no related record
When defining a relationship between two tables, the table containing the primary key is the "Parent" (also known as the "Referenced Table") and the table containing the foreign key is the "Child" (also known as the "Referencing Table"). Thus the foreign key references the primary and thereby defines how the records in the tables are linked.
It is a fundamental rule that all tables must have a primary key. This is obvious if you consider a table that contains the real names of my own immediate family as listed below.
First Name Last Name
Stanley Kramek
Elaine Kramek
Andrew Kramek
Richard Kramek
Angela Kramek
This is fine and shows my father (Stanley) and mother (Elaine) together with my younger brother (Richard) and sister (Angela). This table's primary key is the combination of first and last name - and that is perfectly allowable. However it is not a good design because when my brother married a girl named "Angela" (and yes, he really did) we encounter a problem when we try to add my new sister-in-law to this table:
First Name Last Name
Stanley Kramek
Elaine Kramek
Andrew Kramek
Richard Kramek
Angela Kramek
Angela Kramek
Now we have no way to differentiate between my sister, and my sister-in-law! This table has no longer has a primary key and in the absence of any additional information is effectively useless. Of course, the obvious solution is to add some additional, differentiating, information perhaps 'middle initial' – although that is not a particularly good choice since not everyone has one, and the value might still not be unique, and as it happens it isn't (my brother's middle name is "Maurice", my sister's is "Mary" and my sister-in-law happens to be "Margaret").
First Name Last Name Middle Initial
Stanley Kramek
Elaine Kramek
Andrew Kramek E
Richard Kramek M
Angela Kramek M
Angela Kramek M
The point of this example is that it is critically important to define a primary key for each table so that every record can be uniquely and unambiguously identified. So far we have defined the primary key as a combination of columns that are used to store the data. This is referred to as a "compound" key.
The first problem with compound keys (as this example shows) is that it is perfectly possible to end up with duplicate values in several columns and so such keys tend to get ever-more complex as data is added to the database.
That raises a second problem for compound keys because any table that references a table that has a compound primary key must carry all of the elements of the key. So in order to link any other information to my 'family' table I would need to duplicate all the information in that table - which makes the task of updating information horrendously difficult.
A much better option would be to use some unique value associated with each individual that can be stored in a single column and, fortunately we have just such a thing available to us – our Social Security Number. SSNs can be stored in a single column and have a value that is unique to the person. This will, finally, allow us to identify each record in the table unambiguously. It is, therefore, a perfectly natural choice for the primary key in any table holding information about people.
In fact Social Security Number is an example of what is called a "Natural" Key (also known as either an "Intelligent" or "Business" Key) and, in this example it will work very nicely giving us a final, usable, version of the table as shown.
First Name Last Name Middle Initial SSN
Stanley Kramek 814-56-8975
Elaine Kramek 823-22-4578
Andrew Kramek E 866-41-8537
Richard Kramek M 866-77-5411
Angela Kramek M 866-23-8544
Angela Kramek M 822-39-6425
But hang on! Are Social Security Numbers truly unique? The answer is 'no'. In fact there are rules that define what constitutes a valid SSN and consequently there is a finite limit on the actual number of truly unique SSNs that can exist – additionally there are certainly cases on record of duplicate SSNs being issued.
This highlights the first of several problems associated with the use of natural keys - are they truly unique? In fact, in order to be usable as a primary key, a candidate must possess three attributes:
· Uniqueness: The value must be unique within the table
· Consistency: The value must be capable of being validated in terms of the data type to which it is being assigned. In other words, the specified value must be consistent with the rules for the database. For example, we can be sure that 35/45/2009 is not a valid date, but it is impossible to determine whether "123456789" is an SSN that is missing the dashes that format it, or some other value altogether
· Verifiabity: The value of a natural key refers to some real entity and so we have to be sure that the entity actually exists. For example, we know that an SSN is invalid if the first three digits are higher than '770', but we cannot determine whether "123-45-6789" is a 'real' SSN or not
However these are not the biggest issues associated with the use of natural keys as primary keys. The real problem with them is that they are governed by rules that have nothing to do with their use as a primary key, and so are subject to change. Remember, the definition of a primary key is that it must be unique within a table and it may not be null. However, the rules governing SSNs, Vehicle Identification Numbers, Account Numbers, Invoice or Order Numbers, or any of the dozens of other obvious 'natural' keys have nothing whatever to do with 'being unique within a table' which is the fundamental requirement for a primary key.
Does that matter? Emphatically yes! Since the values are governed by rules that relate to their principal function they are subject to restrictions that are irrelevant in the context of a primary key. For example, a very common requirement for Invoice Numbers is that they should be issued in an unbroken sequence. If an invoice number is also used as a primary key, the system has to ensure that users do not allocate an invoice number that conflicts with that rule, and prevent users from assigning a number and then abandoning the invoice – even though it makes absolutely no difference whether the primary keys form an unbroken sequence or not (so long as they remain unique within the table). Managing the unbroken sequence requirement for a primary key introduces unnecessary complexity (and hence potential for error) into the system.
Change is, of course, the other issue. Since Natural Keys reflect real entities they are liable to change. Primary keys are also stored (as a foreign key) in all related tables, so it is imperative that any changes to the actual value of a primary key must be reflected in all related records. This can have major impacts on a database since it will require that all indexes involving the key be re-built, and may even force physical re-ordering of data within the database. Now, you may be wondering, at this point, why a natural key used as a primary key would ever change. The commonest reason is, of course, 'human error'. Which leads us to the final issue with Natural keys - that of how they are captured in the system.
Natural keys cannot, by definition, be generated by the database. They are also, almost invariably, examples of data that cannot even be derived and so must be input directly by some user (SSN is a good example of this principle; given that the last SSN entered into the system was "876-11-1655" what is the next one going to be? There is, of course, no way to know!).
The issue is that there is a limit to the amount of validation and verification that can be performed and so it is perfectly possible for errors to be entered into the system. It is axiomatic that in the context of data there is no software solution to "Valid, but Wrong". In other words no matter how much validation and checking you do, you cannot detect that the SSN "123-45-6789" should really have been entered as "123-45-6798"!
Of course, it is likely that these errors will, eventually be caught. Then they have to be corrected with all the problems that this raises. In fact, in applications based on natural keys, the routines for handling changes and updates to key values are usually the most complex and difficult code in the entire system. So what is the alternative? Enter the "Surrogate Key"
A surrogate key is simply a column added to the table whose only function is to be the primary key for that table. Typically these columns are simple integers and all major databases provide for the auto-generation of the values. In SQL Server these columns are called "Identity" columns, in VFP "Auto-Incrementing Fields" and other databases use other names but the principles are the same.
All of the issues outlined above in respect of Compound keys, and Natural keys, disappear when a surrogate primary key is introduced. As records are added to a table, the database itself generates the next value for that table and assigns it to the new record. There are no issues because the actual value has no significance whatever – it merely has to be unique within the table and the database itself ensures that this is done properly.
This also means that surrogate keys never need to change. Since the actual value does not matter there is never any reason to change it. Furthermore, because the value is generated as an integer, it requires very little space (4 bytes) and is easily indexed and accessed, making the establishment and maintenance of relationships much more efficient.
In the context of normalization surrogate keys have the added benefit of not requiring analysis any deeper than the Boyce-Codd Normal extension of third normal form. (This is because all further normalization rules apply only to cases where the primary key consists of more than a single column).
In short, there are really no reasons NOT to use a surrogate primary key on all tables in a relational database. In those special cases where data must be generated at different locations and later merged, or when keys must be generated by the client rather than by the server, GUIDs provide a better choice than simple integers as primary keys. But changing the data type does not affect the principle that surrogate keys are the best way to define primary keys.