Ric Ramblings: Singing in the key of primary - Why your primary keys should be artificial

Monday, August 31, 2015

Singing in the key of primary - Why your primary keys should be artificial

I believe very firmly that a primary key should always be an artificial key. Really. Many folks talk about having a “natural key” that can be used as a primary key. I believe that this can appear to work but will eventually fail. So let’s take a look at what this is all about.

First and foremost why do we have a primary key? Most folks believe it is there so we can select the record quickly. After all if I have the primary key for a row it will be a unique look up since the primary key by definition is unique and not null (otherwise it’s not a primary key). That is true, sort of.

A primary key is there so we can join it to another table. Really, that’s why they exist. Not so you and I can select the record directly, it’s there so we can join this table to another and get rows that are related to one and other. The classic simple example is the good old Oracle tables EMP and DEPT. In the DEPT table we have the column DEPTNO as its primary key. In EMP table one of its columns is DEPTNO which points to the record in the DEPT table that the employee works in. This is the standard primary key to foreign key relationship in a relational database. This relationship is the very cornerstone that the whole relational model stands on. Without this nothing else really matters.

So what is a good column to have a primary key on? Of course out of the gate it has to be unique and not null. But what else? One other thing that should be true is that they don’t change. And a problem when you look for a “natural key” is that you can’t guarantee that it wouldn’t change or remain unique. If the data in the primary key column is generated outside of your database, there is no way you can have a 100% guarantee it’s solid and will stay that way for ever.

Another thing to consider is that it really should be one column, not several. Think again about why a primary key exists, to be a foreign key in another table. If it’s one column it’s easy to “push” that to another table. But what about 2 or 3 columns? Now it starts to get cumbersome. Also what about a many to many relationship? The primary key for both tables is in the resolution table, both sides have 2 columns each, now it’s a 4 column resolution table not 2 columns. So what? Multiply that out about a million times and you see it starts to be an issue. And since there is likely an index on the resolution table as well, this will cause the index to approach twice the size.

Sure there are database level features to combat these like compression and partitioning and all that. But what if we had made better choices to start with? Then it would be longer before we have to invoke such things, if at all. For features that require a license fee, this could result in a financial savings. And these features just push the performance and storage problems down the road; they are still there just hidden for the moment.

The primary key should have nothing to do with reality. If it does, it’s just a matter of time before it changes; the only constant in the universe is change after all. Once it changes that is where the real problems can come in. Imaging having to redo all the primary keys in a core table of your application. Scary thought I know.

So what should they be? To me the answer is simple - Every primary key should be an artificial key. The best key would be generated by a unique random number or string generator. But that is at least really hard to do so the next best thing is a numeric sequencer. Also I would purpose that in a really well designed system this key is not displayed to the users at all. Remember this thing is ultimately for the relationships between tables. Not really for row selection. I’m all for other alternate keys like an employee ID that is based on the date of hire, SSNs, product codes, account numbers, or similar columns to be used for select rows. Those are excellent and might even be unique and not null, well maybe the not null part will work at least. The unique part might work for a while at least. Remember the argument for using SSNs for a person’s unique ID? Well that might be a problem.

Unique identifiers are very different from primary keys. They sure can seem to be the same and sometime even have the same attributes (like unique and not null) but one is really for row selection (an identifier) and the other for setting up for foreign keys (the primary key).

11 comments:

Mark J. BobakAugust 31, 2015 at 8:54 AM
100% agree.....you're preaching to the choir....:-)
ReplyDelete
Replies
Franck PachotAugust 31, 2015 at 9:33 AM
Hi,
Why do you say that primary key are for foreign key reference? Foreign key can reference any unique identifier.
The fact that it is 'primary' is 1/ about storage 2/ must be know from the beginning (bcz not null). Very well suited for natural key.
You can add artificial one as unique constraints and reference them by foreign keys.
Regards,
Franck
ReplyDelete
Replies
Joel GarrySeptember 2, 2015 at 2:38 PM
I'm not a member of the choir. As someone who has to deal with complex enterprise apps that include variable design elements used in different ways by different customers, including sometimes crazy customization, I'll take a composite key over a surrogate any day. The "unchanging key" is a false economy, it makes it more abstract and difficult when the data actually changes. Sometimes requirements stipulate attributes within a row aren't known when the row is added. Sometimes bad programming or incomplete integrity rules make bad data.

I'm wondering if the $Billion lawsuit between Albertsons and Haggen was due to bad data.
ReplyDelete
Replies
John FlackSeptember 14, 2015 at 8:39 AM
Joel Garry, when your primary key is there for the main purpose of foreign key reference, and your primary key changes, that means that EVERY foreign key reference to that primary key MUST change. You have to do it in the correct order, and you may have to temporarily disable the foreign key constraints, and while they are disabled, you need to disable your application or risk data corruption. Also, some tables don't HAVE a unique natural key - in EMP, do you really think that the first name and last name together are unique? What about Joe Smith Sr, and Joe Smith Jr? There is an Anglican Bishop in England named John Flack - not me.
Ric Van Dyke: I sometimes make an exception for LOV table, with a Code/Description sort of structure. The codes are somewhat meaningful, but they ARE inventions, like US State postal abbreviations. This is because it is useful to see the code in the data table, and know what it means without having to do the join to the LOV table.
ReplyDelete
Replies
DavidSeptember 15, 2015 at 10:46 PM
Key values change - it's a fact. An "artificial" key (whatever that word means) is in some circumstances less likely to change but you are trying to make a distinction between primary keys and other keys that in practice doesn't exist. It's incorrect to say that a primary key is the "cornerstone" of the relational model. The basis of the relational model is relations with keys (one or more key per relation). In relational terms there is no difference between one key and another. The primary-ness of keys is really a matter of convention and the convenience of software developers.
ReplyDelete
Replies

Add comment