Monday, August 31, 2015

Singing in the key of primary - Why your primary keys should be artificial



I believe very firmly that a primary key should always be an artificial key.  Really.  Many folks talk about having a “natural key” that can be used as a primary key.  I believe that this can appear to work but will eventually fail.  So let’s take a look at what this is all about.  

First and foremost why do we have a primary key?  Most folks believe it is there so we can select the record quickly.  After all if I have the primary key for a row it will be a unique look up since the primary key by definition is unique and not null (otherwise it’s not a primary key).  That is true, sort of. 

A primary key is there so we can join it to another table.  Really, that’s why they exist.  Not so you and I can select the record directly, it’s there so we can join this table to another and get rows that are related to one and other.  The classic simple example is the good old Oracle tables EMP and DEPT.  In the DEPT table we have the column DEPTNO as its primary key.  In EMP table one of its columns is DEPTNO which points to the record in the DEPT table that the employee works in.  This is the standard primary key to foreign key relationship in a relational database.  This relationship is the very cornerstone that the whole relational model stands on.  Without this nothing else really matters.

So what is a good column to have a primary key on?  Of course out of the gate it has to be unique and not null.  But what else? One other thing that should be true is that they don’t change.  And a problem when you look for a “natural key” is that you can’t guarantee that it wouldn’t change or remain unique.  If the data in the primary key column is generated outside of your database, there is no way you can have a 100% guarantee it’s solid and will stay that way for ever.

Another thing to consider is that it really should be one column, not several.  Think again about why a primary key exists, to be a foreign key in another table.  If it’s one column it’s easy to “push” that to another table.  But what about 2 or 3 columns?  Now it starts to get cumbersome.  Also what about a many to many relationship?  The primary key for both tables is in the resolution table, both sides have 2 columns each, now it’s a 4 column resolution table not 2 columns.  So what?  Multiply that out about a million times and you see it starts to be an issue. And since there is likely an index on the resolution table as well, this will cause the index to approach twice the size.

Sure there are database level features to combat these like compression and partitioning and all that.  But what if we had made better choices to start with?  Then it would be longer before we have to invoke such things, if at all.  For features that require a license fee, this could result in a financial savings. And these features just push the performance and storage problems down the road; they are still there just hidden for the moment.

The primary key should have nothing to do with reality.  If it does, it’s just a matter of time before it changes; the only constant in the universe is change after all.  Once it changes that is where the real problems can come in.  Imaging having to redo all the primary keys in a core table of your application.  Scary thought I know. 

So what should they be?  To me the answer is simple - Every primary key should be an artificial key.  The best key would be generated by a unique random number or string generator.  But that is at least really hard to do so the next best thing is a numeric sequencer.   Also I would purpose that in a really well designed system this key is not displayed to the users at all.  Remember this thing is ultimately for the relationships between tables.  Not really for row selection.  I’m all for other alternate keys like an employee ID that is based on the date of hire, SSNs, product codes, account numbers, or similar columns to be used for select rows.  Those are excellent and might even be unique and not null, well maybe the not null part will work at least.   The unique part might work for a while at least. Remember the argument for using SSNs for a person’s unique ID? Well that might be a problem.

Unique identifiers are very different from primary keys.  They sure can seem to be the same and sometime even have the same attributes (like unique and not null) but one is really for row selection (an identifier) and the other for setting up for foreign keys (the primary key).