Monday, August 31, 2015

Singing in the key of primary - Why your primary keys should be artificial



I believe very firmly that a primary key should always be an artificial key.  Really.  Many folks talk about having a “natural key” that can be used as a primary key.  I believe that this can appear to work but will eventually fail.  So let’s take a look at what this is all about.  

First and foremost why do we have a primary key?  Most folks believe it is there so we can select the record quickly.  After all if I have the primary key for a row it will be a unique look up since the primary key by definition is unique and not null (otherwise it’s not a primary key).  That is true, sort of. 

A primary key is there so we can join it to another table.  Really, that’s why they exist.  Not so you and I can select the record directly, it’s there so we can join this table to another and get rows that are related to one and other.  The classic simple example is the good old Oracle tables EMP and DEPT.  In the DEPT table we have the column DEPTNO as its primary key.  In EMP table one of its columns is DEPTNO which points to the record in the DEPT table that the employee works in.  This is the standard primary key to foreign key relationship in a relational database.  This relationship is the very cornerstone that the whole relational model stands on.  Without this nothing else really matters.

So what is a good column to have a primary key on?  Of course out of the gate it has to be unique and not null.  But what else? One other thing that should be true is that they don’t change.  And a problem when you look for a “natural key” is that you can’t guarantee that it wouldn’t change or remain unique.  If the data in the primary key column is generated outside of your database, there is no way you can have a 100% guarantee it’s solid and will stay that way for ever.

Another thing to consider is that it really should be one column, not several.  Think again about why a primary key exists, to be a foreign key in another table.  If it’s one column it’s easy to “push” that to another table.  But what about 2 or 3 columns?  Now it starts to get cumbersome.  Also what about a many to many relationship?  The primary key for both tables is in the resolution table, both sides have 2 columns each, now it’s a 4 column resolution table not 2 columns.  So what?  Multiply that out about a million times and you see it starts to be an issue. And since there is likely an index on the resolution table as well, this will cause the index to approach twice the size.

Sure there are database level features to combat these like compression and partitioning and all that.  But what if we had made better choices to start with?  Then it would be longer before we have to invoke such things, if at all.  For features that require a license fee, this could result in a financial savings. And these features just push the performance and storage problems down the road; they are still there just hidden for the moment.

The primary key should have nothing to do with reality.  If it does, it’s just a matter of time before it changes; the only constant in the universe is change after all.  Once it changes that is where the real problems can come in.  Imaging having to redo all the primary keys in a core table of your application.  Scary thought I know. 

So what should they be?  To me the answer is simple - Every primary key should be an artificial key.  The best key would be generated by a unique random number or string generator.  But that is at least really hard to do so the next best thing is a numeric sequencer.   Also I would purpose that in a really well designed system this key is not displayed to the users at all.  Remember this thing is ultimately for the relationships between tables.  Not really for row selection.  I’m all for other alternate keys like an employee ID that is based on the date of hire, SSNs, product codes, account numbers, or similar columns to be used for select rows.  Those are excellent and might even be unique and not null, well maybe the not null part will work at least.   The unique part might work for a while at least. Remember the argument for using SSNs for a person’s unique ID? Well that might be a problem.

Unique identifiers are very different from primary keys.  They sure can seem to be the same and sometime even have the same attributes (like unique and not null) but one is really for row selection (an identifier) and the other for setting up for foreign keys (the primary key). 

11 comments:

  1. 100% agree.....you're preaching to the choir....:-)

    ReplyDelete
    Replies
    1. Apparently all the members of the choir are not on the same page. :-)

      Delete
  2. Hi,
    Why do you say that primary key are for foreign key reference? Foreign key can reference any unique identifier.
    The fact that it is 'primary' is 1/ about storage 2/ must be know from the beginning (bcz not null). Very well suited for natural key.
    You can add artificial one as unique constraints and reference them by foreign keys.
    Regards,
    Franck

    ReplyDelete
    Replies
    1. Hello Frank,
      "Why do you say that primary key are for foreign key reference?" Because that is what they are for, that is why I say it. Why else have a Primary Key? Sure it's cool to select the row with it, but really it's there for referential integrity. Sure Databases (like Oracle) will let you setup a foreign key on "any" unique key, but just because you can doesn't mean you should. And what does the Primary Key have to do with storage? Sorry you lost me on that one.
      - Ric

      Delete
    2. Hi, I am sorry I thought I had replied.
      I am saying its for storage because I am thinking about IOT for Oracle, clustered tables and primary (sparse) indexes in other DBMS.
      Physically ordering rows is the only thing we can do only with PK. Query, reference, etc need a unique key only.
      Primary means that there can be only one, and that is for the physical order.
      regards,
      Franck

      Delete
    3. Franck, I still don't follow. "Physically ordering rows is the only thing we can do only with PK." I really don't understand this. I don't know other databases, but in Oracle-Land there is no order of day in a table. Yes there is an order to the rows in an IOT (which is really an index not a table). And yes an IOT requires a primary key. To say that a PK is for physical ordering I can't agree with in any way. I don't follow your argument. Maybe I'm misunderstand something here.
      - Ric

      Delete
  3. I'm not a member of the choir. As someone who has to deal with complex enterprise apps that include variable design elements used in different ways by different customers, including sometimes crazy customization, I'll take a composite key over a surrogate any day. The "unchanging key" is a false economy, it makes it more abstract and difficult when the data actually changes. Sometimes requirements stipulate attributes within a row aren't known when the row is added. Sometimes bad programming or incomplete integrity rules make bad data.

    I'm wondering if the $Billion lawsuit between Albertsons and Haggen was due to bad data.

    ReplyDelete
    Replies
    1. Joel,
      Funny but I see your argument and actually more of reason to use an artificial key and not as you intend as a reason not to use it. There is no reason to ever change an artificial key when the data changes. That's the whole point. If the PK is an abstract meaningless number then every other column in the row can change and there is no reason to change the key. As soon as the PK is "natural" there is a potential that as the other columns in the row changes the PK will have to change, and as John Flack points out, this is a major problem.
      - Ric

      Delete
  4. Joel Garry, when your primary key is there for the main purpose of foreign key reference, and your primary key changes, that means that EVERY foreign key reference to that primary key MUST change. You have to do it in the correct order, and you may have to temporarily disable the foreign key constraints, and while they are disabled, you need to disable your application or risk data corruption. Also, some tables don't HAVE a unique natural key - in EMP, do you really think that the first name and last name together are unique? What about Joe Smith Sr, and Joe Smith Jr? There is an Anglican Bishop in England named John Flack - not me.
    Ric Van Dyke: I sometimes make an exception for LOV table, with a Code/Description sort of structure. The codes are somewhat meaningful, but they ARE inventions, like US State postal abbreviations. This is because it is useful to see the code in the data table, and know what it means without having to do the join to the LOV table.

    ReplyDelete
    Replies
    1. Hi John!
      Roger that. There are always exceptions, but still we should be careful. That "little LOV" table might today be just that, but sometimes they grow into things we didn't foresee. If we are lucky as they change we can modify it as it does and keep sanity in the system.
      - Ric

      Delete
  5. Key values change - it's a fact. An "artificial" key (whatever that word means) is in some circumstances less likely to change but you are trying to make a distinction between primary keys and other keys that in practice doesn't exist. It's incorrect to say that a primary key is the "cornerstone" of the relational model. The basis of the relational model is relations with keys (one or more key per relation). In relational terms there is no difference between one key and another. The primary-ness of keys is really a matter of convention and the convenience of software developers.

    ReplyDelete