I believe very firmly that a primary key should always be an
artificial key. Really. Many folks talk about having a “natural key”
that can be used as a primary key. I
believe that this can appear to work but will eventually fail. So let’s take a look at what this is all about.
First and foremost why do we have a primary key? Most folks believe it is there so we can
select the record quickly. After all if
I have the primary key for a row it will be a unique look up since the primary
key by definition is unique and not null (otherwise it’s not a primary key). That is true, sort of.
A primary key is there so we can join it to another
table. Really, that’s why they
exist. Not so you and I can select the
record directly, it’s there so we can join this table to another and get rows
that are related to one and other. The
classic simple example is the good old Oracle tables EMP and DEPT. In the DEPT table we have the column DEPTNO
as its primary key. In EMP
table one of its columns is DEPTNO which points to the record
in the DEPT
table that the employee works in. This
is the standard primary key to foreign key relationship in a relational
database. This relationship is the very
cornerstone that the whole relational model stands on. Without this nothing else really matters.
So what is a good column to have a primary key on? Of course out of the gate it has to be unique
and not null. But what else? One other
thing that should be true is that they don’t change. And a problem when you look for a “natural
key” is that you can’t guarantee that it wouldn’t change or remain unique. If the data in the primary key column is
generated outside of your database, there is no way you can have a 100%
guarantee it’s solid and will stay that way for ever.
Another thing to consider is that it really should be one
column, not several. Think again about
why a primary key exists, to be a foreign key in another table. If it’s one column it’s easy to “push” that
to another table. But what about 2 or 3
columns? Now it starts to get
cumbersome. Also what about a many to
many relationship? The primary key for
both tables is in the resolution table, both sides have 2 columns each, now it’s
a 4 column resolution table not 2 columns.
So what? Multiply that out about
a million times and you see it starts to be an issue. And since there is likely an index on the resolution table as well, this will cause the index to
approach twice the size.
Sure there are database level features to combat these like
compression and partitioning and all that.
But what if we had made better choices to start with? Then it would be longer before we have to
invoke such things, if at all. For
features that require a license fee, this could result in a financial savings. And
these features just push the performance and storage problems down the road;
they are still there just hidden for the moment.
The primary key should have nothing to do with reality. If it does, it’s just a matter of time before
it changes; the only constant in the universe is change after all. Once it changes that is where the real
problems can come in. Imaging having to
redo all the primary keys in a core table of your application. Scary thought I know.
So what should they be?
To me the answer is simple - Every primary key should be an artificial
key. The best key would be generated by
a unique random number or string generator.
But that is at least really hard to do so the next best thing is a numeric
sequencer. Also I would purpose that in
a really well designed system this key is not displayed to the users at
all. Remember this thing is ultimately for
the relationships between tables. Not
really for row selection. I’m all for
other alternate keys like an employee ID that is based on the date of hire, SSNs,
product codes, account numbers, or similar columns to be used for select rows. Those are excellent and might even be unique
and not null, well maybe the not null part will work at least. The
unique part might work for a while at least. Remember the argument for using
SSNs for a person’s unique ID? Well that might be a problem.
Unique identifiers are very different from primary
keys. They sure can seem to be the same
and sometime even have the same attributes (like unique and not null) but one
is really for row selection (an identifier) and the other for setting up for
foreign keys (the primary key).