I believe very firmly that a primary key should always be an
artificial key. Really. Many folks talk about having a “natural key”
that can be used as a primary key. I
believe that this can appear to work but will eventually fail. So let’s take a look at what this is all about.
First and foremost why do we have a primary key? Most folks believe it is there so we can
select the record quickly. After all if
I have the primary key for a row it will be a unique look up since the primary
key by definition is unique and not null (otherwise it’s not a primary key). That is true, sort of.
A primary key is there so we can join it to another
table. Really, that’s why they
exist. Not so you and I can select the
record directly, it’s there so we can join this table to another and get rows
that are related to one and other. The
classic simple example is the good old Oracle tables EMP and DEPT. In the DEPT table we have the column DEPTNO
as its primary key. In EMP
table one of its columns is DEPTNO which points to the record
in the DEPT
table that the employee works in. This
is the standard primary key to foreign key relationship in a relational
database. This relationship is the very
cornerstone that the whole relational model stands on. Without this nothing else really matters.
So what is a good column to have a primary key on? Of course out of the gate it has to be unique
and not null. But what else? One other
thing that should be true is that they don’t change. And a problem when you look for a “natural
key” is that you can’t guarantee that it wouldn’t change or remain unique. If the data in the primary key column is
generated outside of your database, there is no way you can have a 100%
guarantee it’s solid and will stay that way for ever.
Another thing to consider is that it really should be one
column, not several. Think again about
why a primary key exists, to be a foreign key in another table. If it’s one column it’s easy to “push” that
to another table. But what about 2 or 3
columns? Now it starts to get
cumbersome. Also what about a many to
many relationship? The primary key for
both tables is in the resolution table, both sides have 2 columns each, now it’s
a 4 column resolution table not 2 columns.
So what? Multiply that out about
a million times and you see it starts to be an issue. And since there is likely an index on the resolution table as well, this will cause the index to
approach twice the size.
Sure there are database level features to combat these like
compression and partitioning and all that.
But what if we had made better choices to start with? Then it would be longer before we have to
invoke such things, if at all. For
features that require a license fee, this could result in a financial savings. And
these features just push the performance and storage problems down the road;
they are still there just hidden for the moment.
The primary key should have nothing to do with reality. If it does, it’s just a matter of time before
it changes; the only constant in the universe is change after all. Once it changes that is where the real
problems can come in. Imaging having to
redo all the primary keys in a core table of your application. Scary thought I know.
So what should they be?
To me the answer is simple - Every primary key should be an artificial
key. The best key would be generated by
a unique random number or string generator.
But that is at least really hard to do so the next best thing is a numeric
sequencer. Also I would purpose that in
a really well designed system this key is not displayed to the users at
all. Remember this thing is ultimately for
the relationships between tables. Not
really for row selection. I’m all for
other alternate keys like an employee ID that is based on the date of hire, SSNs,
product codes, account numbers, or similar columns to be used for select rows. Those are excellent and might even be unique
and not null, well maybe the not null part will work at least. The
unique part might work for a while at least. Remember the argument for using
SSNs for a person’s unique ID? Well that might be a problem.
Unique identifiers are very different from primary
keys. They sure can seem to be the same
and sometime even have the same attributes (like unique and not null) but one
is really for row selection (an identifier) and the other for setting up for
foreign keys (the primary key).
100% agree.....you're preaching to the choir....:-)
ReplyDeleteApparently all the members of the choir are not on the same page. :-)
DeleteHi,
ReplyDeleteWhy do you say that primary key are for foreign key reference? Foreign key can reference any unique identifier.
The fact that it is 'primary' is 1/ about storage 2/ must be know from the beginning (bcz not null). Very well suited for natural key.
You can add artificial one as unique constraints and reference them by foreign keys.
Regards,
Franck
Hello Frank,
Delete"Why do you say that primary key are for foreign key reference?" Because that is what they are for, that is why I say it. Why else have a Primary Key? Sure it's cool to select the row with it, but really it's there for referential integrity. Sure Databases (like Oracle) will let you setup a foreign key on "any" unique key, but just because you can doesn't mean you should. And what does the Primary Key have to do with storage? Sorry you lost me on that one.
- Ric
Hi, I am sorry I thought I had replied.
DeleteI am saying its for storage because I am thinking about IOT for Oracle, clustered tables and primary (sparse) indexes in other DBMS.
Physically ordering rows is the only thing we can do only with PK. Query, reference, etc need a unique key only.
Primary means that there can be only one, and that is for the physical order.
regards,
Franck
Franck, I still don't follow. "Physically ordering rows is the only thing we can do only with PK." I really don't understand this. I don't know other databases, but in Oracle-Land there is no order of day in a table. Yes there is an order to the rows in an IOT (which is really an index not a table). And yes an IOT requires a primary key. To say that a PK is for physical ordering I can't agree with in any way. I don't follow your argument. Maybe I'm misunderstand something here.
Delete- Ric
I'm not a member of the choir. As someone who has to deal with complex enterprise apps that include variable design elements used in different ways by different customers, including sometimes crazy customization, I'll take a composite key over a surrogate any day. The "unchanging key" is a false economy, it makes it more abstract and difficult when the data actually changes. Sometimes requirements stipulate attributes within a row aren't known when the row is added. Sometimes bad programming or incomplete integrity rules make bad data.
ReplyDeleteI'm wondering if the $Billion lawsuit between Albertsons and Haggen was due to bad data.
Joel,
DeleteFunny but I see your argument and actually more of reason to use an artificial key and not as you intend as a reason not to use it. There is no reason to ever change an artificial key when the data changes. That's the whole point. If the PK is an abstract meaningless number then every other column in the row can change and there is no reason to change the key. As soon as the PK is "natural" there is a potential that as the other columns in the row changes the PK will have to change, and as John Flack points out, this is a major problem.
- Ric
Joel Garry, when your primary key is there for the main purpose of foreign key reference, and your primary key changes, that means that EVERY foreign key reference to that primary key MUST change. You have to do it in the correct order, and you may have to temporarily disable the foreign key constraints, and while they are disabled, you need to disable your application or risk data corruption. Also, some tables don't HAVE a unique natural key - in EMP, do you really think that the first name and last name together are unique? What about Joe Smith Sr, and Joe Smith Jr? There is an Anglican Bishop in England named John Flack - not me.
ReplyDeleteRic Van Dyke: I sometimes make an exception for LOV table, with a Code/Description sort of structure. The codes are somewhat meaningful, but they ARE inventions, like US State postal abbreviations. This is because it is useful to see the code in the data table, and know what it means without having to do the join to the LOV table.
Hi John!
DeleteRoger that. There are always exceptions, but still we should be careful. That "little LOV" table might today be just that, but sometimes they grow into things we didn't foresee. If we are lucky as they change we can modify it as it does and keep sanity in the system.
- Ric
Key values change - it's a fact. An "artificial" key (whatever that word means) is in some circumstances less likely to change but you are trying to make a distinction between primary keys and other keys that in practice doesn't exist. It's incorrect to say that a primary key is the "cornerstone" of the relational model. The basis of the relational model is relations with keys (one or more key per relation). In relational terms there is no difference between one key and another. The primary-ness of keys is really a matter of convention and the convenience of software developers.
ReplyDelete