On May 3, 2011, at 22:03, Greg Smith <g...@2ndquadrant.com> wrote:

> Merlin Moncure wrote:
>> If your data modeler made the assumption that a MAC is
>> unique (a mistake, obviously), at least the other tables are protected
>> from violations of that assumption because the database would reject
>> them with an error, which is a Good Thing.  Without a uniqueness
>> constraint you now have ambiguous data, which is a Very Bad Thing.
>>  
> 
> With a uniqueness constraint in this situation, the unexpected data--row with 
> a non-unique MAC--will be rejected and possibly lost when the insertion 
> happens.  You say that's a good thing, plenty of people will say that's the 
> worst possible thing that can happen.  When dealing with external data, it's 
> often impossible to know everything you're going to see later at design time. 
>  Approaching that problem with the idea that you're going to lose any data 
> that doesn't fit into the original model is not what everyone finds 
> reasonable behavior.
> 
> I don't think it's possible to decide in a generic way which of these is the 
> better approach:  to reject unexpected data and force the problem back at the 
> application immediately (commit failure), or to accept it because 
> you're using a surrogate key and discover the problems down the line.  Both 
> are valid approaches with a very different type of risk associated with them. 
>  I think it's fair to say that real-world data is not always well known 
> enough at design time to follow the idea you're suggesting though, and that 
> does factor into why there is such a preference for surrogate keys in the 
> industry.
> 
> -- 
> Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

If you implicitly expect the MAC to be unique but do not enforce it, you are 
likely to write queries that rely upon that uniqueness.  If you do enforce it, 
the failure still occurs - it just shows up as a constraint violation at insert 
time instead of a broken query later.
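
For concreteness, the enforcement side is a one-liner; the "devices" table, 
column, and constraint names here are made up for illustration:

    -- hypothetical table holding one row per device, keyed by MAC
    ALTER TABLE devices ADD CONSTRAINT devices_mac_key UNIQUE (mac);
    -- a duplicate insert then fails right away:
    -- ERROR:  duplicate key value violates unique constraint "devices_mac_key"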

A scalar sub-query that links via the MAC will fail when the duplicate data is 
encountered, and normal queries will return too many records.  On a detail 
report the problem may be obvious, but if you are summarizing the data, the 
specific offending record is going to require some effort to find.
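
Roughly what I mean, using the same hypothetical "devices" table plus an 
"events" table that references it by MAC:

    -- scalar sub-query: errors as soon as a MAC appears twice in devices
    SELECT e.event_time,
           (SELECT d.hostname FROM devices d WHERE d.mac = e.mac) AS hostname
      FROM events e;
    -- ERROR:  more than one row returned by a subquery used as an expression

    -- plain join: no error, but every event for the duplicated MAC comes back twice
    SELECT e.event_time, d.hostname
      FROM events e
      JOIN devices d ON d.mac = e.mac;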

I guess if you are the compromising type you can loosely enforce the uniqueness 
by running a check query periodically to see if supposedly unique values have 
been duplicated.
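
Something along these lines (again with the made-up table), run periodically 
from cron or similar:

    -- report supposedly unique MACs that now appear more than once
    SELECT mac, count(*) AS copies
      FROM devices
     GROUP BY mac
    HAVING count(*) > 1;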

I agree there is no right answer - the designer needs to make trade-offs - but 
I'd rather reject new data and leave the system in a stable, status-quo state 
than introduce invalid data and put the system into a state where it requires 
effort to get it functioning again.  If you accept the invalid data, the likely 
scenario when something breaks is that someone finds the offending record and 
removes it until the application and database can be fixed properly - which 
amounts to doing the validation anyway, just later and by hand.  The common 
exception is where identifiers are reused over time and you remove the old 
record in order to allow the newer record to remain.

On a tangential note, a setup I've started considering is one where you 
basically have two identifiers for a record.  One is end-user facing and 
updatable, whereas the other is static and used for relationships between 
tables.  You can create a new record with the same user-facing id as an 
existing record, in which case the existing record's user-facing id is replaced 
with its system id.  This is useful when users will be using the id often and 
it can reasonably be assumed to be unique over a moderate period of time (say a 
year).  Invoice numbers and customer numbers are two common examples.  The 
lookup id itself may require additional fields in order to qualify as a primary 
(natural) key, but the static key wants to be a single field.  Often simply 
combining the original id (and parent identifiers) with a date is sufficient, 
given how infrequently updates occur.  The downside is that with string-based 
parent identifiers the PK value can be quite long.  I currently have PKs of 
40-50 characters, but in my new design a first pass on a couple of tables 
indicated keys of over 100 characters.
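
To make that concrete, here is a rough sketch of the shape I have in mind - all 
names, types, and lengths are invented for illustration:

    CREATE TABLE invoice (
        -- static key: never shown to users, never changed, used for all relations;
        -- built from the user-facing number plus the creation date as described above
        invoice_sysid   varchar(64) PRIMARY KEY,
        -- user-facing key: unique at any given moment, but when the number is reused
        -- the old row's invoice_number gets overwritten with its invoice_sysid
        invoice_number  varchar(20) NOT NULL UNIQUE,
        created_on      date NOT NULL DEFAULT current_date
    );

    CREATE TABLE invoice_line (
        invoice_sysid   varchar(64) NOT NULL REFERENCES invoice (invoice_sysid),
        line_no         integer NOT NULL,
        PRIMARY KEY (invoice_sysid, line_no)
    );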

Are there any rules of thumb on the performance of a PK as a function of key 
length?  I like using varchar-based identifiers since I tend to query tables 
directly, and writing WHERE clauses is much easier if you can avoid the joins.  
I'm likely better off creating views and querying those, but am still curious 
about any basic thoughts on having a 100+ character primary key.
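
The view approach I mean is nothing fancier than this, reusing the made-up 
tables from the sketch above:

    CREATE VIEW invoice_line_v AS
    SELECT i.invoice_number, i.created_on, l.line_no
      FROM invoice_line l
      JOIN invoice i USING (invoice_sysid);

    -- ad-hoc queries can then filter on the friendly identifier without writing the join
    SELECT * FROM invoice_line_v WHERE invoice_number = 'INV-0042';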

David J.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
