Design database schemas around a common subset of objects
I'm brainstorming ideas on how to design database schemas that share a common subset of objects and stumbled on some issues. Maybe someone has experience with such a design. Any ideas are welcome!

Use case: My team wants a "template" database from which we can create databases for new customer projects. Customer projects may have additional objects building on top of the common objects provided by the template. Think of the common objects as our core library. Each customer project deals with geographic objects in the same business domain, and we already built prototypes for a few projects using PostGIS. The core library will consist of tables, views, etc. that model entities from the business domain.

Each project should have a dedicated database to isolate data and avoid accidental mixing. It should be possible to handle all projects in a single database by extending primary keys to include the project ID and using table partitioning, but there's always the risk of mixing different projects due to errors in WHERE clauses or JOIN conditions. Of course this should be covered with tests, but shit happens. Customers will not access the databases directly; management just wants to keep the data as isolated as possible.

My ideas: I already ruled out Postgres' template databases, because the template is copied only once at CREATE DATABASE time and there's no mechanism for updating the copies afterwards, besides managing each copy's schema individually as if it had been created from scratch. This led me to extensions, because they are versioned and different versions can be installed per database, so each customer project can evolve at a different pace. The extension for our common objects would also include tables, and I already verified that it's possible to have those dumped by pg_dump [1] for our backups. Postgres also ensures the integrity of extensions by preventing the removal of objects created by extensions.
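For reference, packaging the core library as an extension might look roughly like this. The extension name `corelib` and the table contents are hypothetical; the `pg_extension_config_dump` call is the mechanism from [1] that makes an extension's table data eligible for pg_dump, which it otherwise skips for extension members:

```sql
-- corelib--1.0.sql: objects installed by CREATE EXTENSION corelib VERSION '1.0'.
-- (A corelib.control file with default_version, schema, etc. goes alongside it.)

CREATE TABLE plot (
    plot_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    geom    geometry(Polygon, 4326) NOT NULL  -- PostGIS geometry column
);

-- Flag the table as a "configuration table" so that pg_dump includes
-- its contents in backups despite the table belonging to an extension.
SELECT pg_catalog.pg_extension_config_dump('plot', '');

-- A corelib--1.0--1.1.sql script would then carry the changes from 1.0
-- to 1.1, applied per database with: ALTER EXTENSION corelib UPDATE TO '1.1';
```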
Issue with extensions: At this point I noticed that pg_dump emits CREATE EXTENSION without the version. Why is that? I assumed that pg_dump creates exact snapshots, especially for the purpose of backups. The documentation of pg_dump does not say anything about that, so I checked its source code and found that the version is omitted in favour of the default extension version on the destination installation [2]. I can see the risk of a specific extension version not being available on a different installation, but that would simply cause the restore to fail on a non-existent extension version. This can be fixed by providing the required extension versions on the destination installation.

I guess the intended solution is to create the database and install the expected extension version before restoring the dump. That's why pg_dump emits CREATE EXTENSION IF NOT EXISTS. But this requires manual intervention instead of taking the dump as is. Or recreate the schema from version control and restore only the data.

Because of this issue I'm hesitant to go with an extension for this use case, especially once we have versions with backward-incompatible changes. Instead I'm thinking about something similar to the update scripts we use for extensions, but checked into version control for each customer project.

Erik

[1] https://www.postgresql.org/docs/14/extend-extensions.html#EXTEND-EXTENSIONS-CONFIG-TABLES
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=a2dc42e278b0aca7d8074ebe66670b56099bab4d;hb=86a4dc1e6f29d1992a2afa3fac1a0b0a6e84568c#l10473
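To illustrate that workaround (database and extension names are hypothetical; the point is pinning the version before the restore, so that the dump's unversioned statement becomes a no-op):

```sql
-- On the destination cluster, before restoring the dump:
CREATE DATABASE project_a;

-- Then, connected to the new database, install the exact version
-- the dump was taken with:
CREATE EXTENSION corelib VERSION '1.1';

-- Now restore. The dump's "CREATE EXTENSION IF NOT EXISTS corelib"
-- finds the extension already installed and leaves version 1.1 in place.
-- e.g.  pg_restore --dbname=project_a project_a.dump
```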
Re: Design database schemas around a common subset of objects
On 11/6/21 17:02, Erik Wienhold wrote:
> I'm brainstorming ideas on how to design database schemas that share a
> common subset of objects and stumbled on some issues. Maybe someone has
> experience with such a design. Any ideas are welcome!
>
> Because of this issue I'm hesitant going with an extension for this use
> case, especially once we have versions with backward-incompatible
> changes. Instead I'm thinking about something similar to the update
> scripts we use for extensions but checking them into version control
> for each customer project.

My questions for this are:

1) How stable are the core objects?

2) How divergent are the customer specific requirements?

3) Given 1) and 2) would it be possible to craft extensions that were not customer specific?

4) Are you using or have you looked at Sqitch (https://sqitch.org/)?

-- 
Adrian Klaver
adrian.kla...@aklaver.com
Re: Design database schemas around a common subset of objects
Hi Adrian,

> On 07/11/2021 18:38 Adrian Klaver wrote:
>
> My questions for this are:
>
> 1) How stable are the core objects?

Hard to tell. A lot of changes were necessary for the prototype, as expected. It's considered stable by some team members, who unfortunately lack experience in database design and software development. My main fear, from experience, is breaking changes to unique keys, e.g. additional columns, that are likely to create issues in dependent views and queries.

> 2) How divergent are the customer specific requirements?

Different projects may use a different subset of objects. I don't mind having unused objects, e.g. empty tables, in the database schemas. I still have to find out what is expected to differ between projects. But it's impossible to anticipate every project, and I bet there will be projects that won't fit our model and require breaking changes.

One thing I already know that may differ is how modelled objects are named in the real world. We deal with plots of land, hence the geographic data. Depending on the region, these plots may use different naming schemes relevant to legal documents. The core objects would use a generated surrogate key, but each project would have a dedicated relation that maps the real-world names of those plots to their surrogate keys. The real-world names can be normalized into multiple attributes instead of just storing names as text. This normalization may vary between projects, e.g. a different number of attributes.

> 3) Given 1) and 2) would it be possible to craft extensions that were
> not customer specific?

That's the goal. The core should not contain any project specifics. It doesn't need to be Postgres extensions, because of the issue with pg_dump omitting the extension version which I described in my original post. I'm thinking about something like merges in Git: database schemas for project-a and project-b are like separate repositories, and changes in core are merged as necessary.
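A sketch of that split (all names are hypothetical): the core owns the surrogate-keyed plot table, while each project database adds its own mapping relation with whatever name attributes its region's legal documents require:

```sql
-- Core object, identical in every project database.
CREATE TABLE plot (
    plot_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
    -- geometry and other domain attributes omitted
);

-- Project-specific mapping: the real-world name, normalized into the
-- attributes used in this region (here, two illustrative attributes;
-- another project might need a different set).
CREATE TABLE plot_name (
    plot_id  bigint NOT NULL REFERENCES plot,
    district text   NOT NULL,
    parcel   text   NOT NULL,
    PRIMARY KEY (district, parcel),
    UNIQUE (plot_id)
);
```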
This, along with project-specific changes ("o" in the diagram below), should result in a sequence of migrations applied to the respective databases:

    project-a  oo--o--ooo-o---o--oo>
                ___/  _/  _/  /  / /
    core       o--oo-o---o-o--o--o-->
                \  \__  \  \  \  \
    project-b  o--ooo--o-o-o--ooo-->

These merges (always from core to the projects) form a directed acyclic graph from which the migration sequence can be generated using a topological ordering.

> 4) Are you using or have you looked at Sqitch (https://sqitch.org/)?

Already looking into it after stumbling across it during research on the mailing lists ;)

Erik
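One way to derive such an ordering inside Postgres itself (table and column names are hypothetical; each migration records its direct predecessors in the DAG, i.e. the previous migration in its own line plus any core migration merged in):

```sql
-- Hypothetical bookkeeping table for the migration DAG.
CREATE TABLE migration (
    id      text PRIMARY KEY,
    parents text[] NOT NULL DEFAULT '{}'   -- direct predecessors
);

-- A valid topological order: sort by the longest path from the roots.
-- Every migration then sorts after all of its (transitive) parents.
WITH RECURSIVE depth AS (
    SELECT id, 1 AS d FROM migration WHERE parents = '{}'
    UNION ALL
    SELECT m.id, depth.d + 1
    FROM migration m
    JOIN depth ON depth.id = ANY (m.parents)
)
SELECT id FROM depth GROUP BY id ORDER BY max(d);
```

Ties between incomparable migrations (e.g. independent changes in project-a and core) can be broken arbitrarily, since neither depends on the other.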
Re: Design database schemas around a common subset of objects
On Mon, 8 Nov 2021, Erik Wienhold wrote:
> One thing I already know that may differ is how modelled objects are
> named in the real world. We deal with plots of land, hence the
> geographic data. Depending on the region, these plots may use different
> naming schemes relevant to legal documents. The core objects would use
> a generated surrogate key but each project would have a dedicated
> relation that maps the real-world names of those plots to their
> surrogate keys. The real-world names can be normalized into multiple
> attributes instead of just storing names as text. This normalization
> may vary between projects, e.g. different number of attributes.

Erik,

How about a table 'plot' that holds the real-world name with an abbreviation (as the PK) and other information relevant to plots? That seems to be what you're describing above, just explained differently.

Land plots, like other geographic locations, always have surrogate keys because there's no natural key associated with them. I use site_nbr, samp_nbr, and similar PKs because there's nothing inherently associated with those attributes. Here in the US, the SSN (social security number) is a natural key for people, as the VIN (vehicle identification number) is for ... vehicles.

Rich
Re: Design database schemas around a common subset of objects
> On Nov 7, 2021, at 6:10 PM, Rich Shepard wrote:
>
> Here in the US using SSN (social security numbers) for people are a
> natural key as is the VIN (vehicle identification number) for ...
> vehicles.
>
> Rich

Be very careful with SSN. Not as unique as you might think. As well as a stretch to call natural.