Design database schemas around a common subset of objects
I'm brainstorming ideas on how to design database schemas that share a common subset of objects and stumbled on some issues. Maybe someone has experience with such a design. Any ideas are welcome!

Use case: My team wants a "template" database from which we can create databases for new customer projects. Customer projects may have additional objects building on top of the common objects provided by the template. Think of the common objects as our core library. Each customer project deals with geographic objects in the same business domain, and we already built prototypes for a few projects using PostGIS. The core library will consist of tables, views, etc. that model entities from the business domain.

Each project should have a dedicated database to isolate data and avoid accidental mixing. It should be possible to handle all projects in a single database by extending primary keys to include the project ID and using table partitioning, but there's always the risk of mixing different projects due to errors in WHERE clauses or JOIN conditions. Of course this should be covered with tests, but shit happens. Customers will not access the databases directly; management just wants to keep the data as isolated as possible.

My ideas: I already ruled out Postgres' template databases, because the template is copied only once at CREATE DATABASE time and there's no mechanism for updating the copies afterwards, besides managing each copy's schema individually as if it had been created from scratch. This led me to extensions, because they are versioned and different versions can be installed per database, so each customer project can evolve at a different pace. The extension for our common objects would also include tables, and I already verified that it's possible to have those dumped by pg_dump [1] for our backups. Postgres also ensures the integrity of extensions by preventing the removal of objects created by extensions.
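For reference, packaging the core library as an extension might look roughly like this. The extension name `corelib` and the table contents are hypothetical; the `pg_extension_config_dump` call is the mechanism from [1] that makes an extension's table data eligible for pg_dump, which it otherwise skips for extension members:

```sql
-- corelib--1.0.sql: objects installed by CREATE EXTENSION corelib VERSION '1.0'.
-- (A corelib.control file with default_version, schema, etc. goes alongside it.)

CREATE TABLE plot (
    plot_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    geom    geometry(Polygon, 4326) NOT NULL  -- PostGIS geometry column
);

-- Flag the table as a "configuration table" so that pg_dump includes
-- its contents in backups despite the table belonging to an extension.
SELECT pg_catalog.pg_extension_config_dump('plot', '');

-- A corelib--1.0--1.1.sql script would then carry the changes from 1.0
-- to 1.1, applied per database with: ALTER EXTENSION corelib UPDATE TO '1.1';
```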
Issue with extensions: At this point I noticed that pg_dump emits CREATE EXTENSION without the version. Why is that? I assumed that pg_dump creates exact snapshots, especially for the purpose of backups. The documentation of pg_dump does not say anything about that, so I checked its source code and found that the version is omitted in favour of the default extension version on the destination installation [2]. I can see the risk of a specific extension version not being available on a different installation, but that would simply cause the restore to fail on a non-existent extension version. This can be fixed by providing the required extension versions on the destination installation.

I guess the intended solution is to create the database and install the expected extension version before restoring the dump. That's why pg_dump emits CREATE EXTENSION IF NOT EXISTS. But this requires manual intervention instead of taking the dump as is. Or recreate the schema from version control and restore only the data.

Because of this issue I'm hesitant to go with an extension for this use case, especially once we have versions with backward-incompatible changes. Instead I'm thinking about something similar to the update scripts we use for extensions, but checked into version control for each customer project.

Erik

[1] https://www.postgresql.org/docs/14/extend-extensions.html#EXTEND-EXTENSIONS-CONFIG-TABLES
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=a2dc42e278b0aca7d8074ebe66670b56099bab4d;hb=86a4dc1e6f29d1992a2afa3fac1a0b0a6e84568c#l10473
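To illustrate that workaround (database and extension names are hypothetical; the point is pinning the version before the restore, so that the dump's unversioned statement becomes a no-op):

```sql
-- On the destination cluster, before restoring the dump:
CREATE DATABASE project_a;

-- Then, connected to the new database, install the exact version
-- the dump was taken with:
CREATE EXTENSION corelib VERSION '1.1';

-- Now restore. The dump's "CREATE EXTENSION IF NOT EXISTS corelib"
-- finds the extension already installed and leaves version 1.1 in place.
-- e.g.  pg_restore --dbname=project_a project_a.dump
```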
Re: Design database schemas around a common subset of objects
On 11/6/21 17:02, Erik Wienhold wrote:
> I'm brainstorming ideas on how to design database schemas that share a
> common subset of objects and stumbled on some issues. Maybe someone has
> experience with such a design. Any ideas are welcome!
>
> Because of this issue I'm hesitant going with an extension for this use
> case, especially once we have versions with backward-incompatible
> changes. Instead I'm thinking about something similar to the update
> scripts we use for extensions but checking them into version control
> for each customer project.

My questions for this are:

1) How stable are the core objects?

2) How divergent are the customer specific requirements?

3) Given 1) and 2) would it be possible to craft extensions that were not customer specific?

4) Are you using or have you looked at Sqitch (https://sqitch.org/)?

-- 
Adrian Klaver
adrian.kla...@aklaver.com
Re: Design database schemas around a common subset of objects
Hi Adrian,

> On 07/11/2021 18:38 Adrian Klaver wrote:
>
> My questions for this are:
>
> 1) How stable are the core objects?

Hard to tell. A lot of changes were necessary for the prototype, as expected. It's considered stable by some team members, who unfortunately lack experience in database design and software development. My main fear, from experience, is breaking changes to unique keys, e.g. additional columns, that are likely to create issues in dependent views and queries.

> 2) How divergent are the customer specific requirements?

Different projects may use a different subset of objects. I don't mind having unused objects, e.g. empty tables, in the database schemas. I still have to find out what is expected to differ between projects. But it's impossible to anticipate every project, and I bet there will be projects that won't fit our model and require breaking changes.

One thing I already know that may differ is how modelled objects are named in the real world. We deal with plots of land, hence the geographic data. Depending on the region, these plots may use different naming schemes relevant to legal documents. The core objects would use a generated surrogate key, but each project would have a dedicated relation that maps the real-world names of those plots to their surrogate keys. The real-world names can be normalized into multiple attributes instead of just storing names as text. This normalization may vary between projects, e.g. a different number of attributes.

> 3) Given 1) and 2) would it be possible to craft extensions that were
> not customer specific?

That's the goal. The core should not contain any project specifics. It doesn't need to be Postgres extensions, because of the issue with pg_dump omitting the extension version which I described in my original post. I'm thinking about something like merges in Git: database schemas for project-a and project-b are like separate repositories, and changes in core are merged as necessary.
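A sketch of that split (all names are hypothetical): the core owns the surrogate-keyed plot table, while each project database adds its own mapping relation with whatever name attributes its region's legal documents require:

```sql
-- Core object, identical in every project database.
CREATE TABLE plot (
    plot_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
    -- geometry and other domain attributes omitted
);

-- Project-specific mapping: the real-world name, normalized into the
-- attributes used in this region (here, two illustrative attributes;
-- another project might need a different set).
CREATE TABLE plot_name (
    plot_id  bigint NOT NULL REFERENCES plot,
    district text   NOT NULL,
    parcel   text   NOT NULL,
    PRIMARY KEY (district, parcel),
    UNIQUE (plot_id)
);
```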
This, along with project-specific changes ("o" in the diagram below), should result in a sequence of migrations applied to the respective databases:

    project-a  oo--o--ooo-o---o--oo>
                ___/  _/  _/  /  / /
    core       o--oo-o---o-o--o--o-->
                \  \__  \  \  \  \
    project-b  o--ooo--o-o-o--ooo-->

These merges (always from core to the projects) form a directed acyclic graph from which the migration sequence can be generated using a topological ordering.

> 4) Are you using or have you looked at Sqitch (https://sqitch.org/)?

Already looking into it after stumbling across it during research on the mailing lists ;)

Erik
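One way to derive such an ordering inside Postgres itself (table and column names are hypothetical; each migration records its direct predecessors in the DAG, i.e. the previous migration in its own line plus any core migration merged in):

```sql
-- Hypothetical bookkeeping table for the migration DAG.
CREATE TABLE migration (
    id      text PRIMARY KEY,
    parents text[] NOT NULL DEFAULT '{}'   -- direct predecessors
);

-- A valid topological order: sort by the longest path from the roots.
-- Every migration then sorts after all of its (transitive) parents.
WITH RECURSIVE depth AS (
    SELECT id, 1 AS d FROM migration WHERE parents = '{}'
    UNION ALL
    SELECT m.id, depth.d + 1
    FROM migration m
    JOIN depth ON depth.id = ANY (m.parents)
)
SELECT id FROM depth GROUP BY id ORDER BY max(d);
```

Ties between incomparable migrations (e.g. independent changes in project-a and core) can be broken arbitrarily, since neither depends on the other.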
Re: Design database schemas around a common subset of objects
On Mon, 8 Nov 2021, Erik Wienhold wrote:
> One thing I already know that may differ is how modelled objects are
> named in the real world. We deal with plots of land, hence the
> geographic data. Depending on the region, these plots may use different
> naming schemes relevant to legal documents. The core objects would use
> a generated surrogate key but each project would have a dedicated
> relation that maps the real-world names of those plots to their
> surrogate keys. The real-world names can be normalized into multiple
> attributes instead of just storing names as text. This normalization
> may vary between projects, e.g. different number of attributes.

Erik,

How about a table 'plot' that holds the real-world name with an abbreviation (as the PK) and other information relevant to plots? That seems to be what you're describing above, just explained differently.

Land plots, like other geographic locations, always have surrogate keys because there's no natural key associated with them. I use site_nbr, samp_nbr, and similar PKs because there's nothing inherently associated with those attributes. Here in the US, the SSN (social security number) is a natural key for people, as the VIN (vehicle identification number) is for ... vehicles.

Rich
Re: Design database schemas around a common subset of objects
> On Nov 7, 2021, at 6:10 PM, Rich Shepard wrote:
>
> Here in the US using SSN (social security numbers) for people are a
> natural key as is the VIN (vehicle identification number) for ...
> vehicles.
>
> Rich

Be very careful with SSN. Not as unique as you might think. As well as a stretch to call natural.