Re: libc to libicu via pg_dump/pg_restore?
Hi Guillaume,

> On 6 Feb 2025, at 11:13, Guillaume Lelarge wrote:
>
> You probably don't need --disable-triggers. You should fix errors in the
> order they appear. The first one is on the drop of the database:
>
> ERROR: cannot drop the currently open database
>
> pg_restore can't drop the database because it's connected to the database.
> When you use -c and -C options, you can't connect to the database you want to
> restore to. You have to connect to another database, such as postgres, so
> that it can do the drop and the create. After both are done, it will connect
> to the just-created database to do the restore step.
>
> Look at the pg_restore man page
> (https://www.postgresql.org/docs/current/app-pgrestore.html). It says on the
> --create option:
>
> When this option is used, the database named with -d is used only to issue
> the initial DROP DATABASE and CREATE DATABASE commands. All data is restored
> into the database name that appears in the archive.

This is intended, because the dump contains a create database statement which creates the database with libc, which is exactly what I do NOT want. I want it to be a libicu database. So I pre-create it as such and inhibit recreation by pg_restore by sitting on it with a session. So the first message about the database not being created is expected and can be ignored. This works fine for all databases so far.

My problem is the constraint violation which inhibits the foreign key constraints from being created. Everything works for all databases. Only this one has that problem. And since I disabled triggers during restore, that shouldn't be a problem either.

Btw., the parent table contains the rows in question. So they are imported. I just can't make out why there is a problem.

Cheers,
Paul
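For reference, the pre-create step described above amounts to roughly the following on the PostgreSQL 17 side (a sketch only; the locale settings are taken from the \l output quoted earlier in the thread):

CREATE DATABASE mydb
    TEMPLATE template0
    ENCODING 'UTF8'
    LOCALE_PROVIDER icu
    ICU_LOCALE 'en-US'
    LC_COLLATE 'en_US.UTF-8'
    LC_CTYPE 'en_US.UTF-8';

With the database already existing (and a session holding it open), the DROP DATABASE and CREATE DATABASE steps emitted by pg_restore -cC fail as shown above, so the libc settings in the dump never take effect.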
libc to libicu via pg_dump/pg_restore?
Hi,

I have a problem which I don't understand. I have and do:

instance a, libc based, PostgreSQL 15.10:

mydb=# \l mydb
                                        List of databases
 Name |  Owner  | Encoding |   Collate   |    Ctype    | ICU Locale | Locale Provider | Access privileges
------+---------+----------+-------------+-------------+------------+-----------------+-------------------
 mydb | my_user | UTF8     | en_US.UTF-8 | en_US.UTF-8 |            | libc            |

$ pg_dump -Fc -Z1 -b mydb -f mydb.dump.gz
$ ls -l mydb.dump.gz
-rw------- 1 postgres postgres 14660308577 Feb 6 08:45 mydb.dump.gz

instance b, libicu based, PostgreSQL 17.2:

$ psql postgres
# create database mydb;
# \l mydb
                                              List of databases
 Name |  Owner   | Encoding | Locale Provider |   Collate   |    Ctype    | Locale | ICU Rules | Access privileges
------+----------+----------+-----------------+-------------+-------------+--------+-----------+-------------------
 mydb | postgres | UTF8     | icu             | en_US.UTF-8 | en_US.UTF-8 | en-US  |           |

$ pg_restore -cC --if-exists --disable-triggers -d mydb mydb.dump.gz
pg_restore: error: could not execute query: ERROR: cannot drop the currently open database
Command was: DROP DATABASE IF EXISTS mydb;
pg_restore: error: could not execute query: ERROR: database "mydb" already exists
Command was: CREATE DATABASE mydb WITH TEMPLATE = template0 ENCODING = 'UTF8' LOCALE_PROVIDER = libc LOCALE = 'en_US.UTF-8';
pg_restore: error: could not execute query: ERROR: insert or update on table "table_1" violates foreign key constraint "..._fk"
DETAIL: Key (dokument_id)=(133680) is not present in table "...".
Command was: ALTER TABLE ONLY myschema.table ADD CONSTRAINT table_fk FOREIGN KEY (dokument_id) REFERENCES myschema.dokument(id);
pg_restore: error: could not execute query: ERROR: insert or update on table "table_2" violates foreign key constraint "..._fk"
DETAIL: Key (dokument_id)=(133740) is not present in table "dokument".
Command was: ALTER TABLE ONLY vostra2_str.nen_dokument ADD CONSTRAINT table_fk FOREIGN KEY (dokument_id) REFERENCES myschema.dokument(id);

I'm sorry, I sort of had to anonymize object names. But you should be able to get the gist of it. It's a dreaded message when importing.

My goal is to export libc PostgreSQL 15 databases and import them into PostgreSQL 17 as libicu based databases to get away from glibc based sorting. I searched the net and found the "--disable-triggers" option to disable triggers when running pg_restore, but the errors still occur.

What am I doing wrong or how can I better achieve that? Any help would be appreciated.

Thanks in advance.
Paul
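One quick way to check whether the referenced parent rows actually arrived is to query them directly, using the anonymized names and the ids from the DETAIL lines above (adjust to the real names):

SELECT id FROM myschema.dokument WHERE id IN (133680, 133740);

If those ids come back, the parent rows made it across; if not, the restore of the parent table is the thing to investigate.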
Re: libc to libicu via pg_dump/pg_restore?
Hi, On 06/02/2025 10:04, Paul Foerster wrote: Hi, I have a problem which I don't understand. I have and do: instance a, libc based, PostgreSQL 15.10: mydb=# \l mydb List of databases Name | Owner | Encoding | Collate |Ctype| ICU Locale | Locale Provider | Access privileges --+-+--+-+-++-+--- mydb | my_user | UTF8 | en_US.UTF-8 | en_US.UTF-8 || libc | $ pg_dump -Fc -Z1 -b mydb -f mydb.dump.gz $ ls -l mydb.dump.gz -rw--- 1 postgres postgres 14660308577 Feb 6 08:45 mydb.dump.gz instance b, libicu based, PostgreSQL 17.2: $ psql postgres # create database mydb; # \l mydb List of databases Name | Owner | Encoding | Locale Provider | Collate |Ctype| Locale | ICU Rules | Access privileges --+--+--+-+-+-++---+--- mydb | postgres | UTF8 | icu | en_US.UTF-8 | en_US.UTF-8 | en-US | | $ pg_restore -cC --if-exists --disable-triggers -d mydb mydb.dump.gz pg_restore: error: could not execute query: ERROR: cannot drop the currently open database Command was: DROP DATABASE IF EXISTS mydb; pg_restore: error: could not execute query: ERROR: database "mydb" already exists Command was: CREATE DATABASE mydb WITH TEMPLATE = template0 ENCODING = 'UTF8' LOCALE_PROVIDER = libc LOCALE = 'en_US.UTF-8'; pg_restore: error: could not execute query: ERROR: insert or update on table "table_1" violates foreign key constraint "..._fk" DETAIL: Key (dokument_id)=(133680) is not present in table "...". Command was: ALTER TABLE ONLY myschema.table ADD CONSTRAINT table_fk FOREIGN KEY (dokument_id) REFERENCES myschema.dokument(id); pg_restore: error: could not execute query: ERROR: insert or update on table "table_2" violates foreign key constraint "..._fk" DETAIL: Key (dokument_id)=(133740) is not present in table "dokument". Command was: ALTER TABLE ONLY vostra2_str.nen_dokument ADD CONSTRAINT table_fk FOREIGN KEY (dokument_id) REFERENCES myschema.dokument(id); I'm sorry, I sort of had to anonymize object names. But you should be able to get the gist of it. It's a dreaded message when importing. My goal is to export libc PostgreSQL 15 databases and import them into PostgreSQL 17 as libicu based databases to get away from glibc based sorting. I searched the net to find the "--disable-triggers" disable triggers when running pg_restore but the errors still occur. What am I doing wrong or how can I better achieve that? Any help would be appreciated. You probably don't need --disable-triggers. You should fix errors in the order they appear. The first one is on the drop of the database: ERROR: cannot drop the currently open database pg_restore can't drop the database because it's connected to the database. When you use -c and -C options, you can't connect to the database you want to restore to. You have to connect to another database, such as postgres, so that it can do the drop and the create. After both are done, it will connect to the just-created database to do the restore step. Look at the pg_restore man page (https://www.postgresql.org/docs/current/app-pgrestore.html). It says on the --create option: When this option is used, the database named with -d is used only to issue the initial DROP DATABASE and CREATE DATABASE commands. All data is restored into the database name that appears in the archive. Regards. -- Guillaume Lelarge Consultant https://dalibo.com
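Concretely, that means something along these lines (a sketch; with --create the archive still decides the name of the database that gets restored):

$ pg_restore -cC --if-exists -d postgres mydb.dump.gz

pg_restore connects to postgres, issues the DROP DATABASE and CREATE DATABASE for mydb there, then reconnects to the freshly created mydb for the actual restore.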
Re: How to perform a long running dry run transaction without blocking
> Have you considered a validation app? Have it read the inputs and look in db > for conflicts, rather than attempt the insert. Zero transactions necessary I did consider that about a year or two ago when I first conceived the data validation interface. Doing that now would solve the problem of concurrent validations blocking one another, but selects would still get blocked if their result is pending the outcome of a load running in the background. If I mitigate that issue by running those loads over night on a schedule, I still lose out on the benefits of having the loading code do the validation for me... I would have to explicitly find and report on problems that the load exceptions currently do for me, without any extra code. So what I'm saying is that the data being validated is inter-dependent. There are about a dozen different loading scripts (one for each sheet of the uploaded excel file) whose runs are orchestrated by a master load script that ensures they are run in the right order so that the interdependent data can be checked. For example, these are some relative orders of what needs to be loaded so that data can be validated: Study > Animals Tracers > Infusates > Animals Treatments > Animals Tissues > Samples The Animal load script would fail if the new data in (the tables) Study, Tracers, Infusates, and Treatments aren't inserted, because it links to those newly created records. And there's no way to detect problems in those new relationships in the unchanged database if they aren't inserted. That's what doing this all in a transaction, and actually doing the inserts (for which I use Django `get_or_create` method calls) provides. In other words, I would have to save and explicitly check the inter-related sheet data in data structures independent of the database in order to find the equivalent of (for example) `ObjectDoesNotExist` errors that originate from the database. Right now, I get those errors caught "for free". All I have to do is tell the user what sheet/row/column is related to that error. And it saves me the overhead of having to maintain synchronicity between separate validation code and loading code when the loading code changes. Robert William Leach Research Software Engineer 133 Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544
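A heavily simplified sketch of why that ordering matters, with two hypothetical tables standing in for the real Study/Animal sheets:

CREATE TABLE study (
    id   bigint PRIMARY KEY,
    name text   NOT NULL
);
CREATE TABLE animal (
    id       bigint PRIMARY KEY,
    name     text   NOT NULL,
    study_id bigint NOT NULL REFERENCES study (id)
);

-- Loading an Animal row whose Study row has not been inserted yet raises the
-- FK violation that the loader currently turns into a per-row validation message:
INSERT INTO animal (id, name, study_id) VALUES (1, 'mouse-01', 42);
-- ERROR: insert or update on table "animal" violates foreign key constraint ...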
Re: libc to libicu via pg_dump/pg_restore?
On 2/6/25 09:37, Paul Foerster wrote: Hi Guillaume, On 6 Feb 2025, at 15:51, Guillaume Lelarge wrote: You're right. Now I see the "create database" query in your previous email. I should have been more careful, sorry for the noise. No problem. Well, the doc says that --disable-triggers is only relevant for data-only restore, which is not your use case. So you don't need it and it won't help you. Yes, I found that out too. But it doesn't hurt. 🤣 Me neither. But another comment. You create the database, so there should be no objects in it. Why do you use the -c, -C, and --if-exists options? Try without them. On a new database, you should only need: pg_restore -d mydb mydb.dump.gz I need -C because I need ACLs to be recreated too. I tried with -C only, i.e. no -c but that doesn't work for some reason. The --if-exists is a script remnant of my past tries to suppress some messages. I'll try removing that as I rewrote my create database script which runs before importing. By ACL do you mean roles? If so roles are global to the cluster not the database, so I am not seeing -C being relevant. If not you will need to be more specific about what you are referring to. Cheers, Paul -- Adrian Klaver adrian.kla...@aklaver.com
Re: Lookup tables
On 2025-02-04 22:41:38 +0100, Thiemo Kellner wrote:
> 04.02.2025 18:31:09 Michał Kłeczek :
> >
> >> On 4 Feb 2025, at 18:27, Thiemo Kellner wrote:
> >>
> >> Unless the lookup table is actually a check constraint one
> >> can use to populate dropdown boxes in an interface.
> >
> > That is even worse because it ceases being transactional and users
> > might select something different than what they see on the screen.
>
> I might see what you want to point out. E.g. the table is COLOURS. The
> rec with id 1 is RED, the one with id 2 is BLUE, 3 is GREE and so on.
> Now you load these values into the dropdown box that sports RED, BLUE,
> GREE and so on. While someone selects GREE, there is a maintenance
> release changing GREE to YELLOW. So when that someone sends the
> selection by id to the backend, not GREE is selected but YELLOW.

I fail to see why use of a surrogate key is the problem here. Either changing the color from GREE to YELLOW makes sense or it doesn't.

If it doesn't make sense, then it's the release which is faulty, not the model. If it does make sense (I'm a bit at a loss when that might be the case, maybe the "color" is just a code word, or maybe they are colors in a design which are arbitrary but must be consistent), then the experience that the user has is exactly the same as if the maintenance release was applied just after they selected the color. Which might be a bit confusing but is almost certainly what is wanted.

> A) Your release changed the semantics of the record 3. Its meaning
> changed. I cannot recommend doing that.

If the release changed the semantics of an existing record the release was almost certainly wrong.

> B) If you absolutely must change the semantic, put your application
> into maintenance mode in which no one can select anything beforehand.
>
> If the maintenance would just correct the typo from GREE to GREEN,
> nothing would happen. Your customer still ordered the lavishly green
> E-Bike her heart ever desired.

Yeah, that's a good example where changing the color from GREE to YELLOW doesn't make sense. Presumably that ID 3 is used as a foreign key in lots of places, e.g. in an inventory table. Your bikes in stock won't just magically change color just because you changed some text in the database. So that change simply doesn't make sense and shouldn't be done as part of a maintenance release. Confusing a few people who just happen to open the dropdown in the wrong second is the least of your problems.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at         | -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |    challenge!"
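A minimal sketch of the scenario being debated (made-up tables and values):

CREATE TABLE colours (
    id   integer PRIMARY KEY,
    name text    NOT NULL UNIQUE
);
CREATE TABLE bikes (
    id        integer PRIMARY KEY,
    colour_id integer NOT NULL REFERENCES colours (id)
);
INSERT INTO colours VALUES (1, 'RED'), (2, 'BLUE'), (3, 'GREE');

-- Correcting the typo keeps the meaning of every referencing row intact:
UPDATE colours SET name = 'GREEN' WHERE id = 3;

-- Changing the semantics retroactively "recolours" every bike with colour_id = 3,
-- which is why such a change does not belong in a maintenance release:
UPDATE colours SET name = 'YELLOW' WHERE id = 3;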
Re: How to perform a long running dry run transaction without blocking
>>> The load to the development server does no validation? >>> >>> If so what is the purpose? >>> >>> The background processes are other validation runs? >> It's the same code that executes in both cases (with or without the >> `--validate` flag). All that that flag does is it (effectively) raises the >> dry run exception before it leaves the transaction block, so it always >> validates (whether the flag is supplied or not). > > More for my sake then anything else, why do the load to the development > server at all if the production load is the only one that counts? The software is still in a new major version beta. We're adding features and fixing bugs. It's not unusual to encounter a new bug, fix it on dev to get the load to work, then deploy a point release on prod. And that means repeated load attempts that interfere with the validation interface. Besides, beyond this, we're planning on a separate staging database that dev effectively now is. Sometimes, a curator only finds a technical data issue after the initial load while browsing the newly loaded data on the dev site. >> So the load doesn't fail until the end of the run, which is inefficient from >> a maintenance perspective. I've been thinking of adding a `--failfast` >> option for use on the back end. Haven't done it yet. I started a load >> yesterday in fact that ran 2 hours before it buffered an exception related >> to a newly introduced bug. I fixed the bug and ran the load again. It >> finished sometime between COB yesterday and this morning (successfully!). > > Alright I am trying to reconcile this with from below, 'The largest studies > take just under a minute'. The context of the 'The largest studies take just under a minute' statement is that it's not loading the hefty/time-consuming raw data. It's only validating the metadata. That's fast (5-60s). And that data is a portion of the transaction in the back-end load. There are errors that validation can miss that are due to not touching the raw data, and in fact, those errors are addressed by curators editing the excel sheets. That's why it's all in the load transaction instead of loaded separately, but those problems are somewhat rare (and we currently have a new feature in the design phase that should almost completely eliminate those issues). >>> Seems you are looking for some sort of queuing system. >>> >>> What are the time constraints for getting the validation turned around. >> I have considered a queuing system, though when I previously floated a proof >> of concept using celery, I was informed it was too much. Though, at the >> time, all I was trying to do was a progress bar for a query stats feature. >> So proposing celery in this instance may get more traction with the rest of >> the team. >> Most of the small validation processes finish in under a dozen seconds. >> The largest studies take just under a minute. I have plans to optimize the >> loading scripts that hopefully could get the largest studies down to a dozen >> seconds. If I could do that, and do the back end loads in off-peak hours, >> then I'd be willing to suffer the rare timeouts from concurrent validations. >> The raw data loads will still likely take a much longer time. > > This is where I get confused, probably because I am not exactly sure what > constitutes validation. My sense is that involves a load of data into live > tables and seeing what fails PK, FK or other constraints. > > If that is the case I am not seeing how the 'for real' data load would be > longer? 
The validation skips the time-consuming raw data load. That raw data is collectively hundreds of gigs in size and could not be uploaded on the validation page anyway. The feature I alluded to above that would make errors associated with the raw data almost completely eliminated is one where the researcher can drop the raw data folder into the form and it just walks the directory to get all the raw data file names and relative paths. It's those data relationships whose validations are currently skipped. > At any rate I can't see how loading into a live database multiple sets of > data while operations are going on in the database can be made conflict free. > To me it seems the best that be done is: > > 1) Reduce chance for conflict by spreading the actions out. > > 2) Have retry logic that deals with conflicts. I'm unfamiliar with retry functionality, but those options sound logical to me as a good path forward, particularly using celery to spread out validations and doing the back end loads at night (or using some sort of fast dump/load). The thing that bothers me about the celery solution is that most of the time, 2 users validating different data will not block, so I would be making users wait for no reason. Ideally, I could anticipate the block and only at that point, separate those validations. This brings up a question though about a possibility I suspect is not practical. My initial read of the isolation levels documentation found this section really promising: > The Repeatable Read isolation level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions. This was before I realized that the actions of the previously started transaction would include "locks" that would block validation even though the load transaction hasn't committed yet: > a target row might have already been updated (or deleted or *locked*) by another concurrent transaction by the time it is found. In this case, the repeatable read transaction will wait for the first updating transaction to commit or roll back Other documentation I read referred to the state of the DB (when a transaction starts) as a "snapshot" and I thought... what if I could save such a snapshot automatically just *before* a back-end load starts, and use that snapshot for validation, such that my validation processes could use that to validate against and not encounter any locks? The validation will never commit, so there's no risk. I know Django's ORM wouldn't support that, but I kind of hoped that someone in this email list might suggest a snapshot functionality as a possible solution. Since the validations never commit, the only downside would be if the backend load changed something that introduces a problem with the validated data that would not be fixed until we actually attempt to load it. Is that too science-fictiony of an idea?
Re: libc to libicu via pg_dump/pg_restore?
Hi Guillaume, > On 6 Feb 2025, at 15:51, Guillaume Lelarge > wrote: > > You're right. Now I see the "create database" query in your previous email. I > should have been more careful, sorry for the noise. No problem. > Well, the doc says that --disable-triggers is only relevant for data-only > restore, which is not your use case. So you don't need it and it won't help > you. Yes, I found that out too. But it doesn't hurt. 🤣 > Me neither. But another comment. You create the database, so there should be > no objects in it. Why do you use the -c, -C, and --if-exists options? Try > without them. On a new database, you should only need: > > pg_restore -d mydb mydb.dump.gz I need -C because I need ACLs to be recreated too. I tried with -C only, i.e. no -c but that doesn't work for some reason. The --if-exists is a script remnant of my past tries to suppress some messages. I'll try removing that as I rewrote my create database script which runs before importing. Cheers, Paul
Re: libc to libicu via pg_dump/pg_restore?
Hi Paul, On 06/02/2025 12:20, Paul Foerster wrote: Hi Guillaume, On 6 Feb 2025, at 11:13, Guillaume Lelarge wrote: You probably don't need --disable-triggers. You should fix errors in the order they appear. The first one is on the drop of the database: ERROR: cannot drop the currently open database pg_restore can't drop the database because it's connected to the database. When you use -c and -C options, you can't connect to the database you want to restore to. You have to connect to another database, such as postgres, so that it can do the drop and the create. After both are done, it will connect to the just-created database to do the restore step. Look at the pg_restore man page (https://www.postgresql.org/docs/current/app-pgrestore.html). It says on the --create option: When this option is used, the database named with -d is used only to issue the initial DROP DATABASE and CREATE DATABASE commands. All data is restored into the database name that appears in the archive. This is intended because the dump contains a create database statement which creates the database with libc which is exactly what I do NOT want. I want it to be a libicu database. So I pre-create it as such and inhibit recreation by pg_restore by sitting on it with a session. So the first message about the database not being created is expected and can be ignored. This works fine for all databases so far. You're right. Now I see the "create database" query in your previous email. I should have been more careful, sorry for the noise. My problem is the constraint violation which inhibits the foreign key contraints from being created. Everything works for all databases. Only this one has that problem. And since I disabled triggers during restore, that shouldn't be a problem either. Well, the doc says that --disable-triggers is only relevant for data-only restore, which is not your use case. So you don't need it and it won't help you. Btw., the parent table contains the rows in question. So they are imported. I just can't make out why there is a problem. Me neither. But another comment. You create the database, so there should be no objects in it. Why do you use the -c, -C, and --if-exists options? Try without them. On a new database, you should only need: pg_restore -d mydb mydb.dump.gz Less options, less weird behaviours. -- Guillaume Lelarge Consultant https://dalibo.com
Re: How to perform a long running dry run transaction without blocking
> Please reply to list also. > Ccing list 👍🏻 >> They enter the data in an excel spreadsheet containing about a dozen >> inter-related worksheets, named, for example: Study, Animals, Samples, >> Tissues, Treatments, Sequences > > Long term is there a thought to have them enter directly into database where > validation can happen in real time No, not really. Sample metadata tends to be an afterthought to researchers. They have it in their notebooks and getting them to enter it at all is like pulling teeth. The validation interface actually has a bunch of other features I haven't mentioned that streamline the process for them. Before it gets to actually validating the data, it tries to lighten the manual burden on the researchers (and help with consistent nomenclature) by pulling sample names out of the raw files, massaging them, and filling those in along with a mass of common data that is used to populate drop-downs in the excel columns to avoid researcher typos and value variants. Having everything work with excel actually made the site more attractive to the researchers, because they're comfortable with it and use it already, so it lowered the bar for using our software. Besides, we don't trust the users enough to enter data unsupervised. There are a lot of aspects of the data that cannot be automatically validated and involve experimental parameters that are adjacent to the purpose of our site. We have curators that need to look at everything to ensure consistency, and looking at all the data in context is necessary before any of it is entered. That said, back in the aughts, I wrote a perl cgi site for a toxin and virulence factor database that used a web interface for data entry and achieved the curation goal by saving a form of all inter-related data. The submit button sent that form to a list of curators who could approve the insert/update and make it actually happen. I think I had actually suggested that form of data entry when this current project first started, but I was overruled. However, in this project, the equivalent procedure would be per-sample, and you'd lose out on the overall context. It's an interesting challenge, but I think we're pretty committed now on this file load path. >>> Where are the background processes loading data to? >> We first run our loads on a development server with a separate copy of the >> database, but then we re-run those same loads on the production server, >> where users perform their validations. > > The load to the development server does no validation? > > If so what is the purpose? > > The background processes are other validation runs? It's the same code that executes in both cases (with or without the `--validate` flag). All that that flag does is it (effectively) raises the dry run exception before it leaves the transaction block, so it always validates (whether the flag is supplied or not). So the load doesn't fail until the end of the run, which is inefficient from a maintenance perspective. I've been thinking of adding a `--failfast` option for use on the back end. Haven't done it yet. I started a load yesterday in fact that ran 2 hours before it buffered an exception related to a newly introduced bug. I fixed the bug and ran the load again. It finished sometime between COB yesterday and this morning (successfully!). >> One of the thoughts I'd had to work around the problem was to somehow dump >> the data from the development load and load it onto production in some sort >> of scheduled downtime or something. 
However, even if we do that, I'm >> concerned that multiple users concurrently validating different submissions >> would encounter this blocking issue, and since those validations can take >> (on the upper end) just under a minute, it's enough for at least 1 user to >> encounter a timeout. I have not yet proven that can happen, but based on my >> reading of the postgres documentation, it seems logical. > Seems you are looking for some sort of queuing system. > What are the time constraints for getting the validation turned around. I have considered a queuing system, though when I previously floated a proof of concept using celery, I was informed it was too much. Though, at the time, all I was trying to do was a progress bar for a query stats feature. So proposing celery in this instance may get more traction with the rest of the team. Most of the small validation processes finish in under a dozen seconds. The largest studies take just under a minute. I have plans to optimize the loading scripts that hopefully could get the largest studies down to a dozen seconds. If I could do that, and do the back end loads in off-peak hours, then I'd be willing to suffer the rare timeouts from concurrent validations. The raw data loads will still likely take a much longer time. Robert William Leach Research Software Engineer 133 Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544
Re: How to perform a long running dry run transaction without blocking
On 2/6/25 09:58, Robert Leach wrote: They enter the data in an excel spreadsheet containing about a dozen inter-related worksheets, named, for example: Study, Animals, Samples, Tissues, Treatments, Sequences Long term is there a thought to have them enter directly into database where validation can happen in real time Having everything work with excel actually made the site more attractive to the researchers, because they're comfortable with it and use it already, so it lowered the bar for using our software. [...] It's an interesting challenge, but I think we're pretty committed now on this file load path. Ok, that is generally how my discussions on this subject end. It's my windmill. The load to the development server does no validation? If so what is the purpose? The background processes are other validation runs? It's the same code that executes in both cases (with or without the `--validate` flag). All that that flag does is it (effectively) raises the dry run exception before it leaves the transaction block, so it always validates (whether the flag is supplied or not). More for my sake then anything else, why do the load to the development server at all if the production load is the only one that counts? So the load doesn't fail until the end of the run, which is inefficient from a maintenance perspective. I've been thinking of adding a `--failfast` option for use on the back end. Haven't done it yet. I started a load yesterday in fact that ran 2 hours before it buffered an exception related to a newly introduced bug. I fixed the bug and ran the load again. It finished sometime between COB yesterday and this morning (successfully!). Alright I am trying to reconcile this with from below, 'The largest studies take just under a minute'. Seems you are looking for some sort of queuing system. What are the time constraints for getting the validation turned around. I have considered a queuing system, though when I previously floated a proof of concept using celery, I was informed it was too much. Though, at the time, all I was trying to do was a progress bar for a query stats feature. So proposing celery in this instance may get more traction with the rest of the team. Most of the small validation processes finish in under a dozen seconds. The largest studies take just under a minute. I have plans to optimize the loading scripts that hopefully could get the largest studies down to a dozen seconds. If I could do that, and do the back end loads in off-peak hours, then I'd be willing to suffer the rare timeouts from concurrent validations. The raw data loads will still likely take a much longer time. This is where I get confused, probably because I am not exactly sure what constitutes validation. My sense is that involves a load of data into live tables and seeing what fails PK, FK or other constraints. If that is the case I am not seeing how the 'for real' data load would be longer? At any rate I can't see how loading into a live database multiple sets of data while operations are going on in the database can be made conflict free. To me it seems the best that be done is: 1) Reduce chance for conflict by spreading the actions out. 2) Have retry logic that deals with conflicts. Robert William Leach Research Software Engineer 133 Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544 -- Adrian Klaver adrian.kla...@aklaver.com
Re: How to perform a long running dry run transaction without blocking
> Have you considered a validation app? Have it read the inputs and look in db for conflicts, rather than attempt the insert. Zero transactions necessary
Re: libc to libicu via pg_dump/pg_restore?
On 2/6/25 01:04, Paul Foerster wrote: Hi, Comments inline. I have a problem which I don't understand. I have and do: instance a, libc based, PostgreSQL 15.10: mydb=# \l mydb List of databases Name | Owner | Encoding | Collate |Ctype| ICU Locale | Locale Provider | Access privileges --+-+--+-+-++-+--- mydb | my_user | UTF8 | en_US.UTF-8 | en_US.UTF-8 || libc | $ pg_dump -Fc -Z1 -b mydb -f mydb.dump.gz $ ls -l mydb.dump.gz -rw--- 1 postgres postgres 14660308577 Feb 6 08:45 mydb.dump.gz instance b, libicu based, PostgreSQL 17.2: $ psql postgres # create database mydb; # \l mydb List of databases Name | Owner | Encoding | Locale Provider | Collate |Ctype| Locale | ICU Rules | Access privileges --+--+--+-+-+-++---+--- mydb | postgres | UTF8 | icu | en_US.UTF-8 | en_US.UTF-8 | en-US | | $ pg_restore -cC --if-exists --disable-triggers -d mydb mydb.dump.gz I would start by doing: 1) Log into postgres database and do: a) DROP DATABASE mydb; b) CREATE DATABASE mydb ; 2) pg_restore -d mydb mydb.dump.gz pg_restore: error: could not execute query: ERROR: insert or update on table "table_1" violates foreign key constraint "..._fk" DETAIL: Key (dokument_id)=(133680) is not present in table "...". Is dokument_id an integer field? Command was: ALTER TABLE ONLY myschema.table ADD CONSTRAINT table_fk FOREIGN KEY (dokument_id) REFERENCES myschema.dokument(id); pg_restore: error: could not execute query: ERROR: insert or update on table "table_2" violates foreign key constraint "..._fk" DETAIL: Key (dokument_id)=(133740) is not present in table "dokument". Command was: ALTER TABLE ONLY vostra2_str.nen_dokument ADD CONSTRAINT table_fk FOREIGN KEY (dokument_id) REFERENCES myschema.dokument(id); What am I doing wrong or how can I better achieve that? Any help would be appreciated. In a follow post you said: "Everything works for all databases. Only this one has that problem." Do you mean you made the same libc --> icu change on the other databases with no errors? Thanks in advance. Paul -- Adrian Klaver adrian.kla...@aklaver.com
Re: How to perform a long running dry run transaction without blocking
On 2/6/25 08:53, Robert Leach wrote: Great questions! Responses inline... Please reply to list also. Ccing list Load to where existing table or temporary table? Existing tables. Note that (and this is a point of contention in our team), we have some fields that must be universally unique across all study data, so if we were to employ temporary (empty) tables, we would miss out on the uniqueness checks. That is another discussion. Background: The previous solution for this validation interface that I had tried, used a partial copy of the database which I called the "validation" database. (I'd been forced to do it that way because previous developers hadn't used transactions and their "dry run" mode had side effects that I didn't want to happen due to the validation interface, so instead of engage in a massive refactor, I tried using a separate database as a quick temporary fix.) But mind you, all of the code I was writing was in django (python), and in order to maintain 2 different databases and stay database architecture agnostic, I learned that your code must be littered with hundreds of insertions of `.using()` (and other) statements, and it turned out that that strategy is not 100% compatible with every kind of Django ORM thing you can do (e.g. `full_clean()`), so I'd had to dig around in django core code to ensure every operation was being performed on the correct database. It was a nightmare to maintain and I happily ripped it all out when I corrected the original problems by wrapping everything in a transaction. Yeah, reason why I bypass the ORM. This validation interface skips the raw data load step, which is the heftiest, most long running, part and generally finishes in well under a minute. So what is the validation actually doing? The raw data is pretty solid. It is automatically generated by hardware (mass spectrometer) and software (peak analysis suites). So it doesn't need much (if any) validation. What needs validation is all the metadata associated with the samples that is totally generated by hand by the researchers. They enter the data in an excel spreadsheet containing about a dozen inter-related worksheets, named, for example: Study, Animals, Samples, Tissues, Treatments, Sequences Well there's your problem:) I will spare you my full spreadsheet rant. Long term is there a thought to have them enter directly into database where validation can happen in real time Where are the background processes loading data to? We first run our loads on a development server with a separate copy of the database, but then we re-run those same loads on the production server, where users perform their validations. The load to the development server does no validation? If so what is the purpose? The background processes are other validation runs? One of the thoughts I'd had to work around the problem was to somehow dump the data from the development load and load it onto production in some sort of scheduled downtime or something. However, even if we do that, I'm concerned that multiple users concurrently validating different submissions would encounter this blocking issue, and since those validations can take (on the upper end) just under a minute, it's enough for at least 1 user to encounter a timeout. I have not yet proven that can happen, but based on my reading of the postgres documentation, it seems logical. Seems you are looking for some sort of queuing system. What are the time constraints for getting the validation turned around. 
Not that I know of and that would be a false promise anyway as I know of no process that is perfect. I realize that it's a false promise WRT the background load transaction, but it's a moot concern from the perspective of the validation page, because it is making a guaranteed promise that it will never commit. All I want is for it to not be blocked so I can report as many errors as I can to the researcher so they can advance their submission compilation. besides, there's a pretty solid bet that since the load succeeded on the development server, it will succeed on the production server where this block would happen. So if the load transaction does fail, and the data the validation process was waiting on (to see the outcome) is not reported as problematic to the user, it will be reported as problematic to those responsible for the load on the back-end, so it gets dealt with either way. -- Adrian Klaver adrian.kla...@aklaver.com Robert William Leach Research Software Engineer 133 Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544 -- Adrian Klaver adrian.kla...@aklaver.com
Re: libc to libicu via pg_dump/pg_restore?
Hi Adrian, > On 6 Feb 2025, at 17:31, Adrian Klaver wrote: > > 1) Log into postgres database and do: > > a) DROP DATABASE mydb; > b) CREATE DATABASE mydb ; > > 2) pg_restore -d mydb mydb.dump.gz With create database being "template template0", this is what my script does. But I need the -cC options for pg_restore to get ACLs back. Leaving out either one of them will not get me the ACLs back. >> pg_restore: error: could not execute query: ERROR: insert or update on >> table "table_1" violates foreign key constraint "..._fk" >> DETAIL: Key (dokument_id)=(133680) is not present in table "...". > > Is dokument_id an integer field? Yes, it's a bigint. > In a follow post you said: > > "Everything works for all databases. Only this one has that problem." > > Do you mean you made the same libc --> icu change on the other databases with > no errors? Yes, I have that PostgreSQL 15.10 cluster with locale provider libc with about 40 databases. I initdb'ed a new PostgreSQL 17.2 cluster with icu as locale provider and did a "create database ... template template0" for all about 40 databases. Then I did the mentioned pg_restore for each of them as a parallel background job in the shell. The whole database cluster is about 1.2 TB in size so I have to find ways to restore as many databases in parallel as possible. However pg_restore only fails on this single database. All others in that database cluster work fine. Cheers Paul
Re: libc to libicu via pg_dump/pg_restore?
Hi Adrian, > On 6 Feb 2025, at 19:44, Adrian Klaver wrote: > > By ACL do you mean roles? > > If so roles are global to the cluster not the database, so I am not seeing -C > being relevant. > > If not you will need to be more specific about what you are referring to. I did a "pg_dumpall -r >roles.sql" on the original database cluster and "psql -f roles.sql" on the new database cluster. So, roles are pre-created as is necessary. No, I mean ACLs, like in "Access privileges" when doing a "\l". Cheers, Paul
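For what it's worth, those per-database "Access privileges" shown by \l live in pg_database.datacl and can be inspected directly to compare old and new clusters, e.g.:

SELECT datname, datacl FROM pg_database WHERE datname = 'mydb';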
Re: How to perform a long running dry run transaction without blocking
On 2/6/25 12:08, Robert Leach wrote: Alright I am trying to reconcile this with from below, 'The largest studies take just under a minute'. The context of the 'The largest studies take just under a minute' statement is that it's not loading the hefty/time-consuming raw data. It's only validating the metadata. That's fast (5-60s). And that data is a portion of the transaction in the back-end load. There are errors that validation can miss that are due to not touching the raw data, and in fact, those errors are addressed by curators editing the excel sheets. That's why it's all in the load transaction instead of As a scientist that makes me start to twitch. Is there an audit trail for that? I'm unfamiliar with retry functionality, but those options sound logical to me as a good path forward, particularly using celery to spread out validations and doing the back end loads at night (or using some sort of fast dump/load). The thing that bothers me about the celery solution is that most of the time, 2 users validating different data will not block, so I would be making users wait for no reason. Ideally, I could anticipate the block and only at that point, separate those validations. Aah, time travel. For fast dump/load on validated data see: https://www.postgresql.org/docs/current/sql-copy.html Though note in Postgres 16- COPY is all or nothing, so if there is an error nothing will be loaded. With version 17 you get ON_ERROR and LOG_VERBOSITY. One way to deal with is to load to a staging table and do your validation there and then move the data to the final table. As to retry that depends on where you want to do it. For subtransactions (SAVEPOINT) see: https://www.postgresql.org/docs/current/sql-savepoint.html https://www.cybertec-postgresql.com/en/subtransactions-and-performance-in-postgresql/ In Python there is try/except. This brings up a question though about a possibility I suspect is not practical. My initial read of the isolation levels documentation found this section really promising: > The Repeatable Read isolation level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions. This was before I realized that the actions of the previously started transaction would include "locks" that would block validation even though the load transaction hasn't committed yet: > a target row might have already been updated (or deleted or *locked*) by another concurrent transaction by the time it is found. In this case, the repeatable read transaction will wait for the first updating transaction to commit or roll back Other documentation I read referred to the state of the DB (when a transaction starts) as a "snapshot" and I thought... what if I could save such a snapshot automatically just *before* a back-end load starts, and use that snapshot for validation, such that my validation processes could use that to validate against and not encounter any locks? The validation will never commit, so there's no risk. Hmm. I don't think so. I know Django's ORM wouldn't support that, but I kind of hoped that someone in this email list might suggest a snapshot functionality as a possible solution. Since the validations never commit, the only downside would be if the backend load changed something that introduces a problem with the validated data that would not be fixed until we actually attempt to load it. Is that too science-fictiony of an idea? -- Adrian Klaver adrian.kla...@aklaver.com
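A minimal sketch of the subtransaction (SAVEPOINT) retry idea, using a hypothetical samples(id, name) table; nested blocks in PL/pgSQL use an implicit savepoint, which is the same mechanism the linked articles describe:

DO $$
BEGIN
    BEGIN  -- this inner block is a subtransaction (an implicit SAVEPOINT)
        INSERT INTO samples (id, name) VALUES (1, 'liver-01');
    EXCEPTION
        WHEN unique_violation OR foreign_key_violation THEN
            -- only this row's work is rolled back; the outer transaction survives
            RAISE NOTICE 'skipping row: %', SQLERRM;
    END;
END;
$$;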
Re: How to perform a long running dry run transaction without blocking
On 2/6/25 11:25, Robert Leach wrote: Have you considered a validation app? Have it read the inputs and look in db for conflicts, rather than attempt the insert. Zero transactions necessary I did consider that about a year or two ago when I first conceived the data validation interface. Doing that now would solve the problem of concurrent validations blocking one another, but selects would still get blocked if their result is pending the outcome of a load running in the background. If I mitigate that issue by running those loads over night on a schedule, I still lose out on the benefits of having the loading code do the validation for me... I would have to explicitly find and report on problems that the load exceptions currently do for me, without any extra code. So what I'm saying is that the data being validated is inter-dependent. There are about a dozen different loading scripts (one for each sheet of the uploaded excel file) whose runs are orchestrated by a master load script that ensures they are run in the right order so that the interdependent data can be checked. For example, these are some relative orders of what needs to be loaded so that data can be validated: Study > Animals Tracers > Infusates > Animals Treatments > Animals Tissues > Samples The Animal load script would fail if the new data in (the tables) Study, Tracers, Infusates, and Treatments aren't inserted, because it links to those newly created records. And there's no way to detect problems in those new relationships in the unchanged database if they aren't inserted. That's what doing this all in a transaction, and actually doing the inserts (for which I use Django `get_or_create` method calls) provides. In other words, I would have to save and explicitly check the inter-related sheet data in data structures independent of the database in order to find the equivalent of (for example) `ObjectDoesNotExist` errors that originate from the database. Right now, I get those errors caught "for free". All I have to do is tell the user what sheet/row/column is related to that error. And it saves me the overhead of having to maintain synchronicity between separate validation code and loading code when the loading code changes. Seems to me this could be dealt with using a schema named validate that contains 'shadow' tables of those in the live schema(s). Import into them and see what fails. Robert William Leach Research Software Engineer 133 Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544 -- Adrian Klaver adrian.kla...@aklaver.com
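A rough sketch of that shadow-table idea, assuming a live table public.samples (names invented):

CREATE SCHEMA IF NOT EXISTS validate;
-- LIKE ... INCLUDING ALL copies column definitions, defaults, NOT NULL/CHECK
-- constraints and indexes, but not foreign keys, so FKs pointing at the live
-- parent tables would have to be added separately:
CREATE TABLE validate.samples (LIKE public.samples INCLUDING ALL);

-- load the submission here, let constraint violations surface, then discard:
-- \copy validate.samples FROM 'samples.csv' WITH (FORMAT csv, HEADER)
DROP TABLE validate.samples;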
Re: Help in vetting Switch from "MD5" to "scram-sha-256" - during DB Upgrade from EC2- PGS - Community Edn ver 13.X to 15.X
On 2/6/25 16:37, Bharani SV-forum wrote: Team I am in the process of doing DB Upgrade from EC2- PGS - Community Edn ver 13.X to 15.X including switching from "MD5 " to "scram-sha-256" for password hashing and authentication. We are having tightly integrated appln tech stack having 256+ roles used by application with MD5 mechanism and having Password setting's replicated in each of the four server's (used to load balance the data flow to each server wise and each stream Wise (e.g LAYER#A-server#1,LAYER#A-server#2,LAYER#A-server#3,LAYER#A-server#4 LAYER#B-server#1,LAYER#B-server#2,LAYER#B-server#3,LAYER#B-server#4 etc and had been embedded in an config file in each server wise We want to have very minimum down time, during the PG upgrade from ver 13.X to 15.X, need to , how to switch over "MD5" to "scram-sha-256" for password hashing and authentication. https://www.postgresql.org/docs/15/auth-password.html "md5 The method md5 uses a custom less secure challenge-response mechanism. It prevents password sniffing and avoids storing passwords on the server in plain text but provides no protection if an attacker manages to steal the password hash from the server. Also, the MD5 hash algorithm is nowadays no longer considered secure against determined attacks. The md5 method cannot be used with the db_user_namespace feature. To ease transition from the md5 method to the newer SCRAM method, if md5 is specified as a method in pg_hba.conf but the user's password on the server is encrypted for SCRAM (see below), then SCRAM-based authentication will automatically be chosen instead. " This means you can upgrade with the md5 passwords and then change over to scram-sha-256 as needed. Need the best practice including steps to avoid mandated change of Application related Role Password during db migration ,as the # of roles/userid count is more. Regards Bharani -- Adrian Klaver adrian.kla...@aklaver.com
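A sketch of the per-role transition (the role name comes from the follow-up message; the password is a placeholder, and the plain-text password has to be supplied again, since an existing md5 hash cannot be converted in place):

-- make newly set passwords use SCRAM (can also be set cluster-wide in postgresql.conf)
SET password_encryption = 'scram-sha-256';

-- re-setting the password stores it as a SCRAM verifier
ALTER ROLE test_usr_1 WITH PASSWORD 'existing-or-new-password';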
Re: Help in vetting Switch from "MD5" to "scram-sha-256" - during DB Upgrade from EC2- PGS - Community Edn ver 13.X to 15.X
On 2/6/25 18:03, Bharani SV-forum wrote: Adrian, TQ for your valuable inputs. Additional Qsn: Assume DB ver = 15.X. By default encryption = scram-sha-256. Assume pg_hba.conf quotes the usage as MD5 for the dbuserid "test_usr_1", e.g.) hostssl all test_usr_1 10.20.30.40 md5 i.e.) Assume if the respective db userid (e.g. test_usr_1) is quoted for usage md5 in pg_hba.conf, there is no need to change the respective Role/Userid password mandatorily. The DB system will allow use of the existing password, i.e. the old MD5 passwords still work, as long as the authentication method in pg_hba.conf is set to md5? Yes. It gives you time to switch the passwords to scram-sha-256 encryption after you do the migration. In other words you can have both md5 and scram-sha-256 passwords in use without changing the pg_hba.conf lines. Once the transition to scram-sha-256 is done then you can change the lines to scram-sha-256 and that will prevent use of md5 passwords going forward. e.g.) hostssl all LOGS_USER_1 10.9.0.0/21 md5 Is there any security problem due to usage of md5 in the pg_hba.conf file with underlying db = 15.X? You are currently using it, have there been any issues? If not then moving to Postgres 15 won't change that. I am aware that (a) the MD5 hash algorithm is nowadays no longer considered secure against determined attacks, and (b) the MD5 method cannot be used with the db_user_namespace feature. -- Adrian Klaver adrian.kla...@aklaver.com
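To track the transition, the roles whose stored password is still an md5 hash can be listed (as superuser) with something like:

SELECT rolname FROM pg_authid WHERE rolpassword LIKE 'md5%';

Once that list is empty, the md5 lines in pg_hba.conf can be switched to scram-sha-256.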
How to perform a long running dry run transaction without blocking
I've been trying to solve this problem in Django and I've finally decided after over a year going down this path that there's no way to solve it in Django (in a database agnostic fashion). So I'm thinking that I need to explore a Postgres-specific solution. FYI, I'm not a database expert. I just know enough to get myself into trouble like this problem. Let me try and distill all the back story down to an oversimplified explanation: I created a scientific data submission validation interface that helps researchers compile their data submissions to our database. To do this, I decided to add a `--validate` option to the load scripts that essentially raises a dry run exception at the end of a load run so that the data is all rolled back before leaving the atomic transaction block. This validation interface skips the raw data load step, which is the heftiest, most long running, part and generally finishes in well under a minute. The interface works fantastically well. It rolls back problematic data in smaller transactions and buffers those errors for a final report that the user can work to fix in their data files and retry validation until all their problems are solved and the data is ready to load. The problem is that if we have a long running load going on in the background (which can take hours) and during that load, a user tries to validate a new submission that contains some overlapping common data (e.g. a new tissue type or protocol that is currently being loaded) or perhaps they are trying to validate data added to an existing submission that is being partially loaded, that validation process gets blocked and the validation interface encounters a gateway timeout. I had tried changing the isolation level to both repeatable read and serializable, but the hang can still occur (though serializable seems to avoid hangs in some cases that repeatable read does not). My initial interpretation of the isolation level documentation was that transactions would not be aware of what is happening in other transactions and that if there was a problem, you would only ever get a serialization error at the end when the data is committed, but after encountering the hang yesterday, I realized that the "snapshot" that the second transaction gets includes locks already established by the first transaction. I had hoped that since my validation interface would never commit, it would never have a serialization error or hang. What options are available that might allow this all to work as I intended? Is there some way to tell the validation process transaction to assume that any pending changes from another transaction will succeed? I have lots of ideas on how to mitigate these occurrences, but I don't want to bias any potentially helpful responses. Am I screwed from my early decision to use the loading code to validate data submissions?
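Stripped of the Django layer, the dry-run pattern described here boils down to roughly this (SQL sketch):

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- run the same INSERTs/UPDATEs the real load would run, collecting errors ...
ROLLBACK;  -- the "dry run": nothing is ever committed

The blocking happens because an INSERT or UPDATE that collides with an uncommitted row from the long-running load (a duplicate key, an already-locked row) has to wait to find out whether that other transaction commits, and that wait happens at every isolation level, regardless of the eventual ROLLBACK.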
Re: How to perform a long running dry run transaction without blocking
On 2/6/25 07:40, Robert Leach wrote: Comments inline. Let me try and distill all the back story down to an oversimplified explanation: I created a scientific data submission validation interface that helps researchers compile their data submissions to our database. To do this, I decided to add a `--validate` option to the load scripts that essentially raises a dry run exception at the end of a load run so that the data is all rolled back before leaving the atomic transaction block. Load to where existing table or temporary table? This validation interface skips the raw data load step, which is the heftiest, most long running, part and generally finishes in well under a minute. So what is the validation actually doing? The interface works fantastically well. It rolls back problematic data in smaller transactions and buffers those errors for a final report that the user can work to fix in their data files and retry validation until all their problems are solved and the data is ready to load. The problem is that if we have a long running load going on in the background (which can take hours) and during that load, a user tries to validate a new submission that contains some overlapping common data (e.g. a new tissue type or protocol that is currently being loaded) or perhaps they are trying to validate data added to an existing submission that is being partially loaded, that validation process gets blocked and the validation interface encounters a gateway timeout. Where are the background processes loading data to? I had tried changing the isolation level to both repeatable read and serializable, but the hang can still occur (though serializable seems to avoid hangs in some cases that repeatable read does not). My initial interpretation of the isolation level documentation was that transactions would not be aware of what is happening in other transactions and that if there was a problem, you would only ever get a serialization error at the end when the data is committed, but after encountering the hang yesterday, I realized that the "snapshot" that the second transaction gets includes locks already established by the first transaction. I had hoped that since my validation interface would never commit, it would never have a serialization error or hang. What options are available that might allow this all to work as I intended? Is there some way to tell the validation process transaction to assume that any pending changes from another transaction will succeed? Not that I know of and that would be a false promise anyway as I know of no process that is perfect. I have lots of ideas on how to mitigate these occurrences, but I don't want to bias any potentially helpful responses. Am I screwed from my early decision to use the loading code to validate data submissions? -- Adrian Klaver adrian.kla...@aklaver.com