Re: libc to libicu via pg_dump/pg_restore?
Hi Guillaume,

> On 6 Feb 2025, at 11:13, Guillaume Lelarge wrote:
>
> You probably don't need --disable-triggers. You should fix errors in the
> order they appear. The first one is on the drop of the database:
>
> ERROR: cannot drop the currently open database
>
> pg_restore can't drop the database because it's connected to the database.
> When you use -c and -C options, you can't connect to the database you want to
> restore to. You have to connect to another database, such as postgres, so
> that it can do the drop and the create. After both are done, it will connect
> to the just-created database to do the restore step.
>
> Look at the pg_restore man page
> (https://www.postgresql.org/docs/current/app-pgrestore.html). It says on the
> --create option:
>
> When this option is used, the database named with -d is used only to issue
> the initial DROP DATABASE and CREATE DATABASE commands. All data is restored
> into the database name that appears in the archive.

This is intended, because the dump contains a create database statement which creates the database with libc, which is exactly what I do NOT want. I want it to be a libicu database. So I pre-create it as such and inhibit recreation by pg_restore by sitting on it with a session. So the first message about the database not being created is expected and can be ignored. This works fine for all databases so far.

My problem is the constraint violation which inhibits the foreign key constraints from being created. Everything works for all databases. Only this one has that problem. And since I disabled triggers during restore, that shouldn't be a problem either.

Btw., the parent table contains the rows in question. So they are imported. I just can't make out why there is a problem.

Cheers,
Paul
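For reference, the pre-create step described above amounts to roughly the following on the PostgreSQL 17 side (a sketch only; the locale settings are taken from the \l output quoted earlier in the thread):

CREATE DATABASE mydb
    TEMPLATE template0
    ENCODING 'UTF8'
    LOCALE_PROVIDER icu
    ICU_LOCALE 'en-US'
    LC_COLLATE 'en_US.UTF-8'
    LC_CTYPE 'en_US.UTF-8';

With the database already existing (and a session holding it open), the DROP DATABASE and CREATE DATABASE steps emitted by pg_restore -cC fail as shown above, so the libc settings in the dump never take effect.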
libc to libicu via pg_dump/pg_restore?
Hi,

I have a problem which I don't understand. I have and do:

instance a, libc based, PostgreSQL 15.10:

mydb=# \l mydb
                                        List of databases
 Name |  Owner  | Encoding |   Collate   |    Ctype    | ICU Locale | Locale Provider | Access privileges
------+---------+----------+-------------+-------------+------------+-----------------+-------------------
 mydb | my_user | UTF8     | en_US.UTF-8 | en_US.UTF-8 |            | libc            |

$ pg_dump -Fc -Z1 -b mydb -f mydb.dump.gz
$ ls -l mydb.dump.gz
-rw------- 1 postgres postgres 14660308577 Feb 6 08:45 mydb.dump.gz

instance b, libicu based, PostgreSQL 17.2:

$ psql postgres
# create database mydb;
# \l mydb
                                              List of databases
 Name |  Owner   | Encoding | Locale Provider |   Collate   |    Ctype    | Locale | ICU Rules | Access privileges
------+----------+----------+-----------------+-------------+-------------+--------+-----------+-------------------
 mydb | postgres | UTF8     | icu             | en_US.UTF-8 | en_US.UTF-8 | en-US  |           |

$ pg_restore -cC --if-exists --disable-triggers -d mydb mydb.dump.gz
pg_restore: error: could not execute query: ERROR: cannot drop the currently open database
Command was: DROP DATABASE IF EXISTS mydb;
pg_restore: error: could not execute query: ERROR: database "mydb" already exists
Command was: CREATE DATABASE mydb WITH TEMPLATE = template0 ENCODING = 'UTF8' LOCALE_PROVIDER = libc LOCALE = 'en_US.UTF-8';
pg_restore: error: could not execute query: ERROR: insert or update on table "table_1" violates foreign key constraint "..._fk"
DETAIL: Key (dokument_id)=(133680) is not present in table "...".
Command was: ALTER TABLE ONLY myschema.table ADD CONSTRAINT table_fk FOREIGN KEY (dokument_id) REFERENCES myschema.dokument(id);
pg_restore: error: could not execute query: ERROR: insert or update on table "table_2" violates foreign key constraint "..._fk"
DETAIL: Key (dokument_id)=(133740) is not present in table "dokument".
Command was: ALTER TABLE ONLY vostra2_str.nen_dokument ADD CONSTRAINT table_fk FOREIGN KEY (dokument_id) REFERENCES myschema.dokument(id);

I'm sorry, I sort of had to anonymize object names. But you should be able to get the gist of it. It's a dreaded message when importing.

My goal is to export libc PostgreSQL 15 databases and import them into PostgreSQL 17 as libicu based databases to get away from glibc based sorting. I searched the net and found the "--disable-triggers" option to disable triggers when running pg_restore, but the errors still occur.

What am I doing wrong or how can I better achieve that? Any help would be appreciated.

Thanks in advance.
Paul
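One quick way to check whether the referenced parent rows actually arrived is to query them directly, using the anonymized names and the ids from the DETAIL lines above (adjust to the real names):

SELECT id FROM myschema.dokument WHERE id IN (133680, 133740);

If those ids come back, the parent rows made it across; if not, the restore of the parent table is the thing to investigate.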
Re: libc to libicu via pg_dump/pg_restore?
Hi, On 06/02/2025 10:04, Paul Foerster wrote: Hi, I have a problem which I don't understand. I have and do: instance a, libc based, PostgreSQL 15.10: mydb=# \l mydb List of databases Name | Owner | Encoding | Collate |Ctype| ICU Locale | Locale Provider | Access privileges --+-+--+-+-++-+--- mydb | my_user | UTF8 | en_US.UTF-8 | en_US.UTF-8 || libc | $ pg_dump -Fc -Z1 -b mydb -f mydb.dump.gz $ ls -l mydb.dump.gz -rw--- 1 postgres postgres 14660308577 Feb 6 08:45 mydb.dump.gz instance b, libicu based, PostgreSQL 17.2: $ psql postgres # create database mydb; # \l mydb List of databases Name | Owner | Encoding | Locale Provider | Collate |Ctype| Locale | ICU Rules | Access privileges --+--+--+-+-+-++---+--- mydb | postgres | UTF8 | icu | en_US.UTF-8 | en_US.UTF-8 | en-US | | $ pg_restore -cC --if-exists --disable-triggers -d mydb mydb.dump.gz pg_restore: error: could not execute query: ERROR: cannot drop the currently open database Command was: DROP DATABASE IF EXISTS mydb; pg_restore: error: could not execute query: ERROR: database "mydb" already exists Command was: CREATE DATABASE mydb WITH TEMPLATE = template0 ENCODING = 'UTF8' LOCALE_PROVIDER = libc LOCALE = 'en_US.UTF-8'; pg_restore: error: could not execute query: ERROR: insert or update on table "table_1" violates foreign key constraint "..._fk" DETAIL: Key (dokument_id)=(133680) is not present in table "...". Command was: ALTER TABLE ONLY myschema.table ADD CONSTRAINT table_fk FOREIGN KEY (dokument_id) REFERENCES myschema.dokument(id); pg_restore: error: could not execute query: ERROR: insert or update on table "table_2" violates foreign key constraint "..._fk" DETAIL: Key (dokument_id)=(133740) is not present in table "dokument". Command was: ALTER TABLE ONLY vostra2_str.nen_dokument ADD CONSTRAINT table_fk FOREIGN KEY (dokument_id) REFERENCES myschema.dokument(id); I'm sorry, I sort of had to anonymize object names. But you should be able to get the gist of it. It's a dreaded message when importing. My goal is to export libc PostgreSQL 15 databases and import them into PostgreSQL 17 as libicu based databases to get away from glibc based sorting. I searched the net to find the "--disable-triggers" disable triggers when running pg_restore but the errors still occur. What am I doing wrong or how can I better achieve that? Any help would be appreciated. You probably don't need --disable-triggers. You should fix errors in the order they appear. The first one is on the drop of the database: ERROR: cannot drop the currently open database pg_restore can't drop the database because it's connected to the database. When you use -c and -C options, you can't connect to the database you want to restore to. You have to connect to another database, such as postgres, so that it can do the drop and the create. After both are done, it will connect to the just-created database to do the restore step. Look at the pg_restore man page (https://www.postgresql.org/docs/current/app-pgrestore.html). It says on the --create option: When this option is used, the database named with -d is used only to issue the initial DROP DATABASE and CREATE DATABASE commands. All data is restored into the database name that appears in the archive. Regards. -- Guillaume Lelarge Consultant https://dalibo.com
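Concretely, that means something along these lines (a sketch; with --create the archive still decides the name of the database that gets restored):

$ pg_restore -cC --if-exists -d postgres mydb.dump.gz

pg_restore connects to postgres, issues the DROP DATABASE and CREATE DATABASE for mydb there, then reconnects to the freshly created mydb for the actual restore.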
Re: How to perform a long running dry run transaction without blocking
> Have you considered a validation app? Have it read the inputs and look in db > for conflicts, rather than attempt the insert. Zero transactions necessary I did consider that about a year or two ago when I first conceived the data validation interface. Doing that now would solve the problem of concurrent validations blocking one another, but selects would still get blocked if their result is pending the outcome of a load running in the background. If I mitigate that issue by running those loads over night on a schedule, I still lose out on the benefits of having the loading code do the validation for me... I would have to explicitly find and report on problems that the load exceptions currently do for me, without any extra code. So what I'm saying is that the data being validated is inter-dependent. There are about a dozen different loading scripts (one for each sheet of the uploaded excel file) whose runs are orchestrated by a master load script that ensures they are run in the right order so that the interdependent data can be checked. For example, these are some relative orders of what needs to be loaded so that data can be validated: Study > Animals Tracers > Infusates > Animals Treatments > Animals Tissues > Samples The Animal load script would fail if the new data in (the tables) Study, Tracers, Infusates, and Treatments aren't inserted, because it links to those newly created records. And there's no way to detect problems in those new relationships in the unchanged database if they aren't inserted. That's what doing this all in a transaction, and actually doing the inserts (for which I use Django `get_or_create` method calls) provides. In other words, I would have to save and explicitly check the inter-related sheet data in data structures independent of the database in order to find the equivalent of (for example) `ObjectDoesNotExist` errors that originate from the database. Right now, I get those errors caught "for free". All I have to do is tell the user what sheet/row/column is related to that error. And it saves me the overhead of having to maintain synchronicity between separate validation code and loading code when the loading code changes. Robert William Leach Research Software Engineer 133 Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544
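A heavily simplified sketch of why that ordering matters, with two hypothetical tables standing in for the real Study/Animal sheets:

CREATE TABLE study (
    id   bigint PRIMARY KEY,
    name text   NOT NULL
);
CREATE TABLE animal (
    id       bigint PRIMARY KEY,
    name     text   NOT NULL,
    study_id bigint NOT NULL REFERENCES study (id)
);

-- Loading an Animal row whose Study row has not been inserted yet raises the
-- FK violation that the loader currently turns into a per-row validation message:
INSERT INTO animal (id, name, study_id) VALUES (1, 'mouse-01', 42);
-- ERROR: insert or update on table "animal" violates foreign key constraint ...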
Re: libc to libicu via pg_dump/pg_restore?
On 2/6/25 09:37, Paul Foerster wrote: Hi Guillaume, On 6 Feb 2025, at 15:51, Guillaume Lelarge wrote: You're right. Now I see the "create database" query in your previous email. I should have been more careful, sorry for the noise. No problem. Well, the doc says that --disable-triggers is only relevant for data-only restore, which is not your use case. So you don't need it and it won't help you. Yes, I found that out too. But it doesn't hurt. 🤣 Me neither. But another comment. You create the database, so there should be no objects in it. Why do you use the -c, -C, and --if-exists options? Try without them. On a new database, you should only need: pg_restore -d mydb mydb.dump.gz I need -C because I need ACLs to be recreated too. I tried with -C only, i.e. no -c but that doesn't work for some reason. The --if-exists is a script remnant of my past tries to suppress some messages. I'll try removing that as I rewrote my create database script which runs before importing. By ACL do you mean roles? If so roles are global to the cluster not the database, so I am not seeing -C being relevant. If not you will need to be more specific about what you are referring to. Cheers, Paul -- Adrian Klaver adrian.kla...@aklaver.com
Re: Lookup tables
On 2025-02-04 22:41:38 +0100, Thiemo Kellner wrote:
> 04.02.2025 18:31:09 Michał Kłeczek :
> >
> >> On 4 Feb 2025, at 18:27, Thiemo Kellner wrote:
> >>
> >> Unless the lookup table is actually a check constraint one
> >> can use to populate dropdown boxes in an interface.
> >
> > That is even worse because it ceases being transactional and users
> > might select something different than what they see on the screen.
>
> I might see what you want to point out. E.g. the table is COLOURS. The
> rec with id 1 is RED, the one with id 2 is BLUE, 3 is GREE and so on.
> Now you load these values into the dropdown box that sports RED, BLUE,
> GREE and so on. While someone selects GREE, there is a maintenance
> release changing GREE to YELLOW. So when that someone sends the
> selection by id to the backend, not GREE is selected but YELLOW.

I fail to see why use of a surrogate key is the problem here. Either changing the color from GREE to YELLOW makes sense or it doesn't.

If it doesn't make sense, then it's the release which is faulty, not the model. If it does make sense (I'm a bit at a loss when that might be the case, maybe the "color" is just a code word, or maybe they are colors in a design which are arbitrary but must be consistent), then the experience that the user has is exactly the same as if the maintenance release was applied just after they selected the color. Which might be a bit confusing but is almost certainly what is wanted.

> A) Your release changed the semantics of the record 3. Its meaning
> changed. I cannot recommend doing that.

If the release changed the semantics of an existing record the release was almost certainly wrong.

> B) If you absolutely must change the semantic, put your application
> into maintenance mode in which no one can select anything beforehand.
>
> If the maintenance would just correct the typo from GREE to GREEN,
> nothing would happen. Your customer still ordered the lavishly green
> E-Bike her heart ever desired.

Yeah, that's a good example where changing the color from GREE to YELLOW doesn't make sense. Presumably that ID 3 is used as a foreign key in lots of places, e.g. in an inventory table. Your bikes in stock won't just magically change color just because you changed some text in the database. So that change simply doesn't make sense and shouldn't be done as part of a maintenance release. Confusing a few people who just happen to open the dropdown in the wrong second is the least of your problems.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at         | -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |    challenge!"
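A minimal sketch of the scenario being debated (made-up tables and values):

CREATE TABLE colours (
    id   integer PRIMARY KEY,
    name text    NOT NULL UNIQUE
);
CREATE TABLE bikes (
    id        integer PRIMARY KEY,
    colour_id integer NOT NULL REFERENCES colours (id)
);
INSERT INTO colours VALUES (1, 'RED'), (2, 'BLUE'), (3, 'GREE');

-- Correcting the typo keeps the meaning of every referencing row intact:
UPDATE colours SET name = 'GREEN' WHERE id = 3;

-- Changing the semantics retroactively "recolours" every bike with colour_id = 3,
-- which is why such a change does not belong in a maintenance release:
UPDATE colours SET name = 'YELLOW' WHERE id = 3;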
Re: How to perform a long running dry run transaction without blocking
>>> The load to the development server does no validation? >>> >>> If so what is the purpose? >>> >>> The background processes are other validation runs? >> It's the same code that executes in both cases (with or without the >> `--validate` flag). All that that flag does is it (effectively) raises the >> dry run exception before it leaves the transaction block, so it always >> validates (whether the flag is supplied or not). > > More for my sake then anything else, why do the load to the development > server at all if the production load is the only one that counts? The software is still in a new major version beta. We're adding features and fixing bugs. It's not unusual to encounter a new bug, fix it on dev to get the load to work, then deploy a point release on prod. And that means repeated load attempts that interfere with the validation interface. Besides, beyond this, we're planning on a separate staging database that dev effectively now is. Sometimes, a curator only finds a technical data issue after the initial load while browsing the newly loaded data on the dev site. >> So the load doesn't fail until the end of the run, which is inefficient from >> a maintenance perspective. I've been thinking of adding a `--failfast` >> option for use on the back end. Haven't done it yet. I started a load >> yesterday in fact that ran 2 hours before it buffered an exception related >> to a newly introduced bug. I fixed the bug and ran the load again. It >> finished sometime between COB yesterday and this morning (successfully!). > > Alright I am trying to reconcile this with from below, 'The largest studies > take just under a minute'. The context of the 'The largest studies take just under a minute' statement is that it's not loading the hefty/time-consuming raw data. It's only validating the metadata. That's fast (5-60s). And that data is a portion of the transaction in the back-end load. There are errors that validation can miss that are due to not touching the raw data, and in fact, those errors are addressed by curators editing the excel sheets. That's why it's all in the load transaction instead of loaded separately, but those problems are somewhat rare (and we currently have a new feature in the design phase that should almost completely eliminate those issues). >>> Seems you are looking for some sort of queuing system. >>> >>> What are the time constraints for getting the validation turned around. >> I have considered a queuing system, though when I previously floated a proof >> of concept using celery, I was informed it was too much. Though, at the >> time, all I was trying to do was a progress bar for a query stats feature. >> So proposing celery in this instance may get more traction with the rest of >> the team. >> Most of the small validation processes finish in under a dozen seconds. >> The largest studies take just under a minute. I have plans to optimize the >> loading scripts that hopefully could get the largest studies down to a dozen >> seconds. If I could do that, and do the back end loads in off-peak hours, >> then I'd be willing to suffer the rare timeouts from concurrent validations. >> The raw data loads will still likely take a much longer time. > > This is where I get confused, probably because I am not exactly sure what > constitutes validation. My sense is that involves a load of data into live > tables and seeing what fails PK, FK or other constraints. > > If that is the case I am not seeing how the 'for real' data load would be > longer? 
The validation skips the time-consuming raw data load. That raw data is collectively hundreds of gigs in size and could not be uploaded on the validation page anyway. The feature I alluded to above that would make errors associated with the raw data almost completely eliminated is one where the researcher can drop the raw data folder into the form and it just walks the directory to get all the raw data file names and relative paths. It's those data relationships whose validations are currently skipped. > At any rate I can't see how loading into a live database multiple sets of > data while operations are going on in the database can be made conflict free. > To me it seems the best that be done is: > > 1) Reduce chance for conflict by spreading the actions out. > > 2) Have retry logic that deals with conflicts. I'm unfamiliar with retry functionality, but those options sound logical to me as a good path forward, particularly using celery to spread out validations and doing the back end loads at night (or using some sort of fast dump/load). The thing that bothers me about the celery solution is that most of the time, 2 users validating different data will not block, so I would be making users wait for no reason. Ideally, I could anticipate the block and only at that point, separate those validations. This brings up a question though about a possibility I suspect is not practical. My initial read of the isolation levels documentation found this section really promising: > The Repeatable Read isolation level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions. This was before I realized that the actions of the previously started transaction would include "locks" that would block validation even though the load transaction hasn't committed yet: > a target row might have already been updated (or deleted or *locked*) by another concurrent transaction by the time it is found. In this case, the repeatable read transaction will wait for the first updating transaction to commit or roll back Other documentation I read referred to the state of the DB (when a transaction starts) as a "snapshot" and I thought... what if I could save such a snapshot automatically just *before* a back-end load starts, and use that snapshot for validation, such that my validation processes could use that to validate against and not encounter any locks? The validation will never commit, so there's no risk. I know Django's ORM wouldn't support that, but I kind of hoped that someone in this email list might suggest a snapshot functionality as a possible solution. Since the validations never commit, the only downside would be if the backend load changed something that introduces a problem with the validated data that would not be fixed until we actually attempt to load it. Is that too science-fictiony of an idea?
Re: libc to libicu via pg_dump/pg_restore?
Hi Guillaume, > On 6 Feb 2025, at 15:51, Guillaume Lelarge > wrote: > > You're right. Now I see the "create database" query in your previous email. I > should have been more careful, sorry for the noise. No problem. > Well, the doc says that --disable-triggers is only relevant for data-only > restore, which is not your use case. So you don't need it and it won't help > you. Yes, I found that out too. But it doesn't hurt. 🤣 > Me neither. But another comment. You create the database, so there should be > no objects in it. Why do you use the -c, -C, and --if-exists options? Try > without them. On a new database, you should only need: > > pg_restore -d mydb mydb.dump.gz I need -C because I need ACLs to be recreated too. I tried with -C only, i.e. no -c but that doesn't work for some reason. The --if-exists is a script remnant of my past tries to suppress some messages. I'll try removing that as I rewrote my create database script which runs before importing. Cheers, Paul
Re: libc to libicu via pg_dump/pg_restore?
Hi Paul, On 06/02/2025 12:20, Paul Foerster wrote: Hi Guillaume, On 6 Feb 2025, at 11:13, Guillaume Lelarge wrote: You probably don't need --disable-triggers. You should fix errors in the order they appear. The first one is on the drop of the database: ERROR: cannot drop the currently open database pg_restore can't drop the database because it's connected to the database. When you use -c and -C options, you can't connect to the database you want to restore to. You have to connect to another database, such as postgres, so that it can do the drop and the create. After both are done, it will connect to the just-created database to do the restore step. Look at the pg_restore man page (https://www.postgresql.org/docs/current/app-pgrestore.html). It says on the --create option: When this option is used, the database named with -d is used only to issue the initial DROP DATABASE and CREATE DATABASE commands. All data is restored into the database name that appears in the archive. This is intended because the dump contains a create database statement which creates the database with libc which is exactly what I do NOT want. I want it to be a libicu database. So I pre-create it as such and inhibit recreation by pg_restore by sitting on it with a session. So the first message about the database not being created is expected and can be ignored. This works fine for all databases so far. You're right. Now I see the "create database" query in your previous email. I should have been more careful, sorry for the noise. My problem is the constraint violation which inhibits the foreign key contraints from being created. Everything works for all databases. Only this one has that problem. And since I disabled triggers during restore, that shouldn't be a problem either. Well, the doc says that --disable-triggers is only relevant for data-only restore, which is not your use case. So you don't need it and it won't help you. Btw., the parent table contains the rows in question. So they are imported. I just can't make out why there is a problem. Me neither. But another comment. You create the database, so there should be no objects in it. Why do you use the -c, -C, and --if-exists options? Try without them. On a new database, you should only need: pg_restore -d mydb mydb.dump.gz Less options, less weird behaviours. -- Guillaume Lelarge Consultant https://dalibo.com
Re: How to perform a long running dry run transaction without blocking
> Please reply to list also. > Ccing list 👍🏻 >> They enter the data in an excel spreadsheet containing about a dozen >> inter-related worksheets, named, for example: Study, Animals, Samples, >> Tissues, Treatments, Sequences > > Long term is there a thought to have them enter directly into database where > validation can happen in real time No, not really. Sample metadata tends to be an afterthought to researchers. They have it in their notebooks and getting them to enter it at all is like pulling teeth. The validation interface actually has a bunch of other features I haven't mentioned that streamline the process for them. Before it gets to actually validating the data, it tries to lighten the manual burden on the researchers (and help with consistent nomenclature) by pulling sample names out of the raw files, massaging them, and filling those in along with a mass of common data that is used to populate drop-downs in the excel columns to avoid researcher typos and value variants. Having everything work with excel actually made the site more attractive to the researchers, because they're comfortable with it and use it already, so it lowered the bar for using our software. Besides, we don't trust the users enough to enter data unsupervised. There are a lot of aspects of the data that cannot be automatically validated and involve experimental parameters that are adjacent to the purpose of our site. We have curators that need to look at everything to ensure consistency, and looking at all the data in context is necessary before any of it is entered. That said, back in the aughts, I wrote a perl cgi site for a toxin and virulence factor database that used a web interface for data entry and achieved the curation goal by saving a form of all inter-related data. The submit button sent that form to a list of curators who could approve the insert/update and make it actually happen. I think I had actually suggested that form of data entry when this current project first started, but I was overruled. However, in this project, the equivalent procedure would be per-sample, and you'd lose out on the overall context. It's an interesting challenge, but I think we're pretty committed now on this file load path. >>> Where are the background processes loading data to? >> We first run our loads on a development server with a separate copy of the >> database, but then we re-run those same loads on the production server, >> where users perform their validations. > > The load to the development server does no validation? > > If so what is the purpose? > > The background processes are other validation runs? It's the same code that executes in both cases (with or without the `--validate` flag). All that that flag does is it (effectively) raises the dry run exception before it leaves the transaction block, so it always validates (whether the flag is supplied or not). So the load doesn't fail until the end of the run, which is inefficient from a maintenance perspective. I've been thinking of adding a `--failfast` option for use on the back end. Haven't done it yet. I started a load yesterday in fact that ran 2 hours before it buffered an exception related to a newly introduced bug. I fixed the bug and ran the load again. It finished sometime between COB yesterday and this morning (successfully!). >> One of the thoughts I'd had to work around the problem was to somehow dump >> the data from the development load and load it onto production in some sort >> of scheduled downtime or something. 
However, even if we do that, I'm >> concerned that multiple users concurrently validating different submissions >> would encounter this blocking issue, and since those validations can take >> (on the upper end) just under a minute, it's enough for at least 1 user to >> encounter a timeout. I have not yet proven that can happen, but based on my >> reading of the postgres documentation, it seems logical. > Seems you are looking for some sort of queuing system. > What are the time constraints for getting the validation turned around. I have considered a queuing system, though when I previously floated a proof of concept using celery, I was informed it was too much. Though, at the time, all I was trying to do was a progress bar for a query stats feature. So proposing celery in this instance may get more traction with the rest of the team. Most of the small validation processes finish in under a dozen seconds. The largest studies take just under a minute. I have plans to optimize the loading scripts that hopefully could get the largest studies down to a dozen seconds. If I could do that, and do the back end loads in off-peak hours, then I'd be willing to suffer the rare timeouts from concurrent validations. The raw data loads will still likely take a much longer time. Robert William Leach Research Software Engineer 133 Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544
Re: How to perform a long running dry run transaction without blocking
On 2/6/25 09:58, Robert Leach wrote: They enter the data in an excel spreadsheet containing about a dozen inter-related worksheets, named, for example: Study, Animals, Samples, Tissues, Treatments, Sequences Long term is there a thought to have them enter directly into database where validation can happen in real time Having everything work with excel actually made the site more attractive to the researchers, because they're comfortable with it and use it already, so it lowered the bar for using our software. [...] It's an interesting challenge, but I think we're pretty committed now on this file load path. Ok, that is generally how my discussions on this subject end. It's my windmill. The load to the development server does no validation? If so what is the purpose? The background processes are other validation runs? It's the same code that executes in both cases (with or without the `--validate` flag). All that that flag does is it (effectively) raises the dry run exception before it leaves the transaction block, so it always validates (whether the flag is supplied or not). More for my sake then anything else, why do the load to the development server at all if the production load is the only one that counts? So the load doesn't fail until the end of the run, which is inefficient from a maintenance perspective. I've been thinking of adding a `--failfast` option for use on the back end. Haven't done it yet. I started a load yesterday in fact that ran 2 hours before it buffered an exception related to a newly introduced bug. I fixed the bug and ran the load again. It finished sometime between COB yesterday and this morning (successfully!). Alright I am trying to reconcile this with from below, 'The largest studies take just under a minute'. Seems you are looking for some sort of queuing system. What are the time constraints for getting the validation turned around. I have considered a queuing system, though when I previously floated a proof of concept using celery, I was informed it was too much. Though, at the time, all I was trying to do was a progress bar for a query stats feature. So proposing celery in this instance may get more traction with the rest of the team. Most of the small validation processes finish in under a dozen seconds. The largest studies take just under a minute. I have plans to optimize the loading scripts that hopefully could get the largest studies down to a dozen seconds. If I could do that, and do the back end loads in off-peak hours, then I'd be willing to suffer the rare timeouts from concurrent validations. The raw data loads will still likely take a much longer time. This is where I get confused, probably because I am not exactly sure what constitutes validation. My sense is that involves a load of data into live tables and seeing what fails PK, FK or other constraints. If that is the case I am not seeing how the 'for real' data load would be longer? At any rate I can't see how loading into a live database multiple sets of data while operations are going on in the database can be made conflict free. To me it seems the best that be done is: 1) Reduce chance for conflict by spreading the actions out. 2) Have retry logic that deals with conflicts. Robert William Leach Research Software Engineer 133 Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544 -- Adrian Klaver adrian.kla...@aklaver.com
Re: How to perform a long running dry run transaction without blocking
> Have you considered a validation app? Have it read the inputs and look in db for conflicts, rather than attempt the insert. Zero transactions necessary
Re: libc to libicu via pg_dump/pg_restore?
On 2/6/25 01:04, Paul Foerster wrote: Hi, Comments inline. I have a problem which I don't understand. I have and do: instance a, libc based, PostgreSQL 15.10: mydb=# \l mydb List of databases Name | Owner | Encoding | Collate |Ctype| ICU Locale | Locale Provider | Access privileges --+-+--+-+-++-+--- mydb | my_user | UTF8 | en_US.UTF-8 | en_US.UTF-8 || libc | $ pg_dump -Fc -Z1 -b mydb -f mydb.dump.gz $ ls -l mydb.dump.gz -rw--- 1 postgres postgres 14660308577 Feb 6 08:45 mydb.dump.gz instance b, libicu based, PostgreSQL 17.2: $ psql postgres # create database mydb; # \l mydb List of databases Name | Owner | Encoding | Locale Provider | Collate |Ctype| Locale | ICU Rules | Access privileges --+--+--+-+-+-++---+--- mydb | postgres | UTF8 | icu | en_US.UTF-8 | en_US.UTF-8 | en-US | | $ pg_restore -cC --if-exists --disable-triggers -d mydb mydb.dump.gz I would start by doing: 1) Log into postgres database and do: a) DROP DATABASE mydb; b) CREATE DATABASE mydb ; 2) pg_restore -d mydb mydb.dump.gz pg_restore: error: could not execute query: ERROR: insert or update on table "table_1" violates foreign key constraint "..._fk" DETAIL: Key (dokument_id)=(133680) is not present in table "...". Is dokument_id an integer field? Command was: ALTER TABLE ONLY myschema.table ADD CONSTRAINT table_fk FOREIGN KEY (dokument_id) REFERENCES myschema.dokument(id); pg_restore: error: could not execute query: ERROR: insert or update on table "table_2" violates foreign key constraint "..._fk" DETAIL: Key (dokument_id)=(133740) is not present in table "dokument". Command was: ALTER TABLE ONLY vostra2_str.nen_dokument ADD CONSTRAINT table_fk FOREIGN KEY (dokument_id) REFERENCES myschema.dokument(id); What am I doing wrong or how can I better achieve that? Any help would be appreciated. In a follow post you said: "Everything works for all databases. Only this one has that problem." Do you mean you made the same libc --> icu change on the other databases with no errors? Thanks in advance. Paul -- Adrian Klaver adrian.kla...@aklaver.com
Re: How to perform a long running dry run transaction without blocking
On 2/6/25 08:53, Robert Leach wrote: Great questions! Responses inline... Please reply to list also. Ccing list Load to where existing table or temporary table? Existing tables. Note that (and this is a point of contention in our team), we have some fields that must be universally unique across all study data, so if we were to employ temporary (empty) tables, we would miss out on the uniqueness checks. That is another discussion. Background: The previous solution for this validation interface that I had tried, used a partial copy of the database which I called the "validation" database. (I'd been forced to do it that way because previous developers hadn't used transactions and their "dry run" mode had side effects that I didn't want to happen due to the validation interface, so instead of engage in a massive refactor, I tried using a separate database as a quick temporary fix.) But mind you, all of the code I was writing was in django (python), and in order to maintain 2 different databases and stay database architecture agnostic, I learned that your code must be littered with hundreds of insertions of `.using()` (and other) statements, and it turned out that that strategy is not 100% compatible with every kind of Django ORM thing you can do (e.g. `full_clean()`), so I'd had to dig around in django core code to ensure every operation was being performed on the correct database. It was a nightmare to maintain and I happily ripped it all out when I corrected the original problems by wrapping everything in a transaction. Yeah, reason why I bypass the ORM. This validation interface skips the raw data load step, which is the heftiest, most long running, part and generally finishes in well under a minute. So what is the validation actually doing? The raw data is pretty solid. It is automatically generated by hardware (mass spectrometer) and software (peak analysis suites). So it doesn't need much (if any) validation. What needs validation is all the metadata associated with the samples that is totally generated by hand by the researchers. They enter the data in an excel spreadsheet containing about a dozen inter-related worksheets, named, for example: Study, Animals, Samples, Tissues, Treatments, Sequences Well there's your problem:) I will spare you my full spreadsheet rant. Long term is there a thought to have them enter directly into database where validation can happen in real time Where are the background processes loading data to? We first run our loads on a development server with a separate copy of the database, but then we re-run those same loads on the production server, where users perform their validations. The load to the development server does no validation? If so what is the purpose? The background processes are other validation runs? One of the thoughts I'd had to work around the problem was to somehow dump the data from the development load and load it onto production in some sort of scheduled downtime or something. However, even if we do that, I'm concerned that multiple users concurrently validating different submissions would encounter this blocking issue, and since those validations can take (on the upper end) just under a minute, it's enough for at least 1 user to encounter a timeout. I have not yet proven that can happen, but based on my reading of the postgres documentation, it seems logical. Seems you are looking for some sort of queuing system. What are the time constraints for getting the validation turned around. 
Not that I know of and that would be a false promise anyway as I know of no process that is perfect. I realize that it's a false promise WRT the background load transaction, but it's a moot concern from the perspective of the validation page, because it is making a guaranteed promise that it will never commit. All I want is for it to not be blocked so I can report as many errors as I can to the researcher so they can advance their submission compilation. besides, there's a pretty solid bet that since the load succeeded on the development server, it will succeed on the production server where this block would happen. So if the load transaction does fail, and the data the validation process was waiting on (to see the outcome) is not reported as problematic to the user, it will be reported as problematic to those responsible for the load on the back-end, so it gets dealt with either way. -- Adrian Klaver adrian.kla...@aklaver.com Robert William Leach Research Software Engineer 133 Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544 -- Adrian Klaver adrian.kla...@aklaver.com
Re: libc to libicu via pg_dump/pg_restore?
Hi Adrian, > On 6 Feb 2025, at 17:31, Adrian Klaver wrote: > > 1) Log into postgres database and do: > > a) DROP DATABASE mydb; > b) CREATE DATABASE mydb ; > > 2) pg_restore -d mydb mydb.dump.gz With create database being "template template0", this is what my script does. But I need the -cC options for pg_restore to get ACLs back. Leaving out either one of them will not get me the ACLs back. >> pg_restore: error: could not execute query: ERROR: insert or update on >> table "table_1" violates foreign key constraint "..._fk" >> DETAIL: Key (dokument_id)=(133680) is not present in table "...". > > Is dokument_id an integer field? Yes, it's a bigint. > In a follow post you said: > > "Everything works for all databases. Only this one has that problem." > > Do you mean you made the same libc --> icu change on the other databases with > no errors? Yes, I have that PostgreSQL 15.10 cluster with locale provider libc with about 40 databases. I initdb'ed a new PostgreSQL 17.2 cluster with icu as locale provider and did a "create database ... template template0" for all about 40 databases. Then I did the mentioned pg_restore for each of them as a parallel background job in the shell. The whole database cluster is about 1.2 TB in size so I have to find ways to restore as many databases in parallel as possible. However pg_restore only fails on this single database. All others in that database cluster work fine. Cheers Paul
Re: libc to libicu via pg_dump/pg_restore?
Hi Adrian, > On 6 Feb 2025, at 19:44, Adrian Klaver wrote: > > By ACL do you mean roles? > > If so roles are global to the cluster not the database, so I am not seeing -C > being relevant. > > If not you will need to be more specific about what you are referring to. I did a "pg_dumpall -r >roles.sql" on the original database cluster and "psql -f roles.sql" on the new database cluster. So, roles are pre-created as is necessary. No, I mean ACLs, like in "Access privileges" when doing a "\l". Cheers, Paul
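For what it's worth, those per-database "Access privileges" shown by \l live in pg_database.datacl and can be inspected directly to compare old and new clusters, e.g.:

SELECT datname, datacl FROM pg_database WHERE datname = 'mydb';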
Re: How to perform a long running dry run transaction without blocking
On 2/6/25 12:08, Robert Leach wrote: Alright I am trying to reconcile this with from below, 'The largest studies take just under a minute'. The context of the 'The largest studies take just under a minute' statement is that it's not loading the hefty/time-consuming raw data. It's only validating the metadata. That's fast (5-60s). And that data is a portion of the transaction in the back-end load. There are errors that validation can miss that are due to not touching the raw data, and in fact, those errors are addressed by curators editing the excel sheets. That's why it's all in the load transaction instead of As a scientist that makes me start to twitch. Is there an audit trail for that? I'm unfamiliar with retry functionality, but those options sound logical to me as a good path forward, particularly using celery to spread out validations and doing the back end loads at night (or using some sort of fast dump/load). The thing that bothers me about the celery solution is that most of the time, 2 users validating different data will not block, so I would be making users wait for no reason. Ideally, I could anticipate the block and only at that point, separate those validations. Aah, time travel. For fast dump/load on validated data see: https://www.postgresql.org/docs/current/sql-copy.html Though note in Postgres 16- COPY is all or nothing, so if there is an error nothing will be loaded. With version 17 you get ON_ERROR and LOG_VERBOSITY. One way to deal with is to load to a staging table and do your validation there and then move the data to the final table. As to retry that depends on where you want to do it. For subtransactions (SAVEPOINT) see: https://www.postgresql.org/docs/current/sql-savepoint.html https://www.cybertec-postgresql.com/en/subtransactions-and-performance-in-postgresql/ In Python there is try/except. This brings up a question though about a possibility I suspect is not practical. My initial read of the isolation levels documentation found this section really promising: > The Repeatable Read isolation level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions. This was before I realized that the actions of the previously started transaction would include "locks" that would block validation even though the load transaction hasn't committed yet: > a target row might have already been updated (or deleted or *locked*) by another concurrent transaction by the time it is found. In this case, the repeatable read transaction will wait for the first updating transaction to commit or roll back Other documentation I read referred to the state of the DB (when a transaction starts) as a "snapshot" and I thought... what if I could save such a snapshot automatically just *before* a back-end load starts, and use that snapshot for validation, such that my validation processes could use that to validate against and not encounter any locks? The validation will never commit, so there's no risk. Hmm. I don't think so. I know Django's ORM wouldn't support that, but I kind of hoped that someone in this email list might suggest a snapshot functionality as a possible solution. Since the validations never commit, the only downside would be if the backend load changed something that introduces a problem with the validated data that would not be fixed until we actually attempt to load it. Is that too science-fictiony of an idea? -- Adrian Klaver adrian.kla...@aklaver.com
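A minimal sketch of the subtransaction (SAVEPOINT) retry idea, using a hypothetical samples(id, name) table; nested blocks in PL/pgSQL use an implicit savepoint, which is the same mechanism the linked articles describe:

DO $$
BEGIN
    BEGIN  -- this inner block is a subtransaction (an implicit SAVEPOINT)
        INSERT INTO samples (id, name) VALUES (1, 'liver-01');
    EXCEPTION
        WHEN unique_violation OR foreign_key_violation THEN
            -- only this row's work is rolled back; the outer transaction survives
            RAISE NOTICE 'skipping row: %', SQLERRM;
    END;
END;
$$;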
Re: How to perform a long running dry run transaction without blocking
On 2/6/25 11:25, Robert Leach wrote: Have you considered a validation app? Have it read the inputs and look in db for conflicts, rather than attempt the insert. Zero transactions necessary I did consider that about a year or two ago when I first conceived the data validation interface. Doing that now would solve the problem of concurrent validations blocking one another, but selects would still get blocked if their result is pending the outcome of a load running in the background. If I mitigate that issue by running those loads over night on a schedule, I still lose out on the benefits of having the loading code do the validation for me... I would have to explicitly find and report on problems that the load exceptions currently do for me, without any extra code. So what I'm saying is that the data being validated is inter-dependent. There are about a dozen different loading scripts (one for each sheet of the uploaded excel file) whose runs are orchestrated by a master load script that ensures they are run in the right order so that the interdependent data can be checked. For example, these are some relative orders of what needs to be loaded so that data can be validated: Study > Animals Tracers > Infusates > Animals Treatments > Animals Tissues > Samples The Animal load script would fail if the new data in (the tables) Study, Tracers, Infusates, and Treatments aren't inserted, because it links to those newly created records. And there's no way to detect problems in those new relationships in the unchanged database if they aren't inserted. That's what doing this all in a transaction, and actually doing the inserts (for which I use Django `get_or_create` method calls) provides. In other words, I would have to save and explicitly check the inter-related sheet data in data structures independent of the database in order to find the equivalent of (for example) `ObjectDoesNotExist` errors that originate from the database. Right now, I get those errors caught "for free". All I have to do is tell the user what sheet/row/column is related to that error. And it saves me the overhead of having to maintain synchronicity between separate validation code and loading code when the loading code changes. Seems to me this could be dealt with using a schema named validate that contains 'shadow' tables of those in the live schema(s). Import into them and see what fails. Robert William Leach Research Software Engineer 133 Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544 -- Adrian Klaver adrian.kla...@aklaver.com
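A rough sketch of that shadow-table idea, assuming a live table public.samples (names invented):

CREATE SCHEMA IF NOT EXISTS validate;
-- LIKE ... INCLUDING ALL copies column definitions, defaults, NOT NULL/CHECK
-- constraints and indexes, but not foreign keys, so FKs pointing at the live
-- parent tables would have to be added separately:
CREATE TABLE validate.samples (LIKE public.samples INCLUDING ALL);

-- load the submission here, let constraint violations surface, then discard:
-- \copy validate.samples FROM 'samples.csv' WITH (FORMAT csv, HEADER)
DROP TABLE validate.samples;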
Re: Help in vetting Switch from "MD5" to "scram-sha-256" - during DB Upgrade from EC2- PGS - Community Edn ver 13.X to 15.X
On 2/6/25 16:37, Bharani SV-forum wrote: Team I am in the process of doing DB Upgrade from EC2- PGS - Community Edn ver 13.X to 15.X including switching from "MD5 " to "scram-sha-256" for password hashing and authentication. We are having tightly integrated appln tech stack having 256+ roles used by application with MD5 mechanism and having Password setting's replicated in each of the four server's (used to load balance the data flow to each server wise and each stream Wise (e.g LAYER#A-server#1,LAYER#A-server#2,LAYER#A-server#3,LAYER#A-server#4 LAYER#B-server#1,LAYER#B-server#2,LAYER#B-server#3,LAYER#B-server#4 etc and had been embedded in an config file in each server wise We want to have very minimum down time, during the PG upgrade from ver 13.X to 15.X, need to , how to switch over "MD5" to "scram-sha-256" for password hashing and authentication. https://www.postgresql.org/docs/15/auth-password.html "md5 The method md5 uses a custom less secure challenge-response mechanism. It prevents password sniffing and avoids storing passwords on the server in plain text but provides no protection if an attacker manages to steal the password hash from the server. Also, the MD5 hash algorithm is nowadays no longer considered secure against determined attacks. The md5 method cannot be used with the db_user_namespace feature. To ease transition from the md5 method to the newer SCRAM method, if md5 is specified as a method in pg_hba.conf but the user's password on the server is encrypted for SCRAM (see below), then SCRAM-based authentication will automatically be chosen instead. " This means you can upgrade with the md5 passwords and then change over to scram-sha-256 as needed. Need the best practice including steps to avoid mandated change of Application related Role Password during db migration ,as the # of roles/userid count is more. Regards Bharani -- Adrian Klaver adrian.kla...@aklaver.com
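A sketch of the per-role transition (the role name comes from the follow-up message; the password is a placeholder, and the plain-text password has to be supplied again, since an existing md5 hash cannot be converted in place):

-- make newly set passwords use SCRAM (can also be set cluster-wide in postgresql.conf)
SET password_encryption = 'scram-sha-256';

-- re-setting the password stores it as a SCRAM verifier
ALTER ROLE test_usr_1 WITH PASSWORD 'existing-or-new-password';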
Re: Help in vetting Switch from "MD5" to "scram-sha-256" - during DB Upgrade from EC2- PGS - Community Edn ver 13.X to 15.X
On 2/6/25 18:03, Bharani SV-forum wrote: Adrian, TQ for your valuable inputs. Additional Qsn: Assume DB ver = 15.X. By default encryption = scram-sha-256. Assume pg_hba.conf quotes the usage as MD5 for the dbuserid "test_usr_1", e.g.) hostssl all test_usr_1 10.20.30.40 md5 i.e.) Assume if the respective db userid (e.g. test_usr_1) is quoted for usage md5 in pg_hba.conf, there is no need to change the respective Role/Userid password mandatorily. The DB system will allow use of the existing password, i.e. the old MD5 passwords still work, as long as the authentication method in pg_hba.conf is set to md5? Yes. It gives you time to switch the passwords to scram-sha-256 encryption after you do the migration. In other words you can have both md5 and scram-sha-256 passwords in use without changing the pg_hba.conf lines. Once the transition to scram-sha-256 is done then you can change the lines to scram-sha-256 and that will prevent use of md5 passwords going forward. e.g.) hostssl all LOGS_USER_1 10.9.0.0/21 md5 Is there any security problem due to usage of md5 in the pg_hba.conf file with underlying db = 15.X? You are currently using it, have there been any issues? If not then moving to Postgres 15 won't change that. I am aware that (a) the MD5 hash algorithm is nowadays no longer considered secure against determined attacks, and (b) the MD5 method cannot be used with the db_user_namespace feature. -- Adrian Klaver adrian.kla...@aklaver.com
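To track the transition, the roles whose stored password is still an md5 hash can be listed (as superuser) with something like:

SELECT rolname FROM pg_authid WHERE rolpassword LIKE 'md5%';

Once that list is empty, the md5 lines in pg_hba.conf can be switched to scram-sha-256.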
How to perform a long running dry run transaction without blocking
I've been trying to solve this problem in Django and I've finally decided after over a year going down this path that there's no way to solve it in Django (in a database agnostic fashion). So I'm thinking that I need to explore a Postgres-specific solution. FYI, I'm not a database expert. I just know enough to get myself into trouble like this problem. Let me try and distill all the back story down to an oversimplified explanation: I created a scientific data submission validation interface that helps researchers compile their data submissions to our database. To do this, I decided to add a `--validate` option to the load scripts that essentially raises a dry run exception at the end of a load run so that the data is all rolled back before leaving the atomic transaction block. This validation interface skips the raw data load step, which is the heftiest, most long running, part and generally finishes in well under a minute. The interface works fantastically well. It rolls back problematic data in smaller transactions and buffers those errors for a final report that the user can work to fix in their data files and retry validation until all their problems are solved and the data is ready to load. The problem is that if we have a long running load going on in the background (which can take hours) and during that load, a user tries to validate a new submission that contains some overlapping common data (e.g. a new tissue type or protocol that is currently being loaded) or perhaps they are trying to validate data added to an existing submission that is being partially loaded, that validation process gets blocked and the validation interface encounters a gateway timeout. I had tried changing the isolation level to both repeatable read and serializable, but the hang can still occur (though serializable seems to avoid hangs in some cases that repeatable read does not). My initial interpretation of the isolation level documentation was that transactions would not be aware of what is happening in other transactions and that if there was a problem, you would only ever get a serialization error at the end when the data is committed, but after encountering the hang yesterday, I realized that the "snapshot" that the second transaction gets includes locks already established by the first transaction. I had hoped that since my validation interface would never commit, it would never have a serialization error or hang. What options are available that might allow this all to work as I intended? Is there some way to tell the validation process transaction to assume that any pending changes from another transaction will succeed? I have lots of ideas on how to mitigate these occurrences, but I don't want to bias any potentially helpful responses. Am I screwed from my early decision to use the loading code to validate data submissions?
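Stripped of the Django layer, the dry-run pattern described here boils down to roughly this (SQL sketch):

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- run the same INSERTs/UPDATEs the real load would run, collecting errors ...
ROLLBACK;  -- the "dry run": nothing is ever committed

The blocking happens because an INSERT or UPDATE that collides with an uncommitted row from the long-running load (a duplicate key, an already-locked row) has to wait to find out whether that other transaction commits, and that wait happens at every isolation level, regardless of the eventual ROLLBACK.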
Re: How to perform a long running dry run transaction without blocking
On 2/6/25 07:40, Robert Leach wrote: Comments inline. Let me try and distill all the back story down to an oversimplified explanation: I created a scientific data submission validation interface that helps researchers compile their data submissions to our database. To do this, I decided to add a `--validate` option to the load scripts that essentially raises a dry run exception at the end of a load run so that the data is all rolled back before leaving the atomic transaction block. Load to where existing table or temporary table? This validation interface skips the raw data load step, which is the heftiest, most long running, part and generally finishes in well under a minute. So what is the validation actually doing? The interface works fantastically well. It rolls back problematic data in smaller transactions and buffers those errors for a final report that the user can work to fix in their data files and retry validation until all their problems are solved and the data is ready to load. The problem is that if we have a long running load going on in the background (which can take hours) and during that load, a user tries to validate a new submission that contains some overlapping common data (e.g. a new tissue type or protocol that is currently being loaded) or perhaps they are trying to validate data added to an existing submission that is being partially loaded, that validation process gets blocked and the validation interface encounters a gateway timeout. Where are the background processes loading data to? I had tried changing the isolation level to both repeatable read and serializable, but the hang can still occur (though serializable seems to avoid hangs in some cases that repeatable read does not). My initial interpretation of the isolation level documentation was that transactions would not be aware of what is happening in other transactions and that if there was a problem, you would only ever get a serialization error at the end when the data is committed, but after encountering the hang yesterday, I realized that the "snapshot" that the second transaction gets includes locks already established by the first transaction. I had hoped that since my validation interface would never commit, it would never have a serialization error or hang. What options are available that might allow this all to work as I intended? Is there some way to tell the validation process transaction to assume that any pending changes from another transaction will succeed? Not that I know of and that would be a false promise anyway as I know of no process that is perfect. I have lots of ideas on how to mitigate these occurrences, but I don't want to bias any potentially helpful responses. Am I screwed from my early decision to use the loading code to validate data submissions? -- Adrian Klaver adrian.kla...@aklaver.com