Re: How to perform a long running dry run transaction without blocking

Adrian Klaver Thu, 06 Feb 2025 09:16:32 -0800

On 2/6/25 08:53, Robert Leach wrote:

Great questions!  Responses inline...


Please reply to list also.
Ccing list

Load to where existing table or temporary table?


Existing tables.  Note that (and this is a point of contention in our team), we 
have some fields that must be universally unique across all study data, so if 
we were to employ temporary (empty) tables, we would miss out on the uniqueness 
checks.


That is another discussion.


Background: The previous solution for this validation interface that I had tried, used a partial 
copy of the database which I called the "validation" database.  (I'd been forced to do it 
that way because previous developers hadn't used transactions and their "dry run" mode 
had side effects that I didn't want to happen due to the validation interface, so instead of engage 
in a massive refactor, I tried using a separate database as a quick temporary fix.)  But mind you, 
all of the code I was writing was in django (python), and in order to maintain 2 different 
databases and stay database architecture agnostic, I learned that your code must be littered with 
hundreds of insertions of `.using()` (and other) statements, and it turned out that that strategy 
is not 100% compatible with every kind of Django ORM thing you can do (e.g. `full_clean()`), so I'd 
had to dig around in django core code to ensure every operation was being performed on the correct 
database.  It was a nightmare to maintain and I happily ripped it all out when I corrected the 
original problems by wrapping everything in a transaction.


Yeah, reason why I bypass the ORM.

This validation interface skips the raw data load step, which is the heftiest, 
most long running, part and generally finishes in well under a minute.


So what is the validation actually doing?


The raw data is pretty solid.  It is automatically generated by hardware (mass 
spectrometer) and software (peak analysis suites).  So it doesn't need much (if 
any) validation.  What needs validation is all the metadata associated with the 
samples that is totally generated by hand by the researchers.

They enter the data in an excel spreadsheet containing about a dozen inter-related worksheets, named, for example: Study, Animals, Samples, Tissues, Treatments, Sequences


Well there's your problem:) I will spare you my full spreadsheet rant.

Long term is there a thought to have them enter directly into databasewhere validation can happen in real time

Where are the background processes loading data to?


We first run our loads on a development server with a separate copy of the 
database, but then we re-run those same loads on the production server, where 
users perform their validations.


The load to the development server does no validation?

If so what is the purpose?

The background processes are other validation runs?


One of the thoughts I'd had to work around the problem was to somehow dump the 
data from the development load and load it onto production in some sort of 
scheduled downtime or something.  However, even if we do that, I'm concerned 
that multiple users concurrently validating different submissions would 
encounter this blocking issue, and since those validations can take (on the 
upper end) just under a minute, it's enough for at least 1 user to encounter a 
timeout.  I have not yet proven that can happen, but based on my reading of the 
postgres documentation, it seems logical.


Seems you are looking for some sort of queuing system.

What are the time constraints for getting the validation turned around.

Not that I know of and that would be a false promise anyway as I know of no 
process that is perfect.


I realize that it's a false promise WRT the background load transaction, but 
it's a moot concern from the perspective of the validation page, because it is 
making a guaranteed promise that it will never commit.  All I want is for it to 
not be blocked so I can report as many errors as I can to the researcher so 
they can advance their submission compilation.  besides, there's a pretty solid 
bet that since the load succeeded on the development server, it will succeed on 
the production server where this block would happen.

So if the load transaction does fail, and the data the validation process was 
waiting on (to see the outcome) is not reported as problematic to the user, it 
will be reported as problematic to those responsible for the load on the 
back-end, so it gets dealt with either way.


--
Adrian Klaver
adrian.kla...@aklaver.com




Robert William Leach
Research Software Engineer
133 Carl C. Icahn Lab
Lewis-Sigler Institute for Integrative Genomics
Princeton University
Princeton, NJ 08544


--
Adrian Klaver
adrian.kla...@aklaver.com

Re: How to perform a long running dry run transaction without blocking

Reply via email to