On 2/6/25 08:53, Robert Leach wrote:
Great questions! Responses inline...
Please reply to list also.
Ccing list
Load to where existing table or temporary table?
Existing tables. Note that (and this is a point of contention in our team), we
have some fields that must be universally unique across all study data, so if
we were to employ temporary (empty) tables, we would miss out on the uniqueness
checks.
That is another discussion.
Background: The previous solution for this validation interface that I had tried, used a partial
copy of the database which I called the "validation" database. (I'd been forced to do it
that way because previous developers hadn't used transactions and their "dry run" mode
had side effects that I didn't want to happen due to the validation interface, so instead of engage
in a massive refactor, I tried using a separate database as a quick temporary fix.) But mind you,
all of the code I was writing was in django (python), and in order to maintain 2 different
databases and stay database architecture agnostic, I learned that your code must be littered with
hundreds of insertions of `.using()` (and other) statements, and it turned out that that strategy
is not 100% compatible with every kind of Django ORM thing you can do (e.g. `full_clean()`), so I'd
had to dig around in django core code to ensure every operation was being performed on the correct
database. It was a nightmare to maintain and I happily ripped it all out when I corrected the
original problems by wrapping everything in a transaction.
Yeah, reason why I bypass the ORM.
This validation interface skips the raw data load step, which is the heftiest,
most long running, part and generally finishes in well under a minute.
So what is the validation actually doing?
The raw data is pretty solid. It is automatically generated by hardware (mass
spectrometer) and software (peak analysis suites). So it doesn't need much (if
any) validation. What needs validation is all the metadata associated with the
samples that is totally generated by hand by the researchers.
They enter the data in an excel spreadsheet containing about a dozen inter-related worksheets, named, for example: Study, Animals, Samples, Tissues, Treatments, Sequences
Well there's your problem:) I will spare you my full spreadsheet rant.
Long term is there a thought to have them enter directly into database
where validation can happen in real time
Where are the background processes loading data to?
We first run our loads on a development server with a separate copy of the
database, but then we re-run those same loads on the production server, where
users perform their validations.
The load to the development server does no validation?
If so what is the purpose?
The background processes are other validation runs?
One of the thoughts I'd had to work around the problem was to somehow dump the
data from the development load and load it onto production in some sort of
scheduled downtime or something. However, even if we do that, I'm concerned
that multiple users concurrently validating different submissions would
encounter this blocking issue, and since those validations can take (on the
upper end) just under a minute, it's enough for at least 1 user to encounter a
timeout. I have not yet proven that can happen, but based on my reading of the
postgres documentation, it seems logical.
Seems you are looking for some sort of queuing system.
What are the time constraints for getting the validation turned around.
Not that I know of and that would be a false promise anyway as I know of no
process that is perfect.
I realize that it's a false promise WRT the background load transaction, but
it's a moot concern from the perspective of the validation page, because it is
making a guaranteed promise that it will never commit. All I want is for it to
not be blocked so I can report as many errors as I can to the researcher so
they can advance their submission compilation. besides, there's a pretty solid
bet that since the load succeeded on the development server, it will succeed on
the production server where this block would happen.
So if the load transaction does fail, and the data the validation process was
waiting on (to see the outcome) is not reported as problematic to the user, it
will be reported as problematic to those responsible for the load on the
back-end, so it gets dealt with either way.
--
Adrian Klaver
adrian.kla...@aklaver.com
Robert William Leach
Research Software Engineer
133 Carl C. Icahn Lab
Lewis-Sigler Institute for Integrative Genomics
Princeton University
Princeton, NJ 08544
--
Adrian Klaver
adrian.kla...@aklaver.com