On 2/6/25 11:25, Robert Leach wrote:
Have you considered a validation app? Have it read the inputs and look in db 
for conflicts, rather than attempt the insert. Zero transactions necessary


I did consider that about a year or two ago when I first conceived the data 
validation interface.  Doing that now would solve the problem of concurrent 
validations blocking one another, but selects would still get blocked if their 
result is pending the outcome of a load running in the background.  If I 
mitigate that issue by running those loads over night on a schedule, I still 
lose out on the benefits of having the loading code do the validation for me...

I would have to explicitly find and report on problems that the load exceptions 
currently do for me, without any extra code.

So what I'm saying is that the data being validated is inter-dependent.  There 
are about a dozen different loading scripts (one for each sheet of the uploaded 
excel file) whose runs are orchestrated by a master load script that ensures 
they are run in the right order so that the interdependent data can be checked. 
 For example, these are some relative orders of what needs to be loaded so that 
data can be validated:

Study > Animals
Tracers > Infusates > Animals
Treatments > Animals
Tissues > Samples

The Animal load script would fail if the new data in (the tables) Study, 
Tracers, Infusates, and Treatments aren't inserted, because it links to those 
newly created records.  And there's no way to detect problems in those new 
relationships in the unchanged database if they aren't inserted.  That's what 
doing this all in a transaction, and actually doing the inserts (for which I 
use Django `get_or_create` method calls) provides.

In other words, I would have to save and explicitly check the inter-related sheet data in 
data structures independent of the database in order to find the equivalent of (for 
example) `ObjectDoesNotExist` errors that originate from the database.  Right now, I get 
those errors caught "for free".  All I have to do is tell the user what 
sheet/row/column is related to that error.  And it saves me the overhead of having to 
maintain synchronicity between separate validation code and loading code when the loading 
code changes.

Seems to me this could be dealt with using a schema named validate that contains 'shadow' tables of those in the live schema(s). Import into their and see what fails.


Robert William Leach
Research Software Engineer
133 Carl C. Icahn Lab
Lewis-Sigler Institute for Integrative Genomics
Princeton University
Princeton, NJ 08544


--
Adrian Klaver
adrian.kla...@aklaver.com



Reply via email to