>>> The load to the development server does no validation?
>>> 
>>> If so what is the purpose?
>>> 
>>> The background processes are other validation runs?
>> It's the same code that executes in both cases (with or without the 
>> `--validate` flag).  All that flag does is (effectively) raise the dry-run 
>> exception before it leaves the transaction block, so it always validates 
>> (whether the flag is supplied or not).
> 
> More for my sake than anything else, why do the load to the development 
> server at all if the production load is the only one that counts?

The software is still in beta for a new major version.  We're adding features 
and fixing bugs.  It's not unusual to encounter a new bug, fix it on dev to get 
the load to work, then deploy a point release on prod.  That means repeated 
load attempts that would interfere with the validation interface.  Beyond that, 
we're planning a separate staging database, which is effectively the role dev 
plays now.  Sometimes a curator only finds a technical data issue after the 
initial load, while browsing the newly loaded data on the dev site.

>> So the load doesn't fail until the end of the run, which is inefficient from 
>> a maintenance perspective.  I've been thinking of adding a `--failfast` 
>> option for use on the back end.  Haven't done it yet.  I started a load 
>> yesterday in fact that ran 2 hours before it buffered an exception related 
>> to a newly introduced bug.  I fixed the bug and ran the load again.  It 
>> finished sometime between COB yesterday and this morning (successfully!).
> 
> Alright, I am trying to reconcile this with the statement below: 'The largest 
> studies take just under a minute'.

The context of the 'The largest studies take just under a minute' statement is 
that it's not loading the hefty/time-consuming raw data.  It's only validating 
the metadata.  That's fast (5-60s), and that metadata is a portion of the 
transaction in the back-end load.  There are errors that validation can miss 
because it doesn't touch the raw data, and in practice those errors are 
addressed by curators editing the Excel sheets.  That's why it's all in the 
load transaction instead of loaded separately, but those problems are somewhat 
rare (and we currently have a new feature in the design phase that should 
almost completely eliminate them).
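
To tie that back to the `--validate` flag mechanics above, the shape of the 
pattern is roughly this (a sketch in Django terms; `DryRun` and `do_load()` 
are placeholder names, not the real code):

```python
from django.core.management.base import BaseCommand
from django.db import transaction


class DryRun(Exception):
    """Raised inside the atomic block to force a rollback after validation."""


def do_load():
    """Placeholder for the real loading code (same code in both modes)."""


class Command(BaseCommand):
    def add_arguments(self, parser):
        parser.add_argument("--validate", action="store_true")

    def handle(self, *args, **options):
        try:
            with transaction.atomic():
                do_load()  # same code path either way; errors buffered as it runs
                if options["validate"]:
                    raise DryRun()  # leave the block via the exception, so nothing commits
        except DryRun:
            pass  # dry run: transaction rolled back; report the buffered errors
```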

>>> Seems you are looking for some sort of queuing system.
>>> 
>>> What are the time constraints for getting the validation turned around?
>> I have considered a queuing system, though when I previously floated a proof 
>> of concept using Celery, I was told it was too much.  At the time, though, 
>> all I was trying to do was add a progress bar for a query stats feature, so 
>> proposing Celery in this instance may get more traction with the rest of the 
>> team.
>> Most of the small validation processes finish in under a dozen seconds.   
>> The largest studies take just under a minute.  I have plans to optimize the 
>> loading scripts that hopefully could get the largest studies down to a dozen 
>> seconds.  If I could do that, and do the back end loads in off-peak hours, 
>> then I'd be willing to suffer the rare timeouts from concurrent validations. 
>>  The raw data loads will still likely take a much longer time.
> 
> This is where I get confused, probably because I am not exactly sure what 
> constitutes validation. My sense is that it involves loading data into live 
> tables and seeing what fails PK, FK, or other constraints.
> 
> If that is the case I am not seeing how the 'for real' data load would be 
> longer?

The validation skips the time-consuming raw data load.  That raw data is 
collectively hundreds of gigs in size and could not be uploaded on the 
validation page anyway.  The feature I alluded to above, the one that would 
almost completely eliminate errors associated with the raw data, is one where 
the researcher can drop the raw data folder onto the form and it just walks the 
directory to get all the raw data file names and relative paths.  It's those 
data relationships whose validations are currently skipped.
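
The walk itself would be trivial; wherever it ends up running (in the browser 
or in a helper script), conceptually it's just something like this (a sketch; 
`collect_raw_data_paths` is a made-up name):

```python
from pathlib import Path


def collect_raw_data_paths(dropped_folder):
    """Return the relative path of every file under the dropped folder.

    Only names/paths are collected; no file contents are read or uploaded,
    so the hundreds of gigs of raw data never leave the researcher's machine.
    """
    root = Path(dropped_folder)
    return sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())
```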

> At any rate I can't see how loading multiple sets of data into a live 
> database, while operations are going on in that database, can be made 
> conflict free. To me it seems the best that can be done is:
> 
> 1) Reduce chance for conflict by spreading the actions out.
> 
> 2) Have retry logic that deals with conflicts.

I'm unfamiliar with retry functionality, but those options sound logical to me 
as a good path forward, particularly using Celery to spread out validations and 
doing the back-end loads at night (or using some sort of fast dump/load).  The 
thing that bothers me about the Celery solution is that most of the time, 2 
users validating different data will not block each other, so I would be making 
users wait for no reason.  Ideally, I could anticipate the block and only then 
separate those validations.
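
If Celery does get traction this time, what I have in mind is along these 
lines (all of the names here are hypothetical, and the retry options are only 
there to illustrate the retry-logic suggestion above):

```python
from celery import shared_task
from django.db.utils import OperationalError


def run_validation(submission_id):
    """Placeholder for the existing dry-run validation load."""


@shared_task(
    bind=True,
    autoretry_for=(OperationalError,),  # e.g. lock waits / serialization failures
    retry_backoff=True,                 # spread retries out instead of hammering
    max_retries=3,
)
def validate_submission(self, submission_id):
    return run_validation(submission_id)

# Queue a validation right away:
#   validate_submission.delay(submission_id)
# or push a back-end load to off-peak hours:
#   some_load_task.apply_async((study_id,), eta=tonight_at_2am)
```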

This brings up a question, though, about a possibility I suspect is not 
practical.  My initial read of the isolation levels documentation found this 
section really promising:

> The Repeatable Read isolation level only sees data committed before the 
> transaction began; it never sees either uncommitted data or changes committed 
> during transaction execution by concurrent transactions.

This was before I realized that the actions of the previously started load 
transaction would include locks that block validation even though the load 
transaction hasn't committed yet:

> a target row might have already been updated (or deleted or locked) by 
> another concurrent transaction by the time it is found. In this case, the 
> repeatable read transaction will wait for the first updating transaction to 
> commit or roll back

Other documentation I read referred to the state of the DB when a transaction 
starts as a "snapshot", and I thought: what if I could save such a snapshot 
automatically just before a back-end load starts, and have my validation 
processes validate against that snapshot so that they never encounter any 
locks?  The validation will never commit, so there's no risk.

I know Django's ORM wouldn't support that, but I kind of hoped that someone on 
this email list might suggest a snapshot functionality as a possible solution.  
Since the validations never commit, the only downside would be if the back-end 
load changed something that introduced a problem with the validated data that 
would not be fixed until we actually attempt to load it.
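
To make the idea concrete, what I'm picturing is something like the following 
(an untested sketch; I'm assuming PostgreSQL's pg_export_snapshot() / SET 
TRANSACTION SNAPSHOT do what their names suggest, and psycopg2 is just standing 
in for however the connections would really be made):

```python
import psycopg2

DSN = "dbname=example"  # hypothetical connection string

# Back-end load side: a long-running transaction exports its snapshot id.
load_conn = psycopg2.connect(DSN)
load_conn.autocommit = True  # manage the transaction with explicit SQL below
load_cur = load_conn.cursor()
load_cur.execute("BEGIN ISOLATION LEVEL REPEATABLE READ")
load_cur.execute("SELECT pg_export_snapshot()")
snapshot_id = load_cur.fetchone()[0]
# ... the hours-long load runs here; the exported snapshot is only valid
#     while this transaction remains open ...

# Validation side: a second connection adopts that exact snapshot.
val_conn = psycopg2.connect(DSN)
val_conn.autocommit = True
val_cur = val_conn.cursor()
val_cur.execute("BEGIN ISOLATION LEVEL REPEATABLE READ")
val_cur.execute("SET TRANSACTION SNAPSHOT %s", (snapshot_id,))
# ... run the dry-run validation here against the pre-load state ...
val_cur.execute("ROLLBACK")  # the validation never commits

# Caveat: the snapshot controls what the validation *sees*; whether its own
# writes would still wait on the load transaction's locks is the open question.
```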

Is that too science-fictiony of an idea?
