Michael --

> Here's a common scenario. I'm looking for the best implementation using the
> scheduler.
>
> I want to support a set of background tasks (task1, task2...), where each
> task:
> • processes a queue of items
> • waits a few seconds
>
> It's safe to have task1 and task2 running in parallel, but I cannot have
> two task1s running in parallel. They will duplicately process the same
> queue of items
> ....
>
> So how can I ensure there is always EXACTLY ONE of each task in the
> database?
This won't solve your installation / setup issue, but I wonder if it would help with the overrun and timeout problems... Instead of scheduling a periodic task, what about having the task reschedule itself? When it's done with the queue, it schedules itself for later. Remove the time limit so it can take whatever time it needs to finish the queue. Or maybe launch a process on startup outside of the scheduler -- when it exhausts the queue, have it sleep and either wake periodically to check the queue, or have it woken when something is inserted. (Rough sketches of both ideas are at the end of this message.) Is the transaction-processing issue you encountered with PostgreSQL preventing you from setting up your queue as a real producer-consumer queue, where you could have multiple workers?

Re. inserting tasks only once: we have a "first run" check in our models to ensure that setup code only runs once -- it only runs if the database is empty -- but that's not adequate if you update code on a running system and add a new task. We added an "update check" using a version number -- we write a breadcrumb file into the models directory with the current version, and then check that against a version in the code that the developers change whenever some update code needs to run or the site needs to take some action. You might do something like that to insert new tasks just once.

(Details: the breadcrumb file is named so it runs first, before the other models, and contains one statement that sets a global with the version number found during the previous models run. The first "real" model file compares that last version against the current version. If the breadcrumb file didn't exist or the version is different, it runs some update code and rewrites the breadcrumb file. IIRC we open the breadcrumb file for exclusive access and spin if it's locked -- I'll need to make sure I actually did that... There's a sketch of this at the end, too.)

I don't think this would help with your case, but I'll mention it... I'm working on chaining scheduler tasks -- letting one task conditionally release held tasks or insert new ones. Our need was different from yours -- we didn't know which task(s) we wanted to run until we had read remote data (via a task for that purpose). So our reader task fetches the data, figures out what needs to run, puts work in queues, and releases previously scheduled tasks. Since this mod made changes like giving all tasks unique names independent of the task function, there may be issues with having a task reschedule itself on an unmodified scheduler that I'm not thinking of.

As an aside, there's always a problem with processing items in a queue (at least if the items are consumable rather than a persistent to-do list), namely: how do you ensure that each item is completely processed, and that the work within one item gets done only once, if the worker processing it might fail in the middle of an item? If the worker takes the item out of the queue before starting work, the item is lost if the worker dies. If it leaves the item in the queue but marks it as being worked on by itself, another worker can redo it, but then runs into the problem of picking up where the previous one left off. For a database, that might be solved with transactions and rollback (assuming that's working...) -- also sketched at the end. This isn't a problem with the scheduler per se; it's a generic queue-processing issue.

I'm probably missing some aspect of your situation, so let me say sorry! in advance if this isn't relevant.

-- Pat
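P.S. In case a concrete sketch helps, here's roughly what I mean by a task that reschedules itself. It assumes web2py's scheduler; the work_queue table and process_item() helper are made-up names, and I haven't run this:

    from datetime import datetime, timedelta
    from gluon.scheduler import Scheduler

    def task1():
        # drain whatever is in the queue right now
        for item in db(db.work_queue.kind == 'task1').select():
            process_item(item)                       # made-up per-item worker
            db(db.work_queue.id == item.id).delete()
            db.commit()
        # done for now -- queue the next run of myself a few seconds out
        scheduler.queue_task(task1,
                             start_time=datetime.now() + timedelta(seconds=5),
                             timeout=24 * 60 * 60,   # big value as a stand-in,
                                                     # since I don't know of an
                                                     # explicit "no limit" setting
                             repeats=1)

    scheduler = Scheduler(db, dict(task1=task1))

You'd still need to queue the very first task1 exactly once (which is where the breadcrumb idea below comes in), and one thing to check is whether the rescheduling insert survives if the task itself fails -- I'm not sure it does, so the chain might stop there.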
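The "process outside the scheduler" alternative is even simpler -- a loop you start once at boot (for example with web2py's -S/-M/-R shell options, if I remember the flags right) that sleeps whenever the queue is empty. Same made-up names, untested:

    # e.g. python web2py.py -S yourapp -M -R applications/yourapp/private/worker.py
    import time

    def worker_loop(poll_seconds=5):
        while True:
            # exhaust the queue...
            for item in db(db.work_queue.kind == 'task1').select():
                process_item(item)                   # made-up per-item worker
                db(db.work_queue.id == item.id).delete()
                db.commit()
            # ...then sleep and wake periodically to re-check
            time.sleep(poll_seconds)

    worker_loop()

Waking it on insert instead of polling would need some notification mechanism on top of this, so I've only sketched the polling version.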
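The version-breadcrumb check looks roughly like this. The file names and run_update_code() are made up, and I've left out the exclusive-lock handling for brevity; model files run alphabetically, which is why the breadcrumb file is named to sort first:

    # models/0_breadcrumb.py -- one statement, rewritten by the update code below
    LAST_SEEN_VERSION = 3

    # models/1_setup.py -- the first "real" model
    import os

    CURRENT_VERSION = 4          # developers bump this when update code must run

    if globals().get('LAST_SEEN_VERSION') != CURRENT_VERSION:
        run_update_code()        # made-up: e.g. insert the new scheduler tasks once
        path = os.path.join(request.folder, 'models', '0_breadcrumb.py')
        with open(path, 'w') as f:
            f.write('LAST_SEEN_VERSION = %d\n' % CURRENT_VERSION)

If the breadcrumb file is missing (fresh install), the global is undefined and the update code runs; in our real version we also lock the breadcrumb file so two requests don't both run the update.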
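And the transactions-and-rollback idea for the generic queue problem might look like this on PostgreSQL, using the DAL's for_update argument if I'm remembering the keyword right (again untested, made-up names):

    def claim_and_process_one():
        # lock one item; the lock and any changes hold until commit/rollback
        item = db(db.work_queue.id > 0).select(
            limitby=(0, 1), for_update=True).first()
        if not item:
            db.commit()
            return False
        try:
            process_item(item)                       # made-up per-item worker
            db(db.work_queue.id == item.id).delete()
            db.commit()                              # the work and the removal
                                                     # commit (or fail) together
        except Exception:
            db.rollback()                            # die mid-item: it stays queued
            raise
        return True

The obvious caveats: only database-side work rolls back (anything process_item does to the outside world doesn't), and with several workers the others will block on the locked row rather than skip past it.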