scheduler.py is a super-mega-ultra-extra nice 400 lines of code
that I like...

Right now I'm beginning to test it, and the basics are quite
understandable at first glance.

Before doing anything advanced, I fired up 8 workers and wrote a
controller queuing an insanely high number of small functions... I
noticed a little hang in next_task() and I think I found the fix,
which doesn't need a big change... line 224

return db(query).select(orderby=db.task_scheduled.next_run_time).first()

selects all rows, serializes them and then takes only the first one.
With 100,000 records in the task_scheduled table and 8 workers polling,
it becomes a little heavy. Transforming that line into

return db(query).select(limitby=(0, 1),
                        orderby=db.task_scheduled.next_run_time).first()

made it run (obviously) smoother.
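
For reference, the difference shows up directly in the SQL the DAL
generates, which you can inspect with _select() without hitting the
database; the SQL in the comments is roughly what I'd expect from the
Postgres adapter, so take it as a sketch:

# print the SQL without executing it
print(db(query)._select(orderby=db.task_scheduled.next_run_time))
# SELECT ... FROM task_scheduled WHERE ... ORDER BY task_scheduled.next_run_time;
print(db(query)._select(orderby=db.task_scheduled.next_run_time,
                        limitby=(0, 1)))
# same query, but ending in LIMIT 1 OFFSET 0;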

I'll be happy to fully test this new scheduler because right now I'm
forced to launch a relatively long function in a controller with an
ajax call... really suboptimal.

Ideally I'd like to have several tasks with no repetitions, so several
task_scheduled records are going to create a lot of task_run records.
I save the data I need to the application database inside the function
itself, so the task_run records aren't really needed.
I'd need to use the cleanup functions in scheduler.py, but I didn't
figure out how to add them using the

scheduler = Scheduler(db,dict(demo1=demo1,demo2=demo2))

syntax.

Also, I didn't figure out how to start scheduler.py in "standalone" mode,
specifically how to create the tasks.py file... any hint?

PS: on line 47 of the usage, "id =
scheduler.db.tast_scheduler.insert(....)" should really be "id =
scheduler.db.task_scheduled.insert(....)"

One question (here in Italy I'd say "non voglio rompere le uova nel
paniere"; the closest English idiom seems to be "I don't want to upset
the apple cart")...
Using SQLite with many workers is a mess of "OperationalError: database
is locked" errors, so I went to test it with Postgres...
One thing that blocked me from writing an async queue and its worker on
top of a database was that, in scheduler.py terms, next_task() fired
from different workers can fetch the same record.
While this particular occurrence is rare, for some operations (e.g.
sending out a single mail once a day to a user, or scheduling a
function that takes a loooooong time to execute) it is a pain in the ass.
With one worker everything works fine, obviously.
That's why Redis or RabbitMQ (just to name the first two that come up
in Google searches) are used to store scheduled tasks: they are designed
to pull the record off the equivalent of task_scheduled instantly,
ensuring that the task is effectively executed only once.
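
Just to show what I mean by "instantly": a minimal sketch with redis-py,
where the 'tasks' list name and the payload format are made up by me and
have nothing to do with scheduler.py:

import json
import redis

r = redis.Redis()

# producer side: push a task description onto a list
r.rpush('tasks', json.dumps({'function': 'demo1', 'args': []}))

# worker side: BLPOP removes and returns one item atomically, so even
# with 8 workers blocked on the same list only one of them gets it
item = r.blpop('tasks', timeout=5)
if item:
    _key, payload = item
    task = json.loads(payload)
    # ... look up task['function'] among the registered functions and run it ...
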
So, here's what I thought (and I'll try to reproduce it as soon as
possible):

1. next_task() fetches a task_scheduled record
2. then a record is inserted into task_run
3. then task_scheduled record gets updated
4. there's a db.commit()
5. function runs
etc etc

The following updates and commits are not important for this matter.
From 1. to just before 4., if another worker pulls a task with
next_task(), the task it fetches is the same one.
Relational databases come to the rescue here: smart Massimo put a
reference in task_run to task_scheduled, so, for example, if two
workers fetch the same record they are not allowed at the database
level to both start working, because the db.commit() on 4. succeeds for
the first worker but not for the second.
The second worker will crash, and it'll stop working.

If this assumption is right, and this is not totally insane rambling,
wouldn't it be safer to catch that exception and keep working, fetching
another record, instead of crashing?
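
Something like this inside the worker loop is what I have in mind;
next_task(), run() and the field names are just placeholders for
whatever scheduler.py actually does, so it's a sketch of the idea,
not a patch:

while True:
    task = next_task()          # 1. two workers may fetch the same row here
    if task is None:
        break
    try:
        db.task_run.insert(task_scheduled=task.id)  # 2. (field name is a guess)
        task.update_record(status='RUNNING')        # 3.
        db.commit()             # 4. only one of the competing workers succeeds
    except Exception:
        db.rollback()           # the losing worker gives up on this task...
        continue                # ...and goes back to fetch another one
    run(task)                   # 5. the winner actually executes the function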

I also came up with different ideas for a workaround, but I'll stop
here if the above-mentioned part is actually insane :D
