On Tue, Apr 26, 2011 at 12:55 PM, Hans Georg Schaathun <ge...@schaathun.net> wrote:
> I wonder if anyone has any experience with this ...
>
> I try to set up a simple client-server system to do some number
> crunching, using a simple ad hoc protocol over TCP/IP. I use
> two Queue objects on the server side to manage the input and the output
> of the client process. A basic system running seemingly fine on a single
> quad-core box was surprisingly simple to set up, and it seems to give
> me a reasonable speed-up of a factor of around 3-3.5 using four client
> processes in addition to the master process. (If anyone wants more
> details, please ask.)
>
> Now, I would like to use remote hosts as well, more precisely, student
> lab boxen which are rather unreliable. By experience I'd expect to
> lose roughly 4-5 jobs in 100 CPU hours on average. Thus I need some
> way of detecting lost connections and requeue unfinished tasks,
> avoiding any serious delays in this detection. What is the best way to
> do this in python?
>
> It is, of course, possible for the master thread upon processing the
> results, to requeue the tasks for any missing results, but it seems
> to me to be a cleaner solution if I could detect disconnects and
> requeue the tasks from the networking threads. Is that possible
> using python sockets?
>
> Somebody will probably ask why I am not using one of the multiprocessing
> libraries. I have tried at least two, and got trapped by the overhead
> of passing complex pickled objects across. Doing it myself has at least
> helped me clarify what can be parallelised effectively. Now,
> understanding the parallelisable subproblems better, I could try again,
> if I can trust that these libraries can robustly handle lost clients.
> That I don't know if I can.

You should probably assign a unique identifier to each piece of work and
implement two timeouts: one on your socket, using select or poll or
similar, and one on each piece of work, keyed by its identifier.

http://gengnosis.blogspot.com/2007/01/level-triggered-and-edge-triggered.html
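
Here is a rough sketch of how the two timeouts might fit together. It is not
taken from your code: WorkTracker, recv_or_none, the "client" key and the
injectable clock are all hypothetical names chosen for illustration, and it
assumes tasks sit in an ordinary queue.Queue as in your description.

```python
import itertools
import queue
import select
import socket
import time


class WorkTracker:
    """Hypothetical helper: remembers dispatched tasks by unique id."""

    def __init__(self, task_queue, task_timeout, clock=time.monotonic):
        self.task_queue = task_queue      # queue the workers are fed from
        self.task_timeout = task_timeout  # seconds allowed per task
        self.clock = clock                # injectable so it can be tested
        self._ids = itertools.count(1)
        self._pending = {}                # task_id -> (task, client, deadline)

    def dispatch(self, task, client):
        """Record that `task` was sent to `client`; return its unique id."""
        task_id = next(self._ids)
        deadline = self.clock() + self.task_timeout
        self._pending[task_id] = (task, client, deadline)
        return task_id

    def complete(self, task_id):
        """True if the result is still wanted, False if late or duplicate."""
        return self._pending.pop(task_id, None) is not None

    def requeue_client(self, client):
        """Requeue everything outstanding on a client that disconnected."""
        lost = [tid for tid, (_, c, _) in self._pending.items() if c == client]
        return self._requeue(lost)

    def requeue_expired(self):
        """Requeue every task whose deadline has passed."""
        now = self.clock()
        late = [tid for tid, (_, _, d) in self._pending.items() if d <= now]
        return self._requeue(late)

    def _requeue(self, task_ids):
        for tid in task_ids:
            task, _, _ = self._pending.pop(tid)
            self.task_queue.put(task)
        return len(task_ids)


def recv_or_none(sock, timeout, bufsize=4096):
    """Return data, b"" if the peer went away, None if the timeout expired."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return None
    try:
        return sock.recv(bufsize)
    except OSError:
        return b""  # treat a reset connection like a clean close
```

A networking thread could call recv_or_none in a loop: on b"" it calls
requeue_client for that connection, and on None it calls requeue_expired.
A box that is powered off without closing its socket may not show up as a
disconnect for a long time, which is why the per-task deadline is still
needed. Results that arrive after a requeue are recognised because
complete() returns False for them.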