Phyo, I mostly agree with what Kevin says, but some multiprocessing could still help in this case, unless "strings" turns out to be faster than the I/O. Since the problem is not web-specific, I suggest you write a plain Python program (without web2py) and, as Kevin says, replace "strings" with a Python library function if that is possible.

The program should have a master process and, as Massimo suggests, a pool of parallel child tasks (I suggest fork, if you are sure you are on Linux), probably no more than the number of CPUs/cores available; you will have to tune that. The master process should be able to respond to status requests and eventually perform a clean shutdown through some handle.

Once your plain Python program works, make it a module and import it in a web2py controller. You should then be able to poll the master process from the controller (fork it and keep the handle reference in session), so you can use a nice jQuery widget to show progress.
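Here is a rough sketch of the kind of standalone module I mean. It is only an illustration under some assumptions: it is written for Python 3 on Linux (the code in this thread is Python 2), the names extract_strings, parse_file, parse_all and the report callback are all made up for the example, and the regular expression only approximates what "strings" prints by default (runs of at least 4 printable ASCII characters).

```python
import multiprocessing
import re

# Runs of 4 or more printable ASCII characters (tab included).
# This only approximates the default behaviour of the "strings" command.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e\t]{4,}")


def extract_strings(data):
    """Return the printable ASCII runs found in a bytes object."""
    return [run.decode("ascii") for run in PRINTABLE_RUN.findall(data)]


def parse_file(path):
    """Worker task: read one file and return (path, text), or (path, None) on error."""
    try:
        with open(path, "rb") as fh:
            return path, "\n".join(extract_strings(fh.read()))
    except OSError:
        return path, None


def parse_all(paths, workers=None, report=None):
    """Master: spread the files over a pool of forked children.

    report, if given, is called as report(done, total) after each file,
    which is the hook a status request could read its numbers from.
    """
    workers = workers or multiprocessing.cpu_count()
    ctx = multiprocessing.get_context("fork")  # assumption: Linux/Unix only
    results = {}
    with ctx.Pool(processes=workers) as pool:
        # imap_unordered yields each result as soon as a child finishes it.
        finished = pool.imap_unordered(parse_file, paths)
        for done, (path, text) in enumerate(finished, 1):
            results[path] = text
            if report is not None:
                report(done, len(paths))
    return results
```

You would call something like parse_all(list_of_paths, workers=4, report=my_callback) and time it against workers=1. If Kevin is right that the job is I/O bound, the two timings will be close, and you can drop the pool altogether.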
:-)

mic

2010/8/26 Kevin <extemporalgen...@gmail.com>:
> Although there are many places where multiprocessing could be handy
> and efficient, unfortunately string matching is one of those things
> that is almost entirely I/O bound. With I/O bound tasks (particularly
> the kind of processing you showed in your example code), you'll be
> spending over 90% of your time waiting for the disk to supply the
> data. A couple of characteristics of these kinds of tasks:
>
> * You will get essentially zero tangible total performance improvement
> if you have a single hard drive whether you're running single threaded
> on a single processor, or 500,000 processes on a super-computer --
> it'll all get completed in about the same number of seconds either way
> (probably saving a little time going single-threaded).
> * On python, I/O bound tasks complete in about the same amount of time
> as the equivalent code written in pure ANSI C (see
> http://www.pytips.com/2010/5/29/a-quick-md5sum-equivalent-in-python --
> take the exact timings there with a grain of salt, but it's a pretty
> good real-world example of what you'll see).
>
> So what I would do in your exact situation is to make the equivalent
> to strings in pure python (the overhead of calling an external process
> many times definitely will be noticeable), and instead just do it with
> at most 2 threads (I would go single threaded and only notice about an
> estimated 2% increase in the total time required to complete all
> processing).
>
> On Aug 20, 6:01 am, Phyo Arkar <phyo.arkarl...@gmail.com> wrote:
>> well
>>
>> lets say i have about a thousand files to be processed .. i need to
>> extract text out of them , whatever file type it is (i use Linux
>> "strings") command .
>>
>> i want to do in multi processed way , which works on multi-core pcs too.
>>
>> this is my current implementation :
>>
>> import os
>> import shlex
>> import subprocess
>>
>> def __forcedParsing(fname):
>>     # Run "strings" on one file; return its stdout, or None on failure.
>>     cmd = 'strings "%s"' % (fname)
>>     #print cmd
>>     args = shlex.split(cmd)
>>     try:
>>         sp = subprocess.Popen(args, shell=False, stdout=subprocess.PIPE,
>>                               stderr=subprocess.PIPE)
>>         out, err = sp.communicate()
>>     except OSError as e:
>>         # Report the caught exception instance, not the OSError class,
>>         # and stop here because sp was never created.
>>         print("Error no %s Message %s" % (e.errno, e.strerror))
>>         return None
>>
>>     if sp.returncode == 0:
>>         #print "Processed %s" %fname
>>         return out
>>
>> def parseDocs():
>>     rows_to_parse = [i for i in range(0, len(SESSION.all_docs))]
>>     row_ids = [x[0] for x in SESSION.all_docs]
>>     res = []
>>     for rowID in rows_to_parse:
>>         file_id, fname, ftype, dir = SESSION.all_docs[int(rowID)]
>>         fp = os.path.join(dir, fname)
>>         res.append(__forcedParsing(fp))
>>
>> well the problem is i need output from subprocess so i have to read
>> using sp.communicate(). i need that to be multiprocessed (via forking?
>> poll?)
>>
>> so here are my thoughts :
>>
>> 1) without using fork() , could I do multiple ajax posts by
>> iterating the huge list of files at client side to server , each
>> processes will be multi-threaded because of Rocket right? But may this
>> suffer performance issue on client side?
>>
>> 2) Forking Current implementation, and read output via polling?
>> subprocess.poll()
>>
>> any ideas?