Although there are many places where multiprocessing could be handy
and efficient, unfortunately extracting strings from files is one of
those tasks that is almost entirely I/O bound.  With I/O bound tasks
(particularly the kind of processing you showed in your example code),
you'll be spending over 90% of your time waiting for the disk to
supply the data.  A couple of characteristics of these kinds of tasks:

* You will get essentially zero tangible performance improvement if
you have a single hard drive, whether you're running single-threaded
on a single processor or 500,000 processes on a supercomputer -- it
will all complete in about the same number of seconds either way (and
going single-threaded will probably even save you a little time).
* In Python, I/O bound tasks complete in about the same amount of time
as the equivalent code written in pure ANSI C (see
http://www.pytips.com/2010/5/29/a-quick-md5sum-equivalent-in-python --
take the exact timings there with a grain of salt, but it's a pretty
good real-world example of what you'll see).

So what I would do in your exact situation is write a pure-Python
equivalent of strings (the overhead of launching an external process
once per file will definitely be noticeable), and run it with at most
2 threads.  Personally I would go single-threaded and accept what I'd
estimate as roughly a 2% increase in the total time required to
complete all the processing.  Rough sketches of both follow.
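Here is a minimal sketch of what I mean by a pure-Python stand-in for
strings, written Python 2 style to match your code below.  The name
pystrings, the PRINTABLE set, and the 4-character minimum are my own
choices (GNU strings' actual notion of "printable" differs slightly),
so treat this as a starting point rather than a drop-in replacement:

import string

# Characters we treat as "printable" for the purposes of this sketch.
PRINTABLE = set(string.ascii_letters + string.digits +
                string.punctuation + ' \t')

def pystrings(path, min_len=4):
    # Scan the raw bytes of the file and collect every run of at
    # least min_len printable ASCII characters, like "strings" does.
    results = []
    current = []
    f = open(path, 'rb')
    try:
        data = f.read()
    finally:
        f.close()
    for ch in data:
        if ch in PRINTABLE:
            current.append(ch)
        else:
            if len(current) >= min_len:
                results.append(''.join(current))
            current = []
    if len(current) >= min_len:
        results.append(''.join(current))
    return results

And if you do want to try 2 threads anyway, plain threading plus a
Queue from the standard library is about as much machinery as the job
justifies (the worker/parse_all names are mine); since the work is
I/O bound, the GIL is not the limiting factor here:

import threading
import Queue  # named "queue" on Python 3

def worker(in_q, out_q):
    # Pull paths until we hit the None sentinel, extract their
    # strings, and push (path, text) pairs back out.
    while True:
        path = in_q.get()
        if path is None:
            break
        out_q.put((path, pystrings(path)))

def parse_all(paths, num_threads=2):
    in_q, out_q = Queue.Queue(), Queue.Queue()
    threads = [threading.Thread(target=worker, args=(in_q, out_q))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for path in paths:
        in_q.put(path)
    for _ in threads:
        in_q.put(None)   # one sentinel per worker
    for t in threads:
        t.join()
    results = {}
    while not out_q.empty():
        path, text = out_q.get()
        results[path] = text
    return results

With a single disk underneath, though, I'd still expect
parse_all(paths, num_threads=1) and num_threads=2 to finish in roughly
the same wall-clock time, for the reasons above.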

On Aug 20, 6:01 am, Phyo Arkar <phyo.arkarl...@gmail.com> wrote:
> well
>
> let's say I have about a thousand files to be processed.. I need to
> extract text out of them, whatever file type it is (I use the Linux
> "strings" command).
>
> I want to do it in a multiprocessed way, which works on multi-core PCs too.
>
> this is my current implementation:
>
> import os, subprocess, shlex
>
> def __forcedParsing(fname):
>         cmd = 'strings "%s"' % (fname)
>         args = shlex.split(cmd)
>         try:
>                 # run the external "strings" command and capture its output
>                 sp = subprocess.Popen(args, shell=False,
>                                       stdout=subprocess.PIPE,
>                                       stderr=subprocess.PIPE)
>                 out, err = sp.communicate()
>         except OSError, e:
>                 print "Error no %s  Message %s" % (e.errno, e.strerror)
>                 return None
>
>         if sp.returncode == 0:
>                 return out
>
> def parseDocs():
>         res = []
>         for rowID in range(len(SESSION.all_docs)):
>                 file_id, fname, ftype, dir = SESSION.all_docs[rowID]
>                 fp = os.path.join(dir, fname)
>                 res.append(__forcedParsing(fp))
>         return res
>
> well, the problem is I need the output from the subprocess, so I have
> to read it using sp.communicate(). I need that to be multiprocessed
> (via forking? poll?)
>
> so here are my thoughts:
>
> 1) Without using fork(), could I do multiple AJAX posts by iterating
> over the huge list of files on the client side to the server? Each
> request would be handled in its own thread because of Rocket, right?
> But might this suffer a performance issue on the client side?
>
> 2) Fork the current implementation, and read the output via polling
> (subprocess.poll())?
>
> any ideas?
