Phyo,

I mostly agree with what Kevin says, but some multiprocessing could still
be good for this case, unless "strings" is faster than the I/O.
Since the specific problem is not web specific, I suggest that you first
write a plain Python program (without web2py) and, as Kevin says,
preferably replace "strings" with some Python library function (if that
is possible).
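For example, something in the spirit of this sketch might do (untested,
and it only mimics the common case of GNU strings, i.e. runs of 4 or more
printable ASCII characters):

import re

# printable ASCII (space..~) plus tab, in runs of at least 4 characters
_RUN = re.compile(b'[\x20-\x7e\t]{4,}')

def python_strings(path):
    # reads the whole file at once; for very large files read in chunks
    with open(path, 'rb') as f:
        data = f.read()
    return _RUN.findall(data)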
The program should have a master process managing, as Massimo suggests, a
pool of parallel child tasks (I suggest fork if you are definitely on
Linux), probably no more than the number of CPUs/cores available; you
will have to tune that.
The master process should be able to respond to status requests and
eventually handle a clean shutdown through some kind of handle.
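Something along these lines could be a starting point (a rough, untested
sketch; the module layout and the names are mine, not an existing API):

import multiprocessing
import subprocess

def extract_strings(path):
    # one child task: run "strings" (or a pure-python replacement) on one file
    proc = subprocess.Popen(['strings', path],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return path, out

class Master(object):
    def __init__(self, paths, workers=None):
        # no more workers than cpus/cores, as suggested above; tune as needed
        self.pool = multiprocessing.Pool(workers or multiprocessing.cpu_count())
        self.total = len(paths)
        self.results = []
        for p in paths:
            # apply_async keeps the master free to answer status requests
            self.pool.apply_async(extract_strings, (p,),
                                  callback=self.results.append)
        self.pool.close()

    def status(self):
        # (files done, files in total); cheap enough to poll often
        return len(self.results), self.total

    def shutdown(self):
        # clean shutdown: stop the children and wait for them to exit
        self.pool.terminate()
        self.pool.join()

On Linux, multiprocessing.Pool forks its workers, which matches the fork
suggestion above.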
Once you have your plain Python program working, make it a module and
import it in a web2py controller.  You should then be able to poll the
master process (fork it and keep the handle reference in the session)
from a web2py controller, so you can use a nice jQuery widget to show
progress.
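The web2py side could then look roughly like this (again only a sketch;
"extractor" stands for the hypothetical module above, which is assumed to
also keep a small module-level registry of running Masters and to expose
start_job(paths) and job_status(job_id) wrappers around it, because the
Master object itself cannot be pickled into the session, so only a job id
is kept there):

# controllers/parsing.py (hypothetical names, just to show the shape)
import os
import extractor   # or via the modules folder / local_import, depending on version

def start():
    # launch the background job once per session and remember its id;
    # assumes session.all_docs holds (file_id, fname, ftype, dir) tuples
    # as in the original code
    if 'job_id' not in session:
        paths = [os.path.join(d, f) for i, f, t, d in session.all_docs]
        session.job_id = extractor.start_job(paths)
    return dict(job_id=session.job_id)

def progress():
    # polled via ajax by the jQuery widget, e.g. at /app/parsing/progress.json,
    # where the generic.json view serializes the returned dict
    done, total = extractor.job_status(session.job_id)
    return dict(done=done, total=total)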

:-)

mic

2010/8/26 Kevin <extemporalgen...@gmail.com>:
> Although there are many places where multiprocessing could be handy
> and efficient, unfortunately string matching is one of those things
> that is almost entirely I/O bound.  With I/O bound tasks (particularly
> the kind of processing you showed in your example code), you'll be
> spending over 90% of your time waiting for the disk to supply the
> data.  A couple of characteristics of these kinds of tasks:
>
> * You will get essentially zero tangible total performance improvement
> if you have a single hard drive whether you're running single threaded
> on a single processor, or 500,000 processes on a super-computer --
> it'll all get completed in about the same number of seconds either way
> (probably saving a little time going single-threaded).
> * In Python, I/O-bound tasks complete in about the same amount of time
> as the equivalent code written in pure ANSI C (see
> http://www.pytips.com/2010/5/29/a-quick-md5sum-equivalent-in-python --
> take the exact timings there with a grain of salt, but it's a pretty
> good real-world example of what you'll see).
>
> So what I would do in your exact situation is to make the equivalent
> of strings in pure Python (the overhead of calling an external process
> many times will definitely be noticeable), and instead just do it with
> at most 2 threads (I would go single threaded and only notice about an
> estimated 2% increase in the total time required to complete all
> processing).
>
> On Aug 20, 6:01 am, Phyo Arkar <phyo.arkarl...@gmail.com> wrote:
>> well
>>
>> let's say I have about a thousand files to be processed. I need to
>> extract text out of them, whatever the file type is (I use the Linux
>> "strings" command).
>>
>> I want to do it in a multiprocessed way, one that works on multi-core PCs too.
>>
>> this is my current implementation:
>>
>> import os
>> import shlex
>> import subprocess
>>
>> def __forcedParsing(fname):
>>     # run the external "strings" command and capture its output
>>     args = shlex.split('strings "%s"' % fname)
>>     try:
>>         sp = subprocess.Popen(args, shell=False,
>>                               stdout=subprocess.PIPE,
>>                               stderr=subprocess.PIPE)
>>         out, err = sp.communicate()
>>     except OSError as e:
>>         print "Error no %s  Message %s" % (e.errno, e.strerror)
>>         return None
>>
>>     if sp.returncode == 0:
>>         return out
>>
>> def parseDocs():
>>     res = []
>>     for file_id, fname, ftype, dir in SESSION.all_docs:
>>         fp = os.path.join(dir, fname)
>>         res.append(__forcedParsing(fp))
>>     return res
>>
>> well, the problem is I need the output from the subprocess, so I have
>> to read it using sp.communicate(). I need that to be multiprocessed
>> (via forking? poll?)
>>
>> so here are my thoughts:
>>
>> 1) Without using fork(), could I do multiple ajax posts by iterating
>> over the huge list of files on the client side, one request per file?
>> Each request would be handled in its own thread because of Rocket,
>> right? But might this cause performance issues on the client side?
>>
>> 2) Fork the current implementation, and read the output via polling?
>> subprocess.poll()
>>
>> any ideas?
