2010/8/27 Phyo Arkar <phyo.arkarl...@gmail.com>:
> strings is a very neat unix/linux command to extract strings (of at least 4
> chars by default) from any type of file (even binary files, images,
> etc).
> So if there is a python implementation of it and you know of one, I will use
> it. As Kevin shows, there's not much speed difference in IO compared to C
> (even faster, hmm, but not in every case I guess).
I know strings, very useful ;-).
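
If nobody points to a ready-made module, a rough pure-Python stand-in is not much code. This is only a sketch: it assumes that "printable" means ASCII 0x20 to 0x7e plus tab, it reads the whole file into memory, and the names extract_strings and strings_in_file are made up for the example.

```python
import re

def extract_strings(data, min_len=4):
    """Return runs of at least min_len printable ASCII bytes found in data."""
    # Printable here means space (0x20) through ~ (0x7e), plus tab.
    # Treating tab as printable is an assumption of this sketch.
    pattern = re.compile(br"[\t\x20-\x7e]" + b"{%d,}" % min_len)
    return pattern.findall(data)

def strings_in_file(path, min_len=4):
    """Read a file in binary mode and extract its printable runs."""
    with open(path, "rb") as f:
        return extract_strings(f.read(), min_len)
```

For very large files you would probably want to read in chunks instead of calling f.read() once, taking care of runs that straddle a chunk boundary.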

>
> If not, as Michele suggested, I will fork. But forking inside web2py, will
> it work? You mean outside of web2py? It will need IPC/sockets to communicate
> between them. Well, I can do it in twisted, but is that really necessary?
>
Yes, I mean forking, which means the process leaves web2py's flow of control.
twisted may be an option, but also look at the page below and choose what sounds better:
http://wiki.python.org/moin/ParallelProcessing

Forget about java threading stuff; you will see that
it is much easier to do parallel programming with Python, enjoy! :-)
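
One option that ships with Python is the multiprocessing module. The sketch below is only an illustration under assumptions: the worker does a simple regex scan for printable runs instead of calling the external strings command, and extract_from_file and parse_all are hypothetical names.

```python
import multiprocessing
import re

# Runs of 4 or more printable ASCII bytes (tab included); this is an
# assumption about what a strings-like scan should keep.
_PRINTABLE_RUN = re.compile(br"[\t\x20-\x7e]{4,}")

def extract_from_file(path):
    """Worker: return (path, list of printable byte strings) for one file."""
    with open(path, "rb") as f:
        return path, _PRINTABLE_RUN.findall(f.read())

def parse_all(paths, workers=None):
    """Map extract_from_file over paths; workers=None means one per core."""
    if workers == 1:
        # Serial fallback: no child processes are started at all.
        return dict(map(extract_from_file, paths))
    pool = multiprocessing.Pool(processes=workers)
    try:
        return dict(pool.map(extract_from_file, paths))
    finally:
        pool.close()
        pool.join()
```

The workers=1 branch is there so you can time the serial run against the pool and see whether the extra processes buy anything on your disk. On platforms that start children by spawning rather than forking, call parse_all from under an `if __name__ == "__main__":` guard.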

> Oh, another thing: the indexer (as soon as the index is finished, it just puts
> it inside the db and only responds with done), so your suggestion to make a
> master process and to poll it will work.

You can make a list of files to be processed, compute the total size
and have the master process report percent of job done...
I do not know what kind of files you are indexing; maybe look at these:
http://arcvback.com/python-indexing.html
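
The percent-done idea could look roughly like this. It is a sketch, not a finished master process: process_with_progress and handle_file are hypothetical names, and progress is weighted by file size on the assumption that processing time grows with size.

```python
import os

def process_with_progress(paths, handle_file):
    """Yield (path, percent_done) after each file, weighted by file size."""
    paths = list(paths)
    sizes = [os.path.getsize(p) for p in paths]
    total = sum(sizes)
    done = 0
    for path, size in zip(paths, sizes):
        handle_file(path)
        done += size
        # Guard against a job that contains only empty files.
        percent = 100.0 * done / total if total else 100.0
        yield path, percent
```

The master process would store the latest percent somewhere the web2py controller can read it, and the controller just returns that number when polled.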

>
> On Fri, Aug 27, 2010 at 4:08 AM, Michele Comitini
> <michele.comit...@gmail.com> wrote:
>>
>> Phyo,
>>
>> I agree mostly with what Kevin says, but some multiprocessing could be
>> good for the case, unless "strings" is faster than IO.
>> Since the specific problem is not web specific, I suggest that you do
>> create a program in python (without web2py) and,
>>  as Kevin says, better if you replace "strings" with some python
>> library function (if it is possible).
>> The program should have a master process and, as Massimo
>> suggests, its pool of parallel child tasks
>>  (I suggest fork if you are on linux for sure), probably no more than
>> the number of cpus/cores available; you must tune that.
>>  The master process should be able to respond to status requests and
>> eventually handle a clean shutdown through some handle.
>> Once you have your plain python program functioning, make it a module
>> and import it in a web2py controller.  You should then be able
>> to poll the master process (fork and keep the handle reference in
>> session) in a web2py controller so you can use a nice jquery widget to
>> show progress.
>>
>> :-)
>>
>> mic
>>
>> 2010/8/26 Kevin <extemporalgen...@gmail.com>:
>> > Although there are many places where multiprocess'ing could be handy
>> > and efficient, unfortunately string matching is one of those things
>> > that is almost entirely I/O bound.  With I/O bound tasks (particularly
>> > the kind of processing you showed in your example code), you'll be
>> > spending over 90% of your time waiting for the disk to supply the
>> > data.  A couple of characteristics of these kinds of tasks:
>> >
>> > * You will get essentially zero tangible total performance improvement
>> > if you have a single hard drive whether you're running single threaded
>> > on a single processor, or 500,000 processes on a super-computer --
>> > it'll all get completed in about the same number of seconds either way
>> > (probably saving a little time going single-threaded).
>> > * On python, I/O bound tasks complete in about the same amount of time
>> > as the equivalent code written in pure ANSI C (see
>> > http://www.pytips.com/2010/5/29/a-quick-md5sum-equivalent-in-python --
>> > take the exact timings there with a grain of salt, but it's a pretty
>> > good real-world example of what you'll see).
>> >
>> > So what I would do in your exact situation is to make the equivalent
>> > to strings in pure python (the overhead of calling an external process
>> > many times definitely will be noticeable), and instead just do it with
>> > at most 2 threads (I would go single threaded and only notice about an
>> > estimated 2% increase in the total time required to complete all
>> > processing).
>> >
>> > On Aug 20, 6:01 am, Phyo Arkar <phyo.arkarl...@gmail.com> wrote:
>> >> well
>> >>
>> >> Let's say I have about a thousand files to be processed. I need to
>> >> extract text out of them, whatever file type it is (I use the Linux
>> >> "strings" command).
>> >>
>> >> I want to do it in a multi-process way, which works on multi-core PCs
>> >> too.
>> >>
>> >> this is my current implementation :
>> >>
>> >> import os
>> >> import subprocess
>> >>
>> >> def __forcedParsing(fname):
>> >>         # Pass the arguments as a list so file names containing
>> >>         # quotes or spaces need no shell-style quoting.
>> >>         args = ['strings', fname]
>> >>         try:
>> >>                 sp = subprocess.Popen(args, shell=False,
>> >>                                       stdout=subprocess.PIPE,
>> >>                                       stderr=subprocess.PIPE)
>> >>                 out, err = sp.communicate()
>> >>         except OSError as e:
>> >>                 # errno and strerror live on the exception instance,
>> >>                 # not on the OSError class.
>> >>                 print("Error no %s  Message %s" % (e.errno, e.strerror))
>> >>                 return None
>> >>
>> >>         if sp.returncode == 0:
>> >>                 return out
>> >>         return None
>> >>
>> >> def parseDocs():
>> >>         res = []
>> >>         # SESSION.all_docs holds (file_id, fname, ftype, dir) rows.
>> >>         for file_id, fname, ftype, dir in SESSION.all_docs:
>> >>                 fp = os.path.join(dir, fname)
>> >>                 res.append(__forcedParsing(fp))
>> >>         return res
>> >>
>> >> well the problem is i need output from subprocess so i have to read
>> >> using sp.communicate(). i need that to be multiprocessed (via forking?
>> >> poll?)
>> >>
>> >> So here are my thoughts:
>> >>
>> >> 1) Without using fork(), could I do multiple ajax posts to the server by
>> >> iterating over the huge list of files on the client side? Each
>> >> process will be multi-threaded because of Rocket, right? But might this
>> >> suffer performance issues on the client side?
>> >>
>> >> 2) Forking Current implementation, and read output via polling?
>> >> subprocess.poll()
>> >>
>> >> any ideas?
>
>
