Yes, I have used forks before in C and Python.

But I have not tried it with web2py yet, so I don't know how web2py will behave.
Does forking work well with web2py?

If not, I will just write a Twisted-based daemon which listens on a unix IPC
socket for requests and, once the crawling is done, reports
"number_of_processed_files" back to web2py. It can call back directly to a
web2py controller function, which waits for it using urllib.
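
Something along these lines is what I have in mind (just a rough sketch; the
socket path, the callback URL and run_crawl() are placeholders, not real code):

# rough sketch of the daemon idea above: a Twisted server on a unix socket
# that kicks off the crawl and reports the count back to a web2py action.
from twisted.internet import reactor, protocol
import urllib

WEB2PY_CALLBACK = "http://127.0.0.1:8000/myapp/crawler/report"   # hypothetical

class CrawlControl(protocol.Protocol):
    def dataReceived(self, data):
        if data.strip() == "start":
            # a real version should use twisted.internet.threads.deferToThread
            # here so the crawl does not block the reactor
            processed = run_crawl()
            self.transport.write("done %d\n" % processed)
            urllib.urlopen(WEB2PY_CALLBACK + "?count=%d" % processed).read()

def run_crawl():
    return 0   # placeholder for the actual crawling code

factory = protocol.Factory()
factory.protocol = CrawlControl
reactor.listenUNIX("/tmp/crawler.sock", factory)
reactor.run()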

> I do not know what kind of files you are indexing; maybe look at these:
> http://arcvback.com/python-indexing.html
>

What I am doing now is trying to parse any kind of file. Well, let's say it
is a crawler (not an indexer) yet: it crawls and inserts into the database.

For the indexer I am already using Sphinx, the fastest one from my tests.

On Sat, Aug 28, 2010 at 2:35 AM, Michele Comitini <
michele.comit...@gmail.com> wrote:

> 2010/8/27 Phyo Arkar <phyo.arkarl...@gmail.com>:
> > strings is a very neat unix/linux command to extract strings (with more
> > than 4 chars by default) from inside any type of file (even binary files,
> > images, etc.).
> > So if there is a python implementation of it and you know of one, I will
> > use it. As Kevin shows, there is not much speed difference in IO compared
> > to C (even faster, hmm, but not in every case I guess).
> I know strings, very useful ;-).
>
> >
> > If not, as Michele suggested, I will fork. But will forking inside web2py
> > work? Or do you mean outside of web2py? That will need IPC/sockets to
> > communicate between the processes; well, I can do it in twisted, but is
> > that really necessary?
> >
> Yes, I mean forking, which means the process goes out of web2py's flow of
> control.
> Twisted may be an option, but also look at the link below and choose what
> sounds better:
> http://wiki.python.org/moin/ParallelProcessing
>
> Forget about the java threading stuff; you will see that it is much easier
> to do parallel programming with Python. Enjoy! :-)
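
For the forking part I am thinking of something like multiprocessing.Pool
(rough sketch only; extract_text() and the file list are placeholders for my
real crawler code):

import multiprocessing, subprocess

def extract_text(path):
    # placeholder worker: run strings (or a pure python version) on one file
    out, err = subprocess.Popen(["strings", path],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE).communicate()
    return path, out

if __name__ == "__main__":
    files = ["/tmp/a.doc", "/tmp/b.pdf"]               # placeholder list
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    for path, text in pool.imap_unordered(extract_text, files):
        pass                                           # insert into db here
    pool.close()
    pool.join()
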
>
> > Oh, another thing: the indexer (as soon as indexing is finished, it just
> > puts the result into the db and only responds with "done"), so your
> > suggestion to make a master process and poll it will work.
>
> You can make a list of files to be processed, compute the total size
> and have the master process report percent of job done...
> I do not know what kind of files you are indexing; maybe look at these:
> http://arcvback.com/python-indexing.html
>
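
Yes, that should work. For the percent reporting I guess something like this
(a sketch only; it assumes the workers are forked on linux so they inherit
the shared counter):

import os, multiprocessing

# shared byte counter, inherited by forked workers on linux
done_bytes = multiprocessing.Value('L', 0)

def percent_done(files):
    total = sum(os.path.getsize(f) for f in files)
    return 100.0 * done_bytes.value / total if total else 100.0

# in each worker, after a file has been processed:
#     with done_bytes.get_lock():
#         done_bytes.value += os.path.getsize(path)
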
> >
> > On Fri, Aug 27, 2010 at 4:08 AM, Michele Comitini
> > <michele.comit...@gmail.com> wrote:
> >>
> >> Phyo,
> >>
> >> I agree mostly with what Kevin says, but some multiprocessing could be
> >> good for this case, unless "strings" is faster than the IO.
> >> Since the specific problem is not web specific, I suggest that you first
> >> create a program in python (without web2py) and, as Kevin says, preferably
> >> replace "strings" with some python library function (if possible).
> >> The program should consist of a master process and, as Massimo suggests,
> >> its pool of parallel child tasks (I suggest fork if you are on linux for
> >> sure), probably no more than the number of cpus/cores available; you must
> >> tune that.
> >> The master process should be able to respond to status requests and
> >> eventually handle a clean shutdown through some handle.
> >> Once you have your plain python program functioning, make it a module and
> >> import it in a web2py controller.  You should then be able to poll the
> >> master process (fork and keep the handle reference in session) in a web2py
> >> controller, so you can use a nice jquery widget to show progress.
> >>
> >> :-)
> >>
> >> mic
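
On the web2py side I guess the controller would look roughly like this (just
a sketch; "crawlmod" and its start()/progress() functions are imaginary names
for the plain python module, not a real API):

# controllers/crawler.py (sketch only)
import crawlmod                     # the plain python module described above

def start():
    # fork/launch the master process and remember how to find it again
    session.crawl_pid = crawlmod.start(session.all_docs)
    return dict(started=True)

def status():
    # polled via ajax by a jquery progress widget
    return dict(percent=crawlmod.progress(session.crawl_pid))
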
> >>
> >> 2010/8/26 Kevin <extemporalgen...@gmail.com>:
> >> > Although there are many places where multiprocess'ing could be handy
> >> > and efficient, unfortunately string matching is one of those things
> >> > that is almost entirely I/O bound.  With I/O bound tasks (particularly
> >> > the kind of processing you showed in your example code), you'll be
> >> > spending over 90% of your time waiting for the disk to supply the
> >> > data.  A couple of characteristics of these kinds of tasks:
> >> >
> >> > * You will get essentially zero tangible total performance improvement
> >> > if you have a single hard drive whether you're running single threaded
> >> > on a single processor, or 500,000 processes on a super-computer --
> >> > it'll all get completed in about the same number of seconds either way
> >> > (probably saving a little time going single-threaded).
> >> > * On python, I/O bound tasks complete in about the same amount of time
> >> > as the equivalent code written in pure ANSI C (see
> >> > http://www.pytips.com/2010/5/29/a-quick-md5sum-equivalent-in-python--
> >> > take the exact timings there with a grain of salt, but it's a pretty
> >> > good real-world example of what you'll see).
> >> >
> >> > So what I would do in your exact situation is to make the equivalent
> >> > to strings in pure python (the overhead of calling an external process
> >> > many times definitely will be noticeable), and instead just do it with
> >> > at most 2 threads (I would go single threaded and only notice about an
> >> > estimated 2% increase in the total time required to complete all
> >> > processing).
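
A pure python strings could be quite small, I think; something like this
(a sketch, matching the 4-character default of the real strings command):

import re

# runs of 4 or more printable ascii characters, like strings' default
PRINTABLE_RUN = re.compile(r"[ -~]{4,}")

def py_strings(path):
    with open(path, "rb") as f:
        return PRINTABLE_RUN.findall(f.read())
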
> >> >
> >> > On Aug 20, 6:01 am, Phyo Arkar <phyo.arkarl...@gmail.com> wrote:
> >> >> Well,
> >> >>
> >> >> let's say I have about a thousand files to be processed. I need to
> >> >> extract text out of them, whatever the file type is (I use the Linux
> >> >> "strings" command).
> >> >>
> >> >> I want to do it in a multi-processed way, which works on multi-core PCs
> >> >> too.
> >> >>
> >> >> this is my current implementation :
> >> >>
> >> >> import os, subprocess, shlex
> >> >>
> >> >> def __forcedParsing(fname):
> >> >>         cmd = 'strings "%s"' % (fname)
> >> >>         #print cmd
> >> >>         args = shlex.split(cmd)
> >> >>         try:
> >> >>                 sp = subprocess.Popen(args, shell=False,
> >> >>                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> >> >>                 out, err = sp.communicate()
> >> >>         except OSError, e:
> >> >>                 # bind the exception instance; OSError.errno on the
> >> >>                 # class is not the error of this call
> >> >>                 print "Error no %s  Message %s" % (e.errno, e.strerror)
> >> >>                 return None
> >> >>
> >> >>         if sp.returncode == 0:
> >> >>                 #print "Processed %s" % fname
> >> >>                 return out
> >> >>
> >> >> def parseDocs():
> >> >>         rows_to_parse = range(len(SESSION.all_docs))
> >> >>         row_ids = [x[0] for x in SESSION.all_docs]
> >> >>         res = []
> >> >>         for rowID in rows_to_parse:
> >> >>                 file_id, fname, ftype, dir = SESSION.all_docs[int(rowID)]
> >> >>                 fp = os.path.join(dir, fname)
> >> >>                 res.append(__forcedParsing(fp))
> >> >>         return res
> >> >>
> >> >> Well, the problem is that I need the output from the subprocess, so I
> >> >> have to read it using sp.communicate(). I need that to be multiprocessed
> >> >> (via forking? poll?).
> >> >>
> >> >> So here are my thoughts:
> >> >>
> >> >> 1) Without using fork(), could I do multiple ajax posts by iterating
> >> >> over the huge list of files on the client side? Each request would be
> >> >> multi-threaded because of Rocket, right? But might this suffer
> >> >> performance issues on the client side?
> >> >>
> >> >> 2) Fork the current implementation and read the output via polling
> >> >> (subprocess.poll())?
> >> >>
> >> >> Any ideas?
> >
> >
>
