Yes, I have used fork before in C and Python, but I have not tried it with web2py yet, so I don't know how web2py will behave. Does forking work well in web2py?
If not, I will just write a Twisted-based daemon which listens on a unix IPC socket for requests and, when the crawling is done, reports "number_of_processed_files" back to web2py. It can call back directly to a web2py controller function which waits for it, using urllib.

> Do not know what kind of files you are indexing maybe look at these:
> http://arcvback.com/python-indexing.html

What I am doing now is trying to parse any kind of file. Well, let's say it is a crawler (not an indexer) yet: it crawls and inserts into the database. For the indexer I am already using Sphinx, the fastest one from my tests.

On Sat, Aug 28, 2010 at 2:35 AM, Michele Comitini <michele.comit...@gmail.com> wrote:

> 2010/8/27 Phyo Arkar <phyo.arkarl...@gmail.com>:
> > strings is a very neat unix/linux command to extract strings (with more
> > than 4 chars by default) inside any type of file (even binary files,
> > images, etc.). So if there is a python implementation of it and you know
> > of one, I will use it. As Kevin shows, there is not much speed difference
> > in IO compared to C (even faster, hmmm, but not in every case I guess).
>
> I know strings, very useful ;-).
>
> > If not, as Michele suggested, I will fork. But forking inside web2py,
> > will it work? You mean outside of web2py? It will need IPC/sockets to
> > communicate between them; well, I can do it in twisted, but is that
> > really necessary?
>
> Yes, I mean forking, which means the process goes out of web2py's flow of
> control. twisted may be an option, but also look at the link below and
> choose what sounds better:
> http://wiki.python.org/moin/ParallelProcessing
>
> Forget about java threading stuff, you will see that
> it is much easier to do parallel programming with Python, enjoy! :-)
>
> > Oh, another thing: the indexer (as soon as the index is finished, it
> > just puts it inside the db and only responds with "done"), so your
> > suggestion to make a master process and to poll will work.
>
> You can make a list of files to be processed, compute the total size
> and have the master process report the percent of the job done...
> Do not know what kind of files you are indexing; maybe look at these:
> http://arcvback.com/python-indexing.html
>
> > On Fri, Aug 27, 2010 at 4:08 AM, Michele Comitini
> > <michele.comit...@gmail.com> wrote:
> >>
> >> Phyo,
> >>
> >> I agree mostly with what Kevin says, but some multiprocessing could be
> >> good for this case, unless "strings" is faster than the IO.
> >> Since the specific problem is not web specific, I suggest that you
> >> create a program in python (without web2py) and,
> >> as Kevin says, better if you replace "strings" with some python
> >> library function (if it is possible).
> >> The program should have a master process and, as Massimo
> >> suggests, a pool of parallel child tasks
> >> (I suggest fork if you are on linux for sure), probably no more than
> >> the number of cpus/cores available; you must tune that.
> >> The master process should be able to respond to status requests and
> >> eventually handle a clean shutdown through some handle.
> >> Once you have your plain python program functioning, make it a module
> >> and import it in a web2py controller. You should then be able
> >> to poll the master process (fork and keep the handle reference in
> >> session) in a web2py controller, so you can use a nice jquery widget to
> >> show progress.
> >>
> >> :-)
> >>
> >> mic
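For what it's worth, here is a rough, untested sketch of the master-process-plus-worker-pool idea described above, done with the standard multiprocessing module (which forks on linux). The extract_text() helper, the Master class and its method names are made-up placeholders for illustration, not anything from this thread:

import multiprocessing

def extract_text(path):
    # Placeholder for the real per-file work: replace this with the
    # "strings" call or a pure-python equivalent.
    f = open(path, 'rb')
    try:
        return f.read()
    finally:
        f.close()

class Master(object):
    """Drives a pool of workers and keeps a running progress count."""

    def __init__(self, paths):
        self.paths = paths
        self.done = 0
        self.results = []

    def _collect(self, result):
        # apply_async callback; runs in the master process.
        self.done += 1
        self.results.append(result)

    def start(self):
        # No more workers than cores, as suggested above; tune as needed.
        self.pool = multiprocessing.Pool(multiprocessing.cpu_count())
        for p in self.paths:
            self.pool.apply_async(extract_text, (p,), callback=self._collect)
        self.pool.close()

    def percent_done(self):
        return 100.0 * self.done / max(len(self.paths), 1)

A web2py controller could then import this as a module, call start() once, and have an ajax action return percent_done() for a jquery progress widget, roughly along the lines Michele suggests.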
> >> 2010/8/26 Kevin <extemporalgen...@gmail.com>:
> >> > Although there are many places where multiprocessing could be handy
> >> > and efficient, unfortunately string matching is one of those things
> >> > that is almost entirely I/O bound. With I/O bound tasks (particularly
> >> > the kind of processing you showed in your example code), you'll be
> >> > spending over 90% of your time waiting for the disk to supply the
> >> > data. A couple of characteristics of these kinds of tasks:
> >> >
> >> > * You will get essentially zero tangible total performance improvement
> >> > if you have a single hard drive, whether you're running single-threaded
> >> > on a single processor or 500,000 processes on a super-computer --
> >> > it'll all get completed in about the same number of seconds either way
> >> > (probably saving a little time going single-threaded).
> >> > * On python, I/O bound tasks complete in about the same amount of time
> >> > as the equivalent code written in pure ANSI C (see
> >> > http://www.pytips.com/2010/5/29/a-quick-md5sum-equivalent-in-python --
> >> > take the exact timings there with a grain of salt, but it's a pretty
> >> > good real-world example of what you'll see).
> >> >
> >> > So what I would do in your exact situation is to make the equivalent
> >> > of strings in pure python (the overhead of calling an external process
> >> > many times definitely will be noticeable), and instead just do it with
> >> > at most 2 threads (I would go single-threaded and only notice about an
> >> > estimated 2% increase in the total time required to complete all
> >> > processing).
> >> >
> >> > On Aug 20, 6:01 am, Phyo Arkar <phyo.arkarl...@gmail.com> wrote:
> >> >> well
> >> >>
> >> >> lets say i have about a thousand files to be processed. i need to
> >> >> extract text out of them, whatever file type it is (i use the Linux
> >> >> "strings" command).
> >> >>
> >> >> i want to do it in a multiprocessed way, which works on multi-core
> >> >> pcs too.
> >> >>
> >> >> this is my current implementation:
> >> >>
> >> >> import os, subprocess, shlex
> >> >>
> >> >> def __forcedParsing(fname):
> >> >>     cmd = 'strings "%s"' % (fname)
> >> >>     #print cmd
> >> >>     args = shlex.split(cmd)
> >> >>     try:
> >> >>         sp = subprocess.Popen(args, shell=False,
> >> >>                               stdout=subprocess.PIPE,
> >> >>                               stderr=subprocess.PIPE)
> >> >>         out, err = sp.communicate()
> >> >>     except OSError, e:
> >> >>         print "Error no %s Message %s" % (e.errno, e.strerror)
> >> >>         return
> >> >>
> >> >>     if sp.returncode == 0:
> >> >>         #print "Processed %s" % fname
> >> >>         return out
> >> >>
> >> >> def parseDocs():
> >> >>     rows_to_parse = [i for i in range(0, len(SESSION.all_docs))]
> >> >>     row_ids = [x[0] for x in SESSION.all_docs]
> >> >>     res = []
> >> >>     for rowID in rows_to_parse:
> >> >>         file_id, fname, ftype, dir = SESSION.all_docs[int(rowID)]
> >> >>         fp = os.path.join(dir, fname)
> >> >>         res.append(__forcedParsing(fp))
> >> >>
> >> >> well, the problem is that i need the output from the subprocess, so
> >> >> i have to read it using sp.communicate(). i need that to be
> >> >> multiprocessed (via forking? poll?)
> >> >>
> >> >> so here are my thoughts:
> >> >>
> >> >> 1) without using fork(), could I do multiple ajax posts by iterating
> >> >> the huge list of files on the client side to the server? the requests
> >> >> will be multi-threaded because of Rocket, right? But this may suffer
> >> >> performance issues on the client side.
> >> >>
> >> >> 2) Forking the current implementation and reading the output via
> >> >> polling? subprocess.poll()
> >> >>
> >> >> any ideas?
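As a reference point for Kevin's "do the equivalent of strings in pure python" suggestion, here is a rough, untested python 2 sketch of a minimal stand-in for the strings command; strings_from_file() is just an illustrative name, and it only approximates the default behaviour of strings (runs of 4 or more printable ASCII characters):

import string

# Characters treated as printable, roughly matching what "strings" looks
# for by default: printable ASCII including space, excluding the other
# whitespace/control characters.
PRINTABLE = set(string.printable) - set('\t\n\r\x0b\x0c')

def strings_from_file(path, minimum=4, blocksize=1024 * 1024):
    """Return runs of at least `minimum` printable characters found in path."""
    results = []
    current = []
    f = open(path, 'rb')
    try:
        for chunk in iter(lambda: f.read(blocksize), ''):
            for ch in chunk:
                if ch in PRINTABLE:
                    current.append(ch)
                else:
                    if len(current) >= minimum:
                        results.append(''.join(current))
                    current = []
    finally:
        f.close()
    # Flush a printable run that reaches the end of the file.
    if len(current) >= minimum:
        results.append(''.join(current))
    return results

With something like this, __forcedParsing() could return '\n'.join(strings_from_file(fp)) instead of spawning one external strings process per file, and the per-file work could then be handed to a worker pool like the one sketched after Michele's mail above.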