On Sep 9, 12:49 am, Abhishek Pratap <abhishek....@gmail.com> wrote:
> Hi Guys
>
> My experience with python is 2 days and I am looking for a slick way
> to use multi-threading to process a file. Here is what I would like
> to do, which is somewhat similar to MapReduce in concept.
>
> # test case
>
> 1. My input file is 10 GB.
> 2. I want to open 10 file handles, each handling 1 GB of the file.
> 3. Each file handle is processed by an individual thread using the
> same function (so a total of 10 cores are assumed to be available
> on the machine).
> 4. There will be 10 different output files.
> 5. Once the 10 jobs are complete, a reduce kind of function will
> combine the output.
>
> Could you give some ideas?
You can use "multiprocessing" module instead of thread to bypass the GIL limitation. First cut your file in 10 "equal" parts. If it is line based search for the first line close to the cut. Be sure to have "start" and "end" for each parts, start is the address of the first character of the first line and end is one line too much (== start of the next block) Then use this function to handle each part . def handle(filename, start, end) f=open(filename) f.seek(start) for l in f: start+=len(l) if start>=end: break # handle line l here print l Do it first in a single process/thread to be sure this is ok (easier to debug) then split in multi processes > > So given a file I would like to read it in #N chunks through #N file > handles and process each of them separately. > > Best, > -Abhi -- http://mail.python.org/mailman/listinfo/python-list