Hi all,

@Roy: Unix split sounds good, but will it be as efficient as opening 10 different file handles on the same file? I haven't tried it, so I'm wondering whether you have any experience with it.
Thanks for your input. I also was not aware of Python's GIL limitation. As far as I can tell, my application is not I/O bound: each line is read and then processed independently of the others. It may sound I/O intensive because N files will be read, but I think that with 10 processes running under one parent, I/O might not be the bottleneck.

Best,
-Abhi

On Fri, Sep 9, 2011 at 6:19 AM, Roy Smith <r...@panix.com> wrote:
> In article
> <c6cbd486-7e5e-4d26-93b9-088d48a25...@g9g2000yqb.googlegroups.com>,
> aspineux <aspin...@gmail.com> wrote:
>
>> On Sep 9, 12:49 am, Abhishek Pratap <abhishek....@gmail.com> wrote:
>> > 1. My input file is 10 GB.
>> > 2. I want to open 10 file handles each handling 1 GB of the file
>> > 3. Each file handle is processed in by an individual thread using the
>> > same function ( so total 10 cores are assumed to be available on the
>> > machine)
>> > 4. There will be 10 different output files
>> > 5. once the 10 jobs are complete a reduce kind of function will
>> > combine the output.
>> >
>> > Could you give some ideas ?
>>
>> You can use "multiprocessing" module instead of thread to bypass the
>> GIL limitation.
>
> I agree with this.
>
>> First cut your file in 10 "equal" parts. If it is line based search
>> for the first line close to the cut. Be sure to have "start" and
>> "end" for each parts, start is the address of the first character of
>> the first line and end is one line too much (== start of the next
>> block)
>
> How much of the total time will be I/O and how much actual processing?
> Unless your processing is trivial, the I/O time will be relatively
> small. In that case, you might do well to just use the unix
> command-line "split" utility to split the file into pieces first, then
> process the pieces in parallel. Why waste effort getting the
> file-splitting-at-line-boundaries logic correct when somebody has done
> it for you?
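
For reference, the "cut the file into parts at line boundaries, then hand each part to a process" idea from the quoted thread could be sketched roughly as below. This is only a minimal illustration under some assumptions: the names `chunk_boundaries`, `process_chunk`, `handle_line` and `parallel_process` are made up for this example, `handle_line` is a placeholder that just counts lines, and the "reduce" step is a plain sum of per-chunk results rather than a merge of 10 output files.

```python
import os
from multiprocessing import Pool


def chunk_boundaries(path, n_chunks):
    """Return (start, end) byte ranges that split the file at line boundaries."""
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            # Jump near the i-th "equal" cut, then advance to the next line start.
            f.seek(size * i // n_chunks)
            f.readline()
            pos = f.tell()
            # Skip cuts that fall in the same line as the previous one or at EOF.
            if offsets[-1] < pos < size:
                offsets.append(pos)
    offsets.append(size)
    return list(zip(offsets[:-1], offsets[1:]))


def handle_line(line):
    """Hypothetical per-line work; here it only counts the line."""
    return 1


def process_chunk(task):
    """Process every line whose first byte lies in [start, end)."""
    path, start, end = task
    result = 0
    pos = start
    with open(path, "rb") as f:
        f.seek(start)
        while pos < end:
            line = f.readline()
            if not line:
                break
            pos += len(line)
            result += handle_line(line)
    return result


def parallel_process(path, n_workers=10):
    """Map chunks onto worker processes, then reduce the partial results."""
    tasks = [(path, start, end) for start, end in chunk_boundaries(path, n_workers)]
    with Pool(processes=n_workers) as pool:
        return sum(pool.map(process_chunk, tasks))
```

Calling `parallel_process("input.txt")` from under an `if __name__ == "__main__":` guard would run the whole thing (the file name is just an example). Each worker opens its own file handle, so nothing is shared between processes except the path and the byte offsets. Whether this beats running split first probably depends on how expensive the per-line work is compared to reading the data.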
-- 
http://mail.python.org/mailman/listinfo/python-list