Thanks for the advice, Dennis. @Steve: I haven't actually written the code yet. I was thinking about this more generically and wanted to check whether the idea made sense, and I now realize the answer largely depends on the I/O. For starters I was just thinking about counting the lines in a file without doing any computation, so that case should be strictly I/O bound.
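To make that concrete, the kind of thing I had in mind is roughly the following (untested sketch; the path "big.txt", the 4 workers, and the 1 MB read size are placeholders I made up). Each process seeks to its own slice of the file and just counts newlines, so any speedup would have to come purely from overlapping the disk reads:

import os
from multiprocessing import Pool

FILENAME = "big.txt"   # placeholder path
NWORKERS = 4           # placeholder worker count
CHUNK = 1 << 20        # read 1 MB at a time within each slice

def count_newlines(args):
    # count the newlines in the byte range [start, start + length)
    start, length = args
    count = 0
    with open(FILENAME, "rb") as f:
        f.seek(start)
        remaining = length
        while remaining > 0:
            block = f.read(min(CHUNK, remaining))
            if not block:
                break
            count += block.count(b"\n")
            remaining -= len(block)
    return count

if __name__ == "__main__":
    size = os.path.getsize(FILENAME)
    slice_size = size // NWORKERS
    # last slice absorbs the remainder so the whole file is covered
    slices = [(i * slice_size,
               slice_size if i < NWORKERS - 1 else size - i * slice_size)
              for i in range(NWORKERS)]
    pool = Pool(NWORKERS)
    total = sum(pool.map(count_newlines, slices))
    pool.close()
    pool.join()
    print(total)

Timing that against a plain single-process loop over the same file would tell me whether concurrent reads buy anything on our mounts, which is really the question I'm asking below.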
I guess what I really need to ask is whether we can improve on the
existing disk I/O performance by reading different portions of the file
using threads or processes. I am kind of pointing towards a MapReduce
task on a file in a shared file system such as GPFS (from IBM). I
realize this may be more suited to HDFS, but I wanted to know if people
have implemented something similar on a normal Linux-based NFS.

-Abhi

On Mon, Mar 26, 2012 at 6:44 PM, Steve Howell <showel...@yahoo.com> wrote:
> On Mar 26, 3:56 pm, Abhishek Pratap <abhishek....@gmail.com> wrote:
>> Hi Guys
>>
>> I am fwding this question from the python tutor list in the hope of
>> reaching more people experienced in concurrent disk access in python.
>>
>> I am trying to see if there are ways in which I can read a big file
>> concurrently on a multi core server, process the data, and write the
>> output to a single file as the data is processed.
>>
>> For example, if I have a 50GB file, I would like to read it in parallel
>> with 10 processes/threads, each working on 5GB of the data, perform the
>> same data-parallel computation on each chunk of the file, and collate
>> the output into a single file.
>>
>> I will appreciate your feedback. I did find some threads about this on
>> stackoverflow but it was not clear to me what would be a good way to
>> go about implementing this.
>>
>
> Have you written a single-core solution to your problem? If so, can
> you post the code here?
>
> If CPU isn't your primary bottleneck, then you need to be careful not
> to overly complicate your solution by getting multiple cores
> involved. All the coordination might make your program slower and
> more buggy.
>
> If CPU is the primary bottleneck, then you might want to consider an
> approach where only a single thread reads records from the file, 10 at
> a time, dispatches the calculations to different threads, and then
> writes the results back to disk.
>
> My approach would be something like this:
>
> 1) Take a small sample of your dataset so that you can process it
> within 10 seconds or so using a simple, single-core program.
> 2) Figure out whether you're CPU bound. A simple way to do this is
> to comment out the actual computation or replace it with a trivial
> stub. If you're CPU bound, the program will run much faster. If
> you're I/O bound, the program won't run much faster (since all the work
> is actually just reading from disk).
> 3) Figure out how to read 10 records at a time and farm out the
> records to threads. Hopefully, your program will take significantly
> less time. At this point, don't obsess over collating data. It might
> not be 10 times as fast, but it should be enough faster to be worth
> your while.
> 4) If the threaded approach shows promise, make sure that you can
> still generate correct output with that approach (in other words,
> figure out synchronization and collating).
>
> At the end of that experiment, you should have a better feel for where
> to go next.
>
> What is the nature of your computation? Maybe it would be easier to
> tune the algorithm than to figure out the multi-core optimization.
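PS @Steve: if I follow your single-reader suggestion correctly, it would look roughly like this (again an untested sketch; process_record, the pool of 4 workers, and the batch size of 10 are placeholders for the real computation and tuning). One process reads and batches the lines, the pool does the computation, and the results are written from one place so collating stays simple:

from multiprocessing import Pool

def process_record(line):
    # placeholder for the real per-record computation
    return len(line)

def batches(f, size=10):
    # yield lists of `size` lines read by the single reader process
    batch = []
    for line in f:
        batch.append(line)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def process_batch(batch):
    return [process_record(line) for line in batch]

if __name__ == "__main__":
    pool = Pool(4)
    with open("big.txt") as infile, open("out.txt", "w") as outfile:
        # imap returns results in input order, so writing them as they
        # arrive already collates the output correctly
        for results in pool.imap(process_batch, batches(infile, 10)):
            for r in results:
                outfile.write("%s\n" % r)
    pool.close()
    pool.join()

If the I/O-bound test above shows the disk is the bottleneck anyway, I'd skip the pool entirely, as you suggest.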