On 30/08/12 23:19, Abhishek Pratap wrote:
> I am wondering how I can go about reading data from this at a faster
> pace and then farm out the jobs to worker functions using the
> multiprocessing module.
>
> I can think of two ways.
>
> 1. split the file and read it in parallel (didn't work well for me),
> primarily because I don't know how to read a file in parallel
> efficiently.
Can you show us what you tried? It's always easier to give an answer to
a concrete example than to a hypothetical scenario.
> 2. keep reading the file sequentially into a buffer of some size and
> farm out chunks of the data through multiprocessing.
This is the model I've used. In pseudo-code:

chunk = []
for line, data in enumerate(file):
    chunk.append(data)
    if (line + 1) % chunksize == 0:   # chunk is full, hand it off
        launch_subprocess(chunk)
        chunk = []
if chunk:                             # don't forget the final partial chunk
    launch_subprocess(chunk)
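
Fleshing that out, here is a minimal runnable sketch of the same model
using multiprocessing.Pool. The file name 'big_input.txt', the chunk
size and the process_chunk() worker are all placeholders - substitute
your own.

import multiprocessing

CHUNKSIZE = 50000                     # lines per chunk - tune for your data

def process_chunk(lines):
    # Placeholder worker: replace with your real per-chunk computation.
    return sum(len(line) for line in lines)

def read_chunks(path, chunksize=CHUNKSIZE):
    # Read the file sequentially, yielding one list of lines at a time.
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == chunksize:
                yield chunk
                chunk = []
    if chunk:                         # the last, partial chunk
        yield chunk

if __name__ == '__main__':
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    # imap() pulls chunks from the generator lazily, so only a few
    # chunks sit in memory at any one time.
    for result in pool.imap(process_chunk, read_chunks('big_input.txt')):
        print(result)
    pool.close()
    pool.join()

The if __name__ == '__main__' guard matters: on platforms where
multiprocessing spawns workers by re-importing the script, it stops the
workers from re-running the chunking loop themselves.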
I'd tend to go for big chunks - if you have a million lines in your file
I'd pick a chunksize of around 10,000-100,000 lines. If you go too small,
the overhead of starting the subprocess will swamp any gains you get.
Also remember the constraints of how many actual CPUs/cores you have:
too many tasks spread over too few CPUs will just cause more context
switching than useful work. With fewer than 4 cores it's probably not
worth the effort. Just maximise the efficiency of your algorithm -
which is probably worth doing first anyway.
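
A quick way to check both of those points (your_script.py below is just
a placeholder name):

import multiprocessing
print(multiprocessing.cpu_count())   # how many cores you have to spread work over

# And to see where the sequential version spends its time, profile it first:
#     python -m cProfile -s cumulative your_script.py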
HTH,
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/