On 30/08/12 23:19, Abhishek Pratap wrote:

I am wondering how can I go about reading data from this at a faster
pace and then farm out the jobs to worker function using
multiprocessing module.

I can think of two ways.

1. split the file and read it in parallel (didn't work well for me),
primarily because I don't know how to read a file in parallel
efficiently.

Can you show us what you tried? It's always easier to give an answer to a concrete example than to a hypothetical scenario.

2. keep reading the file sequentially into a buffer of some size and
farm out chunks of the data through multiprocessing.

This is the model I've used. In pseudo code:

chunk = []
for line, data in enumerate(file, 1):
    chunk.append(data)
    if line % chunksize == 0:      # chunk is full, hand it off
        launch_subprocess(chunk)
        chunk = []
if chunk:                          # don't lose the final partial chunk
    launch_subprocess(chunk)

I'd tend to go for big chunks - if you have a million lines in your file I'd pick a chunksize of around 10,000-100,000 lines. If you go too small, the overhead of starting the subprocess will swamp any gains you get.

Also remember the constraints of how many actual CPUs/cores you have. Too many tasks spread over too few CPUs will just cause more swapping. With fewer than 4 cores it's probably not worth the effort - just maximise the efficiency of your algorithm, which is probably worth doing first anyway.
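If it helps, here is a minimal runnable sketch of that idea using multiprocessing.Pool. The file name, CHUNKSIZE and the process_chunk() worker are placeholders for illustration only - substitute your own worker logic:

from multiprocessing import Pool

CHUNKSIZE = 50000                  # lines per task - tune to your data

def process_chunk(chunk):
    # placeholder worker: do the real per-line work here
    return len(chunk)

def read_chunks(filename, size):
    # read the file sequentially, yielding lists of 'size' lines
    chunk = []
    with open(filename) as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == size:
                yield chunk
                chunk = []
    if chunk:                      # don't lose the final partial chunk
        yield chunk

if __name__ == '__main__':
    with Pool() as pool:           # defaults to one worker process per core
        results = pool.imap_unordered(process_chunk,
                                      read_chunks('big_file.txt', CHUNKSIZE))
        total = sum(results)
    print(total)

imap_unordered hands results back as workers finish; tune CHUNKSIZE as above so each task does enough work to pay for the process start-up and pickling overhead.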

HTH,
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
