On 30/08/12 23:19, Abhishek Pratap wrote:

I am wondering how can I go about reading data from this at a faster
pace and then farm out the jobs to worker function using
multiprocessing module.

I can think of two ways.

1. split the file and read it in parallel (didn't work well for me),
primarily because I don't know how to read a file in parallel
efficiently.

Can you show us what you tried? It's always easier to give an answer to a concrete example than to a hypothetical scenario.

2. keep reading the file sequentially into a buffer of some size and
farm out chunks of the data through multiprocessing.

This is the model I've used. In pseudo code:

chunk = []
for line, data in enumerate(file, 1):
    chunk.append(data)
    if line % chunksize == 0:      # chunk is full, hand it off
        launch_subprocess(chunk)
        chunk = []
if chunk:                          # don't lose the final partial chunk
    launch_subprocess(chunk)

I'd tend to go for big chunks - if you have a million lines in your file I'd pick a chunksize of around 10,000-100,000 lines. If you go too small, the overhead of starting the subprocess will swamp any gains you get.

Also remember the constraints of how many actual CPUs/cores you have. Too many tasks spread over too few CPUs will just cause more swapping. With fewer than 4 cores it's probably not worth the effort - just maximise the efficiency of your algorithm, which is probably worth doing first anyway.
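If it helps, here is a minimal runnable sketch of that idea using multiprocessing.Pool. The file name, CHUNKSIZE and the process_chunk() worker are placeholders for illustration only - substitute your own worker logic:

from multiprocessing import Pool

CHUNKSIZE = 50000                  # lines per task - tune to your data

def process_chunk(chunk):
    # placeholder worker: do the real per-line work here
    return len(chunk)

def read_chunks(filename, size):
    # read the file sequentially, yielding lists of 'size' lines
    chunk = []
    with open(filename) as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == size:
                yield chunk
                chunk = []
    if chunk:                      # don't lose the final partial chunk
        yield chunk

if __name__ == '__main__':
    with Pool() as pool:           # defaults to one worker process per core
        results = pool.imap_unordered(process_chunk,
                                      read_chunks('big_file.txt', CHUNKSIZE))
        total = sum(results)
    print(total)

imap_unordered hands results back as workers finish; tune CHUNKSIZE as above so each task does enough work to pay for the process start-up and pickling overhead.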

HTH,
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
