On May 11, 8:04 am, Jaroslav Dobrek <jaroslav.dob...@gmail.com> wrote:
> Hello,
>
> I wrote the following code for using egrep on many large files:
>
> MY_DIR = '/my/path/to/dir'
> FILES = os.listdir(MY_DIR)
>
> def grep(regex):
>     i = 0
>     l = len(FILES)
>     output = []
>     while i < l:
>         command = "egrep " + '"' + regex + '" ' + MY_DIR + '/' + FILES[i]
>         result = subprocess.getoutput(command)
>         if result:
>             output.append(result)
>         i += 1
>     return output
>
> Yet, I don't think that the files are searched in parallel. Am I
> right? How can I search them in parallel?
subprocess.getoutput() blocks until the command has written all of its output, so no, the files are not searched in parallel. You really shouldn't use it anyway, as it's very difficult to use securely: your code, as it stands, could be exploited if the user can supply the regex or the directory.

There are plenty of tools for parallel execution in a shell, such as http://code.google.com/p/ppss/, and I would reach for one of those first. Nevertheless, if you must do it in Python, the most portable way to accomplish what you want is to:

0) Create a thread-safe queue object to hold the output.

1) Create each process using a subprocess.Popen object. Do this safely and securely, which means NOT passing shell=True (pass the argument list instead), leaving stdin at its default rather than attaching a pipe to it, and not capturing stderr unless you intend to read the error output.

2) Spawn a new thread for each process. That thread should block reading the Popen.stdout file object, and each time it reads some output it should write it to the queue. If you monitor stderr as well, you'll need two threads per subprocess. When EOF is reached, close the descriptor and call Popen.wait() to reap the process (this is trickier with two threads and requires additional synchronization).

3) After spawning the processes, monitor the queue in the first thread and collect all of the output.

4) Call the join() method on all of the threads to wait for them to finish. The easiest way to know when a worker is done is to have it write a special object (a sentinel) to the queue before it exits.

A sketch of this threaded approach follows at the end of this message.

If you don't mind platform-specific code (and it doesn't look like you do), you can instead use fcntl.fcntl to make each file object non-blocking and then use any of the various asynchronous I/O APIs to avoid threads altogether; a sketch of that variant is also included below. You still need to clean up all of the file objects and processes when you are done, though.
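Here is a minimal sketch of the threaded approach described in steps 0-4 above. It is not a drop-in implementation: MY_DIR is the hypothetical directory constant carried over from the original post, egrep is assumed to be on the PATH, and it targets Python 3 (queue, subprocess.DEVNULL).

import os
import queue
import subprocess
import threading

MY_DIR = '/my/path/to/dir'   # assumed directory, as in the original post
SENTINEL = object()          # marks "this worker is finished" on the queue

def _worker(proc, out_queue):
    # Block reading the child's stdout; push each line onto the shared queue.
    for line in proc.stdout:
        out_queue.put(line.rstrip('\n'))
    proc.stdout.close()
    proc.wait()              # reap the child once EOF is reached
    out_queue.put(SENTINEL)

def grep_parallel(regex):
    out_queue = queue.Queue()
    threads = []
    for name in os.listdir(MY_DIR):
        # Arguments are passed as a list: no shell=True, so the regex is
        # never interpreted by a shell.
        proc = subprocess.Popen(
            ['egrep', regex, os.path.join(MY_DIR, name)],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
        t = threading.Thread(target=_worker, args=(proc, out_queue))
        t.start()
        threads.append(t)

    # Collect output in the main thread until every worker has sent its
    # sentinel, then join the (already finished) threads.
    output = []
    remaining = len(threads)
    while remaining:
        item = out_queue.get()
        if item is SENTINEL:
            remaining -= 1
        else:
            output.append(item)
    for t in threads:
        t.join()
    return output

Note that this starts one process and one thread per file all at once; for a very large directory you would want to cap the number of concurrent workers.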
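And here is a rough, Unix-only sketch of the non-threaded variant: fcntl.fcntl puts each pipe into non-blocking mode and the selectors module multiplexes the reads. The same assumptions apply (MY_DIR, egrep on the PATH, Python 3), and output chunks from different files may interleave.

import fcntl
import os
import selectors
import subprocess

MY_DIR = '/my/path/to/dir'   # assumed directory, as in the original post

def grep_async(regex):
    sel = selectors.DefaultSelector()
    output = []

    for name in os.listdir(MY_DIR):
        proc = subprocess.Popen(
            ['egrep', regex, os.path.join(MY_DIR, name)],   # no shell=True
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )
        # Put the pipe into non-blocking mode so a read never stalls us.
        fd = proc.stdout.fileno()
        flags = fcntl.fcntl(fd, fcntl.F_GETFL)
        fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
        sel.register(proc.stdout, selectors.EVENT_READ, proc)

    # Poll until every pipe has reached EOF.
    while sel.get_map():
        for key, _ in sel.select():
            try:
                chunk = os.read(key.fd, 4096)
            except BlockingIOError:
                continue                      # spurious wakeup; try again
            if chunk:
                output.append(chunk.decode())
            else:                             # b'' means EOF on this pipe
                sel.unregister(key.fileobj)
                key.fileobj.close()
                key.data.wait()               # reap the finished process
    sel.close()
    return output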