On Thu, 24 Oct 2013 18:38:21 -0700, Victor Hooi wrote:

> Hi,
>
> We have a directory of large CSV files that we'd like to process in
> Python.
>
> We process each input CSV, then generate a corresponding output CSV
> file.
>
> input CSV -> munging text, lookups etc. -> output CSV
>
> My question is, what's the most Pythonic way of handling this? (Which
> I'm assuming
Start with the simplest thing that could work:

for infile, outfile in zip(input_list, output_list):
    for line in infile:
        munge line
        write to outfile

sort of thing. If and only if it isn't fast enough, then try to speed it
up.

> For the reading, I'd
>
> with open('input.csv', 'r') as input, open('output.csv', 'w') as output:
>     csv_writer = DictWriter(output)
>     for line in DictReader(input):
>         # Do some processing for that line...
>         output = process_line(line)
>         # Write output to file
>         csv_writer.writerow(output)

Looks good to me!

> So for the reading, it'll iterate over the lines one by one, and won't
> read it into memory, which is good.
>
> For the writing - my understanding is that it writes a line to the file
> object each loop iteration; however, this will only get flushed to disk
> every now and then, based on my system default buffer size, right?

Have you read the csv_writer documentation?

http://docs.python.org/2/library/csv.html#writer-objects

Unfortunately it is pretty light documentation, but it seems that
writer.writerow *probably* just calls write on the underlying file
object immediately. But there's really no way to tell when the data hits
the disk platter: the Python file object could be doing caching, the OS
could be doing caching, the file system could be doing caching, and even
the disk itself could be doing caching.

Really, the only way to be sure that the data has hit the disk platter
is to call os.sync(), and even then some hard drives lie and report that
they're synced when in fact the data is still in volatile cache. Bad
hard drive, no biscuit.

But, really, do you need to care? Better to buy better hard drives (e.g.
server grade), or use software RAID with two different brands (so their
failure characteristics will be different), or just be prepared to
re-process a batch of files if the power goes out mid-run. (You do have
a UPS, don't you?)

Anyway, I wouldn't bother about calling os.sync directly; the OS will
sync when it needs to. But if you need it, it's there. Or you can call
flush() on the output file, which is a bit less invasive than calling
os.sync. But really I wouldn't bother. Let the OS handle it.

> So if the output file is going to get large, there isn't anything I
> need to take into account for conserving memory?

I shouldn't think so.

> Also, if I'm trying to maximise throughput of the above, is there
> anything I could try? The processing in process_line is quite light -
> just a bunch of string splits and regexes.
>
> If I have multiple large CSV files to deal with, and I'm on a
> multi-core machine, is there anything else I can do to boost
> throughput?

Since this is likely to be I/O-bound, you could use threads. Each thread
is responsible for reading a file, processing it, then writing it back
again. Have you used threads before? If not, start here:

http://www.ibm.com/developerworks/aix/library/au-threadingpython/
http://pymotw.com/2/threading/

If the amount of processing required becomes heavier, and the task
becomes CPU-bound, you can either:

- move to an implementation of Python without the GIL, like IronPython
  or Jython;

- or use multiprocessing.


--
Steven
--
https://mail.python.org/mailman/listinfo/python-list
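For concreteness, here is a rough, untested sketch that combines the
simple per-file loop above with the one-thread-per-file idea. It assumes
Python 3, that process_line() takes and returns a dict (the stub below
just stands in for the real string splits and regexes), that the output
columns are the same as the input's, and hypothetical directory names
'input_csvs' and 'output_csvs':

import csv
import os
import threading


def process_line(row):
    # Stand-in for the real munging function: takes a dict produced by
    # DictReader and returns a dict suitable for DictWriter.
    return row


def process_file(in_path, out_path):
    # The simple read -> munge -> write loop, one file at a time.
    with open(in_path, newline='') as infile, \
            open(out_path, 'w', newline='') as outfile:
        reader = csv.DictReader(infile)
        # DictWriter needs the field names up front; here we assume the
        # output columns match the input columns.
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow(process_line(row))


def process_directory(in_dir, out_dir):
    # One thread per input CSV; reasonable for a handful of large files
    # when the work is mostly I/O-bound.
    threads = []
    for name in os.listdir(in_dir):
        if not name.endswith('.csv'):
            continue
        t = threading.Thread(target=process_file,
                             args=(os.path.join(in_dir, name),
                                   os.path.join(out_dir, name)))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()


if __name__ == '__main__':
    process_directory('input_csvs', 'output_csvs')

If the munging ever becomes heavy enough to be CPU-bound, the same
process_file() function could be driven by a multiprocessing pool
(e.g. Pool.starmap over the (in_path, out_path) pairs) instead of
threads, as suggested above.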