On Thu, 24 Oct 2013 18:38:21 -0700, Victor Hooi wrote:

> Hi,
>
> We have a directory of large CSV files that we'd like to process in
> Python.
>
> We process each input CSV, then generate a corresponding output CSV
> file.
>
> input CSV -> munging text, lookups etc. -> output CSV
>
> My question is, what's the most Pythonic way of handling this? (Which
> I'm assuming
>
> For the reading, I'd
>
> with open('input.csv', 'r') as input, open('output.csv', 'w') as output:
>     csv_writer = DictWriter(output)
>     for line in DictReader(input):
>         # Do some processing for that line...
>         output = process_line(line)
>         # Write output to file
>         csv_writer.writerow(output)
>
> So for the reading, it'll iterate over the lines one by one, and won't
> read the whole file into memory, which is good.
>
> For the writing - my understanding is that it writes a line to the file
> object each loop iteration, however, this will only get flushed to disk
> every now and then, based on my system default buffer size, right?
>
> So if the output file is going to get large, there isn't anything I need
> to take into account for conserving memory?
>
> Also, if I'm trying to maximise throughput of the above, is there
> anything I could try? The processing in process_line is quite light -
> just a bunch of string splits and regexes.
>
> If I have multiple large CSV files to deal with, and I'm on a multi-core
> machine, is there anything else I can do to boost throughput?
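A couple of small things about the loop itself: DictWriter needs a
fieldnames argument, and you're reusing the name "output" for both the
output file object and the processed row, which works but is confusing to
read. On the memory question: the writer only ever holds a buffer's worth
of data before it hits disk, so the size of the output file doesn't matter.

Here is roughly the shape I'd use - just a sketch, assuming Python 3, that
process_line returns a dict keyed by your output columns, and with made-up
column names. Since the files are independent of each other, a
multiprocessing.Pool lets you use the other cores by giving each worker a
whole file:

import csv
import multiprocessing
import os
import sys

# Made-up output columns -- replace with whatever process_line produces.
OUTPUT_FIELDS = ['col_a', 'col_b']

def process_line(row):
    # Placeholder for your string splits / regexes; it must return a
    # dict keyed by OUTPUT_FIELDS.  Input column names are invented too.
    return {'col_a': row['input_col_1'].strip(),
            'col_b': row['input_col_2'].lower()}

def process_file(in_path):
    out_path = os.path.splitext(in_path)[0] + '.out.csv'
    # newline='' is what the csv docs recommend for reading and writing
    with open(in_path, newline='') as infile, \
         open(out_path, 'w', newline='') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=OUTPUT_FIELDS)
        writer.writeheader()
        for row in csv.DictReader(infile):
            writer.writerow(process_line(row))   # one row in, one row out
    return out_path

if __name__ == '__main__':
    # One worker per core; pass the input CSVs on the command line.
    with multiprocessing.Pool() as pool:
        for done in pool.imap_unordered(process_file, sys.argv[1:]):
            print('finished', done)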
I'm guessing, though, that the idea is to load the output CSV into a database. If that's the case, why not load the input CSV into some kind of staging table in the database first, and do the processing there?
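I don't know which database you're targeting, so here's just the shape of
that idea using sqlite3 from the standard library - table and column names
are invented, and the trim()/upper() calls merely stand in for your real
munging. A real server's bulk loader (e.g. PostgreSQL's COPY or MySQL's
LOAD DATA) would be faster still for large files:

import csv
import sqlite3

conn = sqlite3.connect('staging.db')
conn.execute('CREATE TABLE IF NOT EXISTS staging (raw_a TEXT, raw_b TEXT)')

# Load the raw rows straight into the staging table.
with open('input.csv', newline='') as infile:
    reader = csv.reader(infile)
    next(reader)                     # skip the header row
    conn.executemany('INSERT INTO staging VALUES (?, ?)', reader)
conn.commit()

# Do the munging in SQL where it's simple enough, then dump back to CSV.
with open('output.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['a_trimmed', 'b_upper'])
    writer.writerows(
        conn.execute('SELECT trim(raw_a), upper(raw_b) FROM staging'))
conn.close()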