On Wed, Sep 16, 2015 at 7:27 PM, Victor Hooi <victorh...@gmail.com> wrote:
> Also, I originally used grouper because I thought it better to process lines
> in batches, rather than individually. However, is there actually any
> throughput advantage from doing it this way in Python? Or is there a better
> way of getting better throughput?
>
I very much doubt it'll improve throughput; what you're doing there is
reading individual lines, then batching them up into blocks of 1000,
and then stepping through the batches. In terms of disk read
performance, you're already covered, because the file object should be
buffered; if you're not doing much actual work in Python, that's
probably where your bottleneck is.

But keep in mind the basic rules of performance optimization:

1) Don't.
2) For experts only: Don't yet.
3) Measure first.

If you remember only the first rule, you're going to be correct most
of the time. Write your code to be idiomatic and clean, and *don't
worry* about performance.

The second rule comes into play once you have a fully working program,
and you find that it's running too slowly. (For example, you run "cat
filename >/dev/null" and it takes half a second, but you run your
program on the same input file and it takes half a day.) Okay, so you
know your program needs some work. But which parts of it are actually
taking the time? If you just stare at your code and make a guess, *you
will be wrong*.

So you follow the third rule: Add a boatload of timing marks to the
code. They'll slow it down, of course, but you'll usually find that
large slabs of the code are so fast you can't even measure the time
they're taking, so there's no point optimizing them in any way.

Only once you've proven (a) that your program is "too slow" (for some
measure of "slow"), and (b) that it's _this part_ that's taking the
bulk of the time, *then* you can start improving performance.

So get rid of the grouper; it's violating all three rules. Give the
program a try without it, and see if you actually have a problem at
all. Maybe you don't!

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
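
For what it's worth, here is a minimal sketch of what "timing marks"
might look like, assuming Python 3.3+ (for time.perf_counter). The
timed() helper, the process() function, and the in-memory sample input
are all made up for illustration; swap in your real file object and
your real per-line work.

```python
import io
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per labelled section.
# (Hypothetical helper for illustration, not part of any library.)
timings = defaultdict(float)


@contextmanager
def timed(label):
    """Add the time spent inside the with-block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] += time.perf_counter() - start


def process(line):
    # Stand-in for the real per-line work (assumption).
    return line.upper()


def run(lines):
    """Process every line, timing the read side and the work side separately."""
    results = []
    it = iter(lines)
    while True:
        with timed("read"):
            line = next(it, None)  # None signals end of input
        if line is None:
            break
        with timed("process"):
            results.append(process(line))
    return results


# In-memory stand-in for a real (buffered) file object.
sample = io.StringIO("alpha\nbeta\ngamma\n")
results = run(sample)

# Report where the time actually went.
for label, seconds in sorted(timings.items()):
    print("{:<10} {:.6f}s".format(label, seconds))
```

If one label dwarfs the others, that's the part worth looking at; if
they're all tiny, there's nothing to optimize. The standard library's
profiler ("python -m cProfile -s cumtime yourscript.py") gives a
similar breakdown per function without hand-placed marks.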