In article <roy-103d43.15470305012...@news.panix.com>, Roy Smith <r...@panix.com> wrote:
> It's rare to find applications these days that are truly CPU bound.
> Once you've used some reasonable algorithm, i.e. not done anything in
> O(n^2) that could have been done in O(n) or O(n log n), you will more
> often run up against I/O speed, database speed, network latency, memory
> exhaustion, or some such as the reason your code is too slow.

Well, I just found a counter-example :-)

I've been doing some log analysis. It's been taking a grovelingly long time, so I decided to fire up the profiler and see what's taking so long. I had a pretty good idea of where the ONLY TWO POSSIBLE hotspots might be (looking up IP addresses in the geolocation database, or producing some pretty pictures using matplotlib). It was just a matter of figuring out which it was.

As with most attempts to out-guess the profiler, I was totally, absolutely, and embarrassingly wrong.

It turns out we were spending most of the time parsing timestamps! Since there's no convenient way (I don't consider strptime() to be convenient) to parse isoformat strings in the standard library, our habit has been to use the oh-so-simple parser from the third-party dateutil package. Well, it turns out that's slow as all get-out (probably because it's trying to be smart about auto-recognizing formats). For the test I ran (on a few percent of the real data), we spent 90 seconds in parse().

OK, so I dragged out the strptime() docs and built the stupid format string (%Y-%m-%dT%H:%M:%S+00:00). That got us down to 25 seconds in strptime().

But, I could also see it was spending a significant amount in routines that looked like they were computing things like day of the week that we didn't need. For what I was doing, we only really needed the hour and minute. So I tried:

    t_hour = int(date[11:13])
    t_minute = int(date[14:16])

That got us down to 12 seconds overall (including the geolocation and pretty pictures).

I think it turns out we never do anything with the hour and minute other than print them back out, so just

    t_hour_minute = date[11:16]

would probably be good enough, but I think I'm going to stop where I am and declare victory :-)
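
For anyone who wants to try the comparison themselves, here is a minimal sketch of the three stdlib-only approaches side by side. It is not the actual log-analysis code: the sample timestamp, the function names, and the use of datetime.strptime() (rather than time.strptime()) are all assumptions made for illustration, and the slicing only works if every timestamp has exactly this fixed-width layout. The timings it prints depend on the machine and won't match the numbers above.

```python
import timeit
from datetime import datetime

# Hypothetical sample value; assumes a fixed-width UTC isoformat timestamp.
SAMPLE = "2012-01-03T15:47:03+00:00"
FORMAT = "%Y-%m-%dT%H:%M:%S+00:00"


def hour_minute_strptime(date):
    """Parse the whole timestamp, then keep only the hour and minute."""
    parsed = datetime.strptime(date, FORMAT)
    return parsed.hour, parsed.minute


def hour_minute_slice(date):
    """Slice at fixed offsets; the rest of the string is not validated."""
    return int(date[11:13]), int(date[14:16])


def hour_minute_text(date):
    """Return the raw 'HH:MM' substring, enough if it is only printed."""
    return date[11:16]


if __name__ == "__main__":
    # Rough micro-benchmark; absolute numbers vary from machine to machine.
    for func in (hour_minute_strptime, hour_minute_slice, hour_minute_text):
        seconds = timeit.timeit(lambda: func(SAMPLE), number=100000)
        print("%s: %.3fs" % (func.__name__, seconds))
```

The trade-off is that the slicing versions will happily return garbage for a malformed line, where strptime() would raise ValueError, so they only make sense when the log format is trusted.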