On Sun, 11 Nov 2007 08:25:15 -0800, Michael Bacarella wrote:

> Firstly, thank you for all of your help so far, I really appreciate it.
>
>>> So, you think Python's dict implementation degrades towards O(N)
>>> performance when it's fed millions of 64-bit pseudo-random longs?
>>
>> No.
>
> Yes.
>
> I tried your code (with one change, time on feedback lines) and got the
> same terrible performance against my data set.
Hmmm... you're getting even worse performance than I do. I read the dict
in pretty quickly considering the limited memory on my system, and only
run into trouble when I try deleting the dict. But still... no, Python
dicts shouldn't degrade that badly. Whatever is going on, it almost
certainly isn't a problem with the implementation of dicts.

Here's a thought... try reading the data into one big list of key-value
pairs instead of a dict. If you still get terrible performance, I think
that pretty much eliminates the dict as the culprit. (A rough sketch of
what I mean is appended after my sig.)

> To prove that my machine is sane, I ran the same against your generated
> sample file and got _excellent_ performance. Start to finish in under a
> minute.

Not a fair comparison, because it is doing something completely
different. But it is still valuable to see that your install isn't
*completely* broken.

>> Notice that the dict is completely read into memory in just two and a
>> half minutes. The script then tries to delete the dict, and 32 minutes
>> later it is still struggling. That's the point at which I got sick of
>> waiting and interrupted the script.
>>
>> Conclusion: it's a memory issue, or maybe a garbage collection issue,
>> not a problem with dicts.

> As you can see, not the case at all against my data set.

Yes, I see now that your problems are worse than mine.

>> (1) Presumably you don't want to run your app with the garbage
>> collector turned off. You might still want to play around with the gc
>> module to see what you can learn.

> As you can see, your version did no better. :(

Somewhat better, but still struggling. (The sort of gc experiment I had
in mind is also sketched after my sig.)

>> (2) More memory will help avoid paging. If you can't get more memory,
>> try more virtual memory. It will still be slow, but at least the
>> operating system doesn't have to try moving blocks around as much.

> The machine has 8GB, and is not doing anything else when I run this.

Yes, fair enough.

>> (3) Are you sure you need all eight-million-plus items in the cache
>> all at once?

> Yes.

I remain skeptical, but what do I know? I don't even know what you're
doing with the data once you have it. :-)

>> (4) There are lots of algorithms out there for dealing with data too
>> big to fit into main memory. Do some research.

> It DOES fit into main memory, and a dictionary is exactly the right way
> to do this.

I agree now. I think that you and I are experiencing anomalous results.
I'm not even certain that it is specifically a Python problem; I'd be a
lot happier if somebody got the same terrible performance on a different
OS.

What do we have in common? We're using different versions of Python
(you 2.3, me 2.5) and different versions of Linux. I wonder if we happen
to have the same hardware... what's your CPU?

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 107
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
stepping        : 1
cpu MHz         : 1000.000
cache size      : 512 KB
...
processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 107
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
stepping        : 1
cpu MHz         : 1000.000
cache size      : 512 KB
...

-- 
Steven.
-- 
http://mail.python.org/mailman/listinfo/python-list
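
Here is the rough, untested sketch I mentioned for the list-of-pairs
experiment. I'm guessing at the file format (one "key:value" pair of
integers per line) and at the filename, so treat both as placeholders
and adapt the parsing to whatever your real loader does:

# Untested sketch, not your actual loader: read the same data into a big
# list of (key, value) tuples instead of a dict, to see whether the dict
# itself is to blame.  Assumes, purely for illustration, a text file with
# one "key:value" pair of integers per line; swap in your real parsing.
import time

def load_pairs(filename):
    start = time.time()
    pairs = []
    append = pairs.append              # skip repeated attribute lookups
    for i, line in enumerate(open(filename)):
        key, value = line.split(':')
        append((long(key), long(value)))
        if i % 1000000 == 0:
            print i, time.time() - start
    print 'loaded %d pairs in %.1f seconds' % (len(pairs),
                                               time.time() - start)
    return pairs

if __name__ == '__main__':
    pairs = load_pairs('data.txt')     # hypothetical filename
    start = time.time()
    del pairs                          # the step that hurts with the dict
    print 'deletion took %.1f seconds' % (time.time() - start)

If the load and the deletion are both fast with a plain list, the dict
really is involved somehow; if they are just as slow, it's the sheer
volume of objects (or the memory system) and not the dict at all.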
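
And this is the kind of thing I meant by playing with the gc module:
keep the cyclic collector out of the way during the bulk load, then turn
it back on and time an explicit collection. Again, the file format and
filename are just placeholders:

# Untested sketch of the gc experiment: disable the cyclic garbage
# collector while bulk-loading the dict, then re-enable it and time one
# explicit collection.  Same hypothetical "key:value" format as above.
import gc
import time

def load_dict(filename):
    d = {}
    for line in open(filename):
        key, value = line.split(':')
        d[long(key)] = long(value)
    return d

if __name__ == '__main__':
    gc.disable()                      # no cyclic collection during the load
    start = time.time()
    cache = load_dict('data.txt')     # hypothetical filename
    print 'load took %.1f seconds (%d items)' % (time.time() - start,
                                                 len(cache))
    gc.enable()

    start = time.time()
    gc.collect()                      # how expensive is one full collection?
    print 'gc.collect() took %.1f seconds' % (time.time() - start)

As far as I know, a dict mapping longs to longs can't contain reference
cycles, so if disabling the collector changes your numbers dramatically
that in itself tells us something interesting about where the time goes.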