On Tue, Mar 29, 2016 at 6:02 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> That may be easily due to differences between the CPUs and configuration.
> For example the Xeon uses a way older CPU with different amounts of CPU
> cache, and it's also a multi-socket system. And so on.
So, having searched past threads, I guess this was your Xeon E5450, which has a 12MB cache. I also see that you have an Intel Core i5-2500K processor, which has 6MB of L2 cache. This is mid-range hardware, and the two CPUs were discontinued in 2010 and 2013 respectively. Now, the i5 has the smaller cache, so if anything I'd expect it to do worse than the Xeon, not better.

But leaving that aside, I think there is an issue that we don't want to lose sight of, which is: in most of the regressions we were discussing today, perhaps the entire heap structure can fit in L2 cache. That would be true for stuff like int4 CREATE INDEX builds, where a significant fraction of memory is used for IndexTuples that most or all comparisons never actually have to read. And this is on a CPU that the manufacturer discontinued just over 5 years ago.

I think this is why the "padding" cases can make the patch look not much better, and occasionally worse, at the low end: those cases keep the number of memtuples as a fraction of work_mem very low, and so mask the problems with the replacement selection heap.

When Greg Stark benchmarked the patch at the low end to identify regressions, he did find some slight regressions at the lowest work_mem settings with many, many passes, but they were quite small [1]. Greg also did some good analysis of the performance characteristics of external sorting today [2], which I recommend reading if you missed it. It's possible that those regressions have since been fixed, because Greg did not apply/test the memory batching patch that became commit 0011c0091e886b as part of this work. It seems likely that the regression is at least partially fixed, and the patch might even be better than master overall now.

Anyway, what I liked about Greg's approach to finding regressions at the low end was that he tested on the cheapest possible VM available on Google's cloud platform. When testing the low end, he had low-end hardware to go with the low-end work_mem settings.
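To make the comparison concrete, here is a toy Python sketch of replacement selection, the run-building algorithm under discussion. This is an illustration of the general technique only, not PostgreSQL's actual C implementation in tuplesort.c; the function name and parameters are made up for the example. The key point is that every input record goes through a binary heap sized to fill work_mem, so once the heap outgrows L2 cache, each sift incurs scattered memory accesses, whereas quicksorting a whole in-memory run touches memory far more locally.

```python
import heapq

def replacement_selection(records, heap_size):
    """Produce sorted runs via replacement selection.

    On random input, expected run length is about 2 * heap_size,
    which is the classic argument for the algorithm; the downside
    is that every record is pushed through a heap that is sized to
    fill work_mem, with poor cache locality once it exceeds L2.
    """
    it = iter(records)

    # Fill the heap up to its budget (analogous to filling memtuples).
    heap = []
    for _ in range(heap_size):
        try:
            heap.append(next(it))
        except StopIteration:
            break
    heapq.heapify(heap)

    runs = []          # completed sorted runs
    current = []       # run currently being written out
    deferred = []      # records too small for the current run
    while heap:
        smallest = heapq.heappop(heap)
        current.append(smallest)
        try:
            nxt = next(it)
            if nxt >= smallest:
                heapq.heappush(heap, nxt)   # still fits in current run
            else:
                deferred.append(nxt)        # must wait for the next run
        except StopIteration:
            pass
        if not heap:
            # Current run is complete; start the next one from the
            # deferred records.
            runs.append(current)
            current = []
            heap = deferred
            deferred = []
            heapq.heapify(heap)
    if current:
        runs.append(current)
    return runs
```

With a heap budget of 3, the ten-element input [5, 1, 9, 2, 8, 3, 7, 0, 6, 4] comes out as two sorted runs of five records each, longer than the heap itself, which is exactly the property that made replacement selection attractive in the tape era.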
This gave the patch the benefit of using quicksort to make good use of what I assume is a far smaller L2 cache; certainly nothing like 6MB or 12MB. I think Greg may actually have used a home server for the tests in [1], but I understand that it too was suitably low-end.

It's perfectly valid to bring economics into this; typically, an external sort occurs only because memory isn't infinitely affordable, or because it isn't worth provisioning enough memory to be totally confident that every sort can be done internally. With external sorting, the constant factors are what researchers generally spend most of their time worrying about; Knuth spends a lot of time in TAOCP Volume III discussing how the characteristics of actual magnetic tape drives changed throughout the 1970s.

It's quite valid to ask whether anyone would actually want an 8MB work_mem setting on a machine that has 12MB of L2 cache, cache that an external sort gets all to itself. Is that a practical setup that anyone would really use?

[1] http://www.postgresql.org/message-id/CAM-w4HOwt0C7ZndowHUuraw+xi+BhY5a6J008XoSq=r9z7h...@mail.gmail.com
[2] http://www.postgresql.org/message-id/CAM-w4HM4XW3u5kVEuUrr+L+KX3WZ=5JKk0A=djjzypkb-hy...@mail.gmail.com
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers