On Mon, Apr 22, 2013 at 9:04 PM, Florian Pflug <f...@phlo.org> wrote:
> The one downside of the fnv1+shift approach is that it's built around
> the assumption that processing 64 bytes at once is the sweet spot. That
> might be true for x86 and x86_64 today, but it won't stay that way for
> long, and quite surely isn't true for other architectures. That doesn't
> necessarily rule it out, but it certainly weakens the argument that
> slipping it into 9.3 avoids having to change the algorithm later...
It's actually 128 bytes, as tested. The ideal shape depends on multiplication latency, multiplication throughput, and the number of registers available. Specifically, the throughput-bound time needs to be larger than the latency-bound time, so that the parallel sums can keep the multiplier busy:

    BLCKSZ / mul_throughput_in_bytes  >  BLCKSZ / (N_SUMS * sizeof(uint32)) * (mul_latency + 2 * xor_latency)

For the latest Intel processors the values are 8192/16 = 512 and 8192/(32*4)*(5 + 2*1) = 448. 128 bytes is also 8 registers' worth of state, the highest power of two that fits into the 16 architectural registers. This means that the value chosen is indeed the sweet spot for x86 today.

For future processors we can expect the multiplication width to increase, and possibly the latency too, shifting the sweet spot to larger widths. In fact, Haswell, coming out later this year, should have AVX2 instructions that introduce integer ops on 256-bit registers, making the current choice already suboptimal. All that said, having a lower width won't make the algorithm slower on future processors; it will just leave some parallelism on the table that could be used to make it even faster. The line in the sand needed to be drawn somewhere; I chose the maximum comfortable width today, fearing that even that would be shot down based on code size. Coincidentally, 32 elements is also the internal parallelism that GPUs have settled on.

We could bump the width up by one notch to buy some future safety, but after that I'm skeptical we will see any conventional processors that would benefit from a higher width. I just tested that the auto-vectorized version runs at basically identical speed: GCC's inability to do good register allocation means that it juggles values between registers and the stack one way or the other.

So to recap: I don't know of any CPUs where a lower value would be better. Raising the width by one notch would mean better performance on future processors, but raising it further would just bloat the size of the inner loop without much benefit in sight.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
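PS: To make the shape of the loop concrete, here is a rough sketch of the kind of inner loop I'm describing, not the submitted patch itself; N_SUMS, the FNV prime, the lane seeds and the shift amount are illustrative placeholders. The point is that each outer iteration consumes 32 * sizeof(uint32) = 128 bytes of the 8192-byte block, and each element costs one multiply and two xors on the dependency chain, which is where the cycle estimates above come from.

#include <stdint.h>
#include <stddef.h>

#define BLCKSZ      8192
#define N_SUMS      32            /* 32 * sizeof(uint32) = 128 bytes per outer iteration */
#define FNV_PRIME   16777619U     /* standard 32-bit FNV prime; illustrative here */

/* One multiply plus two xors on the per-lane dependency chain; the shift
 * depends only on the pre-multiply value, so it runs in parallel with the
 * multiply. The shift amount is a placeholder. */
#define CHECKSUM_COMP(checksum, value) \
do { \
	uint32_t tmp = (checksum) ^ (value); \
	(checksum) = tmp * FNV_PRIME ^ (tmp >> 17); \
} while (0)

uint32_t
block_checksum_sketch(const void *page)
{
	const uint32_t (*dataArr)[N_SUMS] = page;
	uint32_t	sums[N_SUMS];
	uint32_t	result = 0;
	size_t		i, j;

	/* Seed the lanes; the real patch would use fixed per-lane constants. */
	for (j = 0; j < N_SUMS; j++)
		sums[j] = (uint32_t) (j + 1);

	/* Main loop: 64 iterations, each touching 128 bytes across 32
	 * independent lanes, so the multiplier stays saturated. */
	for (i = 0; i < BLCKSZ / (sizeof(uint32_t) * N_SUMS); i++)
		for (j = 0; j < N_SUMS; j++)
			CHECKSUM_COMP(sums[j], dataArr[i][j]);

	/* Fold the 32 lane sums into a single 32-bit result. */
	for (j = 0; j < N_SUMS; j++)
		CHECKSUM_COMP(result, sums[j]);

	return result;
}

With N_SUMS = 32 the whole sums[] array can live in eight 16-byte registers, which is the register budget discussed above; the outer loop count drops to 64, and 64 * (5 + 2) = 448 cycles of dependency chain fits under the 512 cycles the multiplier needs for the whole block.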