On Sun, Jun 15, 2014 at 02:58:03AM -0400, George Spelvin wrote:
> Actually, could you hold off merging that for a bit? Or if you really
> want to, add a comment that the rotate constants are preliminary and
> will be further optimized.
>
> I posted that as WIP, just as evidence that something with better
> avalanche could be as fast or faster, and I'd like to go back and
On Sat, Jun 14, 2014 at 08:23:33PM -0400, George Spelvin wrote:
> The example I posted:
>
> // (29/66353) score = 49/121/123: 6 27 16 14
>
> a += b;           c += d;
> b = rol32(a, 6);  d = rol32(c, 27);
> d ^= a;           b ^= c;
>
> a += b;
> I'll need to do a bit more looking to convince myself that 2
> iterations is better from a mixing perspective, but this looks like it
> might be a promising replacement for the 32-bit mixer.
This was tested with a hacked-up copy of Bob Jenkins' testing code.
It tests a few random starting values
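For readers without that code handy, the general shape of such an avalanche
test is easy to sketch: pick a few random starting states, flip each of the
128 input bits in turn, run the mixer on both states, and count how many
output bits differ (ideally about 64 of 128). The program below is a
hypothetical stand-alone illustration, not George's actual harness, and
mix_under_test() is just a placeholder half-round, not any of the candidates
being scored:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static inline uint32_t rol32(uint32_t w, unsigned s)
{
	return (w << s) | (w >> (32 - s));
}

/* Placeholder mixer; swap in the candidate being scored. */
static void mix_under_test(uint32_t p[4])
{
	p[0] += p[1];            p[2] += p[3];
	p[1] = rol32(p[1], 6);   p[3] = rol32(p[3], 27);
	p[3] ^= p[0];            p[1] ^= p[2];
}

/* How many output bits change when one input bit is flipped? */
static int avalanche(const uint32_t start[4], int bit)
{
	uint32_t x[4], y[4];
	int i, flipped = 0;

	for (i = 0; i < 4; i++)
		x[i] = y[i] = start[i];
	y[bit / 32] ^= 1u << (bit % 32);

	mix_under_test(x);
	mix_under_test(y);

	for (i = 0; i < 4; i++)
		flipped += __builtin_popcount(x[i] ^ y[i]);
	return flipped;
}

int main(void)
{
	int trial, bit, w, n, min = 128, max = 0;
	long total = 0;

	srand(1);
	for (trial = 0; trial < 16; trial++) {	/* a few random starting values */
		uint32_t start[4];

		for (w = 0; w < 4; w++)
			start[w] = (uint32_t)rand() << 16 ^ (uint32_t)rand();
		for (bit = 0; bit < 128; bit++) {
			n = avalanche(start, bit);
			total += n;
			if (n < min) min = n;
			if (n > max) max = n;
		}
	}
	/* Ideal avalanche is ~64 of 128 output bits flipped on average. */
	printf("min %d  avg %.1f  max %d (of 128 output bits)\n",
	       min, total / (16.0 * 128), max);
	return 0;
}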
OK, using your averaging scheme, on a 32-bit KVM kernel, running at
idle, here are my quick results:
Original
48028 51419
46021 50065
44750 49231
Fastmix 2, 3 iterations
95956 58313
97295 57599
97242 56942
Fastmix 2, 2 iterations
68998 41496
68940 41471
68619 41576
Fastmix 2, 2 iterations
Unfortunately, my test suite is not particularly lightweight.
Here are some figures (idle at full speed, and doing a parallel
kernel compile) for a 2.5 GHz Phenom.
Quite different numbers compared to the Ivy Bridge. When the processor is
busy, fast_mix takes longer, instead. And even the rolled
And I have of course embarrassed myself publicly by getting the sign
wrong. That's what I get for posting *before* booting the result.
You may now point and bray like a donkey. :-)
Anyway, the following actually works:

#if ADD_INTERRUPT_BENCH
static unsigned long avg_cycles, avg_deviation;

#define AVG_SHIFT 8	/* Exponential average factor k=1/256 */
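The snippet breaks off there. For readers following along, the rest of the
hook plausibly continues along these lines (a reconstruction based on the
declarations above, not necessarily George's exact posted text), using
random_get_entropy() as the cycle-counter source and applying the same 1/256
exponential smoothing to the absolute deviation:

#define FIXED_1_2 (1 << (AVG_SHIFT-1))

static void add_interrupt_bench(cycles_t start)
{
	long delta = random_get_entropy() - start;

	/* Weighted moving average: avg += delta - avg/2^AVG_SHIFT (rounded) */
	delta = delta - ((avg_cycles + FIXED_1_2) >> AVG_SHIFT);
	avg_cycles += delta;
	/* Track the average deviation with the same smoothing */
	delta = abs(delta) - ((avg_deviation + FIXED_1_2) >> AVG_SHIFT);
	avg_deviation += delta;
}
#else
#define add_interrupt_bench(x)
#endif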
> Want to give this patch a try on your collection of machines?
Not until I fix it to
+#if ADD_INTERRUPT_BENCH
+static unsigned long avg_cycles;
+
+#define AVG_SHIFT 8	/* Exponential average factor k=1/256 */
+#define FIXED_1_2 (1 << (AVG_SHIFT-1))
+
+static void add_interrupt_bench(cycles_t
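One note on reading the resulting numbers: assuming an update rule of the
form sketched earlier (avg += sample - ((avg + FIXED_1_2) >> AVG_SHIFT)),
avg settles near the mean scaled by 2^AVG_SHIFT, so the smoothed
per-interrupt cycle count is recovered by shifting it back down. A throwaway
userspace toy, not part of the patch, showing the arithmetic converge:

#include <stdio.h>

#define AVG_SHIFT 8			/* k = 1/256 */
#define FIXED_1_2 (1 << (AVG_SHIFT-1))	/* rounding term for the shift */

int main(void)
{
	unsigned long avg = 0;
	long sample = 200, delta;	/* pretend every interrupt costs 200 cycles */
	int i;

	for (i = 0; i < 10000; i++) {
		delta = sample - (long)((avg + FIXED_1_2) >> AVG_SHIFT);
		avg += delta;
	}
	/* avg ends up near 200 << AVG_SHIFT; shifting recovers ~200. */
	printf("avg = %lu, smoothed = %lu\n",
	       avg, (avg + FIXED_1_2) >> AVG_SHIFT);
	return 0;
}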
On Sat, Jun 14, 2014 at 01:25:27AM -0400, George Spelvin wrote:
> > When I did a quick comparison of your 64-bit fast_mix2 variant, it's
> > much slower than either the 32-bit fast_mix2, or the original fast_mix
> > algorithm.
>
> That is f***ing *bizarre*. For me, it's *significantly* faster.
>
OK, so I normally do my testing using 32-bit kernels under KVM. On an
i386 kernel, running under kvm (again using a Lenovo T540 with an
i7-4900MQ CPU), with the original fast_mix, I get a timestamp count of
166 cycles using a weighted smoothed average. Using your fast_mix2 I
get around 450 cycles.
> At least for Intel, between its branch predictor and speculative
> execution engine, it doesn't make a difference.
*Sigh*. We need live measurement. My testing (in your test
harness!) showed a noticeable (~10%) speedup.
> When I did a quick comparison of your 64-bit fast_mix2 variant, it's
> much slower than either the 32-bit fast_mix2, or the original fast_mix
> algorithm.
On Fri, Jun 13, 2014 at 10:10:14PM -0400, George Spelvin wrote:
> > Unrolling doesn't make much difference; which isn't surprising given
> > that almost all of the differences go away when I commented out the
> > udelay(). Basically, at this point what we're primarily measuring is
> > how good various CPU's caches work, especially across context switches
> > where other
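Since "rolled" vs. "unrolled" keeps coming up: the loop-based candidates run
a short round a fixed number of times, and unrolling just writes those passes
out straight-line so there is no loop branch for the predictor to track. A
hypothetical sketch of the two shapes (the ROUND body, rotate constants, and
pass count here are illustrative assumptions, not any specific posted
candidate):

#include <stdint.h>

static inline uint32_t rol32(uint32_t w, unsigned s)
{
	return (w << s) | (w >> (32 - s));
}

/* One illustrative add/rotate/XOR round over a 4-word pool. */
#define ROUND(p, r1, r2) do {					\
		(p)[0] += (p)[1];      (p)[2] += (p)[3];	\
		(p)[1] = rol32((p)[1], r1) ^ (p)[2];		\
		(p)[3] = rol32((p)[3], r2) ^ (p)[0];		\
	} while (0)

/* Rolled: small I-cache footprint, one loop branch per pass. */
static void mix_rolled(uint32_t p[4], int passes)
{
	int i;

	for (i = 0; i < passes; i++) {
		ROUND(p, 6, 27);
		ROUND(p, 16, 14);
	}
}

/* Unrolled: the same work written straight-line (two passes here). */
static void mix_unrolled(uint32_t p[4])
{
	ROUND(p, 6, 27);
	ROUND(p, 16, 14);
	ROUND(p, 6, 27);
	ROUND(p, 16, 14);
}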
On Thu, Jun 12, 2014 at 08:23:04PM -0400, George Spelvin wrote:
> Another cache we might be hitting is the branch predictor. Could you try
> unrolling fast_mix2 and fast_mix4 and see what difference that makes?
> (I'd send you a patch but you could probably do it by hand faster than
> applying one.)
So I just tried your modified 32-bit mixing function where you moved the
rotation to the middle step instead of the last step. With the
usleep(), it doesn't make any difference:
# schedtool -R -p 1 -e /tmp/fast_mix2_48
fast_mix: 212 fast_mix2: 400 fast_mix3: 400
fast_mix: 208 fast_mix2: 408 fast_
One of the reasons for the timing instability is because of the
usleep() calls. This allows some other process to schedule, and thus
disrupts the I-cache and uop caches. With the usleep() calls:
i7-4900MQ# schedtool -R -p 1 -e /tmp/fast_mix2_49
fast_mix: 213  fast_mix2: 280
fast_mix: 336
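For context on how numbers like these are produced: the test binary is run
under schedtool -R -p 1 (SCHED_RR priority 1, so ordinary tasks cannot
preempt the measurement), each candidate is timed with the TSC over a burst
of calls, and the usleep() between bursts is what lets another task run and
disturb the I-cache and uop cache. A rough, hypothetical sketch of such a
harness, not the actual test program:

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>		/* __rdtsc() */

static uint32_t pool[4];

/* Stand-in for whichever fast_mix variant is being measured. */
static void mix_candidate(uint32_t p[4])
{
	p[0] += p[1];  p[1] = (p[1] << 6 | p[1] >> 26) ^ p[2];
	p[2] += p[3];  p[3] = (p[3] << 27 | p[3] >> 5) ^ p[0];
}

int main(void)
{
	int run, i;

	for (run = 0; run < 10; run++) {
		uint64_t t0 = __rdtsc();

		for (i = 0; i < 100; i++)
			mix_candidate(pool);

		uint64_t t1 = __rdtsc();

		printf("mix_candidate: %llu cycles / 100 calls\n",
		       (unsigned long long)(t1 - t0));

		/* Let something else run; this is what perturbs the
		   I-cache and uop cache between measurements. */
		usleep(10000);
	}

	printf("%u\n", pool[0]);	/* keep the work observable */
	return 0;
}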
Ted, just as an example of a possible tweak, the following change seems
to slightly improve speed without reducing diffusion. (It now looks a
bit more like the Skein mix() function.)
I'll look for even more efficient kernels. It appears that the
initial XOR is costing a lot; 8 memory fetches.
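For readers who do not have Skein handy: its MIX step takes a pair of words
to (x0 + x1, rotl(x1, R) ^ (x0 + x1)), that is add, rotate, then XOR, the
"rotation in the middle" ordering discussed earlier in the thread. A minimal
32-bit sketch (Skein itself works on 64-bit words, and the rotate constant
here is just a parameter, not one of the tuned constants):

#include <stdint.h>

static inline uint32_t rol32(uint32_t w, unsigned s)
{
	return (w << s) | (w >> (32 - s));
}

/*
 * Skein-style MIX on 32-bit words:
 *	x0' = x0 + x1
 *	x1' = rol(x1, r) ^ x0'
 * Both halves are invertible, so the step shuffles pool bits around
 * without ever destroying them.
 */
static inline void mix_step(uint32_t *x0, uint32_t *x1, unsigned r)
{
	*x0 += *x1;
	*x1 = rol32(*x1, r) ^ *x0;
}

The fast_mix candidates in this thread look like two such steps run side by
side, with the closing XOR crossed over to the opposite lane so the two
halves of the pool mix into each other.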
> I redid my numbers, and I can no longer reproduce the 7x slowdown. I
> do see that if you compile w/o -O2, fast_mix2 is twice as slow. But
> it's not 7x slower.
For my single-round, I needed to drop to 2 loops rather than 3 to match
the speed. That's in the source I posted, but I didn't point it out.