On Sat, Jul 6, 2024, at 11:34, Dean Rasheed wrote: > On Fri, 5 Jul 2024 at 18:37, Joel Jacobson <j...@compiler.org> wrote: >> >> On Fri, Jul 5, 2024, at 18:42, Joel Jacobson wrote: >> > Very nice, v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch >> > is now the winner on all my CPUs: >> >> I thought it would be interesting to also measure the isolated effect >> on just numeric_mul() without the query overhead. >> >> Impressive speed-up, between 25% - 81%. >> > > Cool. I think we should go with the mul_var_small() patch then, since > it's more generally applicable.
I agree. > I also did some testing with much larger var2 values, and saw similar > speed-ups. One high-level function that benefits from that is > factorial(), which accepts inputs up to 32177, and so uses both the > 1-digit and 2-digit code with very large var2 values. I doubt anyone > actually uses it with such large inputs, but it's interesting > nonetheless: > > SELECT factorial(32177); > Time: 923.117 ms -- HEAD > Time: 534.375 ms -- mul_var_small() patch Nice! > I did one more round of (mostly cosmetic) copy-editing. Aside from > improving some of the comments, it occurred to me that there's no need > to pass rscale to mul_var_small(), or for it to call round_var(), > since it's always computing the exact result. That shaves off a few > more cycles. Nice, also cleaner. > Additionally, I didn't like how res_weight and res_ndigits were being > set 1 higher than they needed to be. That makes sense in mul_var() > because it may round the result, causing a non-zero carry to propagate > into the next digit up, but it's just confusing in mul_var_small(). So > I've reduced those by 1, which makes the look much more logical. To be > clear, this doesn't change how many digits we're calculating. But now > res_ndigits is actually the number of digits being calculated, whereas > before, res_ndigits was 1 larger and we were calculating res_ndigits - > 1 digits, which was confusing. Nice, much cleaner. > I think this is good to go, so unless there are any further comments, > I plan to commit it soon. LGTM. Benchmark, only on Apple M3 Max: SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD Time: 3042.157 ms (00:03.042) Time: 3027.711 ms (00:03.028) Time: 3078.215 ms (00:03.078) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v8 Time: 2700.676 ms (00:02.701) Time: 2713.594 ms (00:02.714) Time: 2704.139 ms (00:02.704) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD Time: 4506.064 ms (00:04.506) Time: 3316.204 ms (00:03.316) Time: 3321.086 ms (00:03.321) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v8 Time: 2904.786 ms (00:02.905) Time: 2921.996 ms (00:02.922) Time: 2919.269 ms (00:02.919) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD Time: 4636.051 ms (00:04.636) Time: 3439.951 ms (00:03.440) Time: 3471.245 ms (00:03.471) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v8 Time: 3034.364 ms (00:03.034) Time: 3025.351 ms (00:03.025) Time: 3075.024 ms (00:03.075) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD Time: 4978.086 ms (00:04.978) Time: 3580.283 ms (00:03.580) Time: 3582.719 ms (00:03.583) SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v8 Time: 3147.352 ms (00:03.147) Time: 3135.903 ms (00:03.136) Time: 3172.491 ms (00:03.172) Regards, Joel