> Sifive core has that optimization for part of the cores like x280, but not
> for p470/p670, and seems like Tenstorrent Ascalon also doing that
> optimization as well? (they set that on both LLVM and GCC).

Does having that optimization imply that it is indeed as fast as or
faster than a scalar load plus a broadcast, in terms of both latency
and throughput?  IMHO we have three hardware "tiers" (slow; fast but
worse than scalar; same as or better than scalar), but the switch is
only binary.

Our design has the optimization, but scalar + broadcast is still
faster, and the same is true for the Banana Pi.  So even though it is
"fast", we'd still disable it.  I'm not sure about Ascalon; their
public numbers are from before camel-cdr updated his benchmark to
include zero strides.

--
Regards
 Robin
