On Tue, Nov 4, 2025 at 8:57 PM Robin Dapp <[email protected]> wrote:
>
> > SiFive has that optimization on some of its cores, such as the x280, but
> > not on the p470/p670, and it seems Tenstorrent Ascalon has that
> > optimization as well? (They set that in both LLVM and GCC.)
>
> Does having that optimization imply that it is indeed as fast or faster than a
> scalar load and a broadcast in terms of latency and throughput?
> IMHO we have three hardware "tiers" (slow, fast but worse than scalar, same as
> or better than scalar) but the switch is only binary.  Our design has the
> optimization but using scalar + broadcast is still faster and the same is true
> for the Banana Pi.  So even though it is "fast" we'd still disable it.

Yeah, a zero-stride load is faster than scalar load + broadcast on the
x280 and on all other SiFive Intelligence cores.
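
To make the three tiers vs. binary switch point concrete, here is a rough
sketch of how the tiers could collapse into a single on/off decision.  All
names here (zero_stride_tier, prefer_zero_stride_load) are hypothetical and
only for illustration; this is not the actual GCC or LLVM tuning interface.

```c
#include <stdbool.h>

/* Hypothetical classification of zero-stride vector load performance,
   following the three tiers described in the quoted mail.  */
enum zero_stride_tier
{
  ZS_SLOW,              /* No hardware optimization at all.  */
  ZS_FAST_BUT_WORSE,    /* Optimized, but scalar load + broadcast still wins.  */
  ZS_SAME_OR_BETTER     /* Same as or better than scalar load + broadcast.  */
};

/* Collapse the three tiers into the binary switch (assumed policy):
   only the top tier should prefer the zero-stride load.  */
static bool
prefer_zero_stride_load (enum zero_stride_tier tier)
{
  return tier == ZS_SAME_OR_BETTER;
}
```

Under this assumed policy a core in the middle tier would still leave the
switch off, even though its zero-stride loads are "fast".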


> I'm not sure about Ascalon, their public numbers are from before camel-cdr
> updated his benchmark to include zero strides.
>
> --
> Regards
>  Robin
>
