Hi Rémi,

Thanks for your reply.

> It was faster on what the best approximation of real hardware available at 
> the time, i.e. a Sipeed Lichee Pi4A board. There are no benchmarks in the 
> commit because I don't like to publish benchmarks collected from prototypes.
> Nevertheless I think the commit message hints enough that anybody could 
> easily guess that it was a performance optimisation, if I'm being honest.
> 
> This is not exactly surprising: typical hardware can only access so many 
> memory addresses simultaneously (i.e. one or maybe two), so indexed loads and 
> strided loads are bound to be much slower than unit-strided loads.

I agree that indexed and strided loads and stores are certainly slower than 
unit-strided ones. However, the vrgather instruction is unlikely to be very 
performant either, unless the vector length is relatively short. In 
particular, if vector register groups are used via a length multiplier (LMUL) 
of, e.g., 8, then any element in the destination vector register group could 
be sourced from any element of the 8 source vector registers (i.e., a quarter 
of the vector register file).
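To make the data flow concrete, here is a scalar reference model of what 
vrgather computes (a sketch in plain C, not the actual ffmpeg code; checking 
indices against vl rather than VLMAX is a simplification of the spec):

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified scalar model of vrgather.vv for 32-bit elements:
 * dst[i] = src[idx[i]], with out-of-range indices yielding zero.
 * (The real instruction checks indices against VLMAX; checking
 * against vl here is a simplification.) Any destination element
 * may select any source element, which is what makes a full
 * single-pass permutation expensive in hardware. */
static void vrgather_ref(uint32_t *dst, const uint32_t *src,
                         const uint32_t *idx, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = idx[i] < vl ? src[idx[i]] : 0;
}
```

With idx[i] = vl - 1 - i this reverses the element order, which is presumably 
how the patch uses it.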

AFAIK (but please correct me if I am wrong) the Sipeed Lichee Pi4A uses a 
quad-core XT-910, which, depending on the exact variant, has a vector register 
length (VLEN) of either 64 or 128 bits. Given the configured element width of 
32 bits and a length multiplier of 2, we are looking at vectors of 4 or 8 
elements.

There is a comment that reads "e16/m2 and e32/m4 are possible but slower due to 
gather", which does not surprise me, since the execution time of vrgather most 
likely scales quadratically with the vector length. For the same reason, 
vrgather is likely to perform worse on a RISC-V CPU with a larger VLEN: the 
hardware resources for a crossbar capable of permuting across the full vector 
register length become prohibitive beyond a VLEN of about 128 bits, so the 
permutation must instead be spread over several iterations that together cover 
every combination of input and output elements (hence the quadratic growth in 
execution time).

By contrast, the performance of strided loads and stores, while certainly 
slower than unit-strided ones, likely scales linearly with the vector length, 
so on CPUs with a large VLEN the original code could very well run faster than 
the vrgather variant, despite the slower strided loads and stores.

> Maybe you have access to special hardware that is able to optimise the 
> special case of strides equal to minus one to reduce the number of memory 
> accesses.
> But I didn't back then, and as a matter of fact, I still don't. Hardware 
> donations are welcome.

Hardware availability is indeed still an issue for RISC-V vector processing.

> > The RISC-V vector loads and stores support negative stride values for 
> > use cases such as this one.
> 
> [Citation required]

The purpose of strided loads and stores is to load/store elements that are not 
consecutive in memory, but instead separated by a constant offset. 
Additionally, the authors of the specification decided to allow negative stride 
values, since they apparently deemed it useful to be able to reverse the order 
of those elements.
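As a sketch of those semantics (a plain-C reference model, not actual RVV 
code), a strided 32-bit load such as vlse32.v reads element i from base + 
i * stride, where the stride is a byte offset that may be negative:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a strided 32-bit load (vlse32.v): element i is read
 * from base + i * stride, where stride is a byte offset and may be
 * negative or zero. Pointing base at the last element of a buffer and
 * using stride = -4 loads the 32-bit elements in reversed order. */
static void vlse32_ref(uint32_t *dst, const uint8_t *base,
                       ptrdiff_t stride, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        uint32_t v;
        memcpy(&v, base + (ptrdiff_t)i * stride, 4); /* unaligned-safe */
        dst[i] = v;
    }
}
```

For example, vlse32_ref(dst, (const uint8_t *)&buf[n - 1], -4, n) yields buf 
in reversed element order.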

> > Using vrgather instead replaces the more specific operation with a 
> > more generic one,
> 
> That is a very subjective and unsubstantiated assertion. This feels a bit 
> hypocritical while you are attacking me for not providing justification.

vrgather is more generic because it can be used for any kind of permutation, 
which strided loads and stores cannot. This is not subjective.

> As far as I can tell, neither instruction are specific to reversing vector 
> element order. An actual real-life specific instruction exists on Arm in the 
> form of vector-reverse. I don't know any ISA with load-reverse or store- 
> reverse.

A load-reverse or store-reverse would just be a special case of a strided 
load/store, namely one with a stride equal to minus the element size.

> > which is likely to be less performant on most HW architectures.
> 
> Would you care to define "most architectures"? I only know one commercially 
> available hardware architecture as of today, Kendryte K230 SoC with T-Head
> C908 CPU, so I can't make much sense of your sentence here.

When writing about the performance of vrgather, I primarily had in mind the 
scalability issues explained above. It seems that you have already run into 
these, since you found that a larger LMUL reduces the performance of vrgather.

> > In addition, it requires to setup an index vector,
> 
> That is irrelevant since in this loop, the vector bank is not a bottleneck.
> The loop can run with maximal LMUL either way. And besides, the loop turned 
> out to be faster with a smaller multiplier.

That is because the performance of vrgather does not scale linearly. I would 
assume that this does not happen with the original code (i.e., the performance 
of strided loads/stores does not decrease for larger LMUL).

> > thus raising dynamic instruction count.
> 
> It adds only one instruction (reverse subtraction) in the main loop,

If I read the diff correctly, the strided load is replaced by a unit-strided 
load plus a vrgather (one instruction replaced by two). Together with the 
reverse subtraction, that makes two additional instructions in the main loop.

> and even that could be optimised away if relevant.

How would the reverse subtraction be optimized away? I assume that it needs to 
be part of the loop since it depends on the VL of the current iteration.
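To illustrate that dependence, here is a scalar sketch of the stripmined loop 
(plain C; VLMAX and the function name are made up for illustration). The 
reversed index vector corresponds to the vid.v + vrsub.vx pair, and the vrsub 
operand (vl - 1) changes whenever the tail iteration runs with a shorter vl:

```c
#include <stddef.h>
#include <stdint.h>

#define VLMAX 8 /* stand-in for the hardware's maximum vl */

/* Stripmined reversal sketch: each iteration gathers one chunk with the
 * reversed indices idx[i] = vl - 1 - i (the vid.v + vrsub.vx pair) and
 * stores it unit-strided. Because the final iteration may run with a
 * shorter vl, the vrsub operand (vl - 1) is loop-dependent. */
static void reverse_ref(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t done = 0;
    while (done < n) {
        size_t vl = n - done < VLMAX ? n - done : VLMAX; /* vsetvli */
        const uint32_t *chunk = src + (n - done - vl);
        for (size_t i = 0; i < vl; i++)
            dst[done + i] = chunk[vl - 1 - i]; /* gather, reversed */
        done += vl;
    }
}
```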

Michael
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
