On 20 June 2024 18:02:31 GMT+02:00, Zhao Zhili <quinkbl...@foxmail.com> wrote:
>
>
>> On Jun 20, 2024, at 20:49, Martin Storsjö <mar...@martin.st> wrote:
>> 
>> On Thu, 20 Jun 2024, Zhao Zhili wrote:
>> 
>>>> On Jun 19, 2024, at 20:05, Rémi Denis-Courmont <r...@remlab.net> wrote:
>>>> On 19 June 2024 11:24:28 GMT+02:00, Zhao Zhili <quinkbl...@foxmail.com> wrote:
>>>>>> On Jun 19, 2024, at 15:07, Rémi Denis-Courmont <r...@remlab.net> wrote:
>>>>>> On 15 June 2024 11:57:18 GMT+02:00, Zhao Zhili <quinkbl...@foxmail.com> wrote:
>>>>>>> diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
>>>>>>> index 2b956fe5c2..37f1158504 100644
>>>>>>> --- a/libswscale/aarch64/input.S
>>>>>>> +++ b/libswscale/aarch64/input.S
>>>>>>> @@ -20,8 +20,12 @@
>>>>>>> #include "libavutil/aarch64/asm.S"
>>>>>>> -.macro rgb_to_yuv_load_rgb src
>>>>>>> +.macro rgb_to_yuv_load_rgb src, element=3
>>>>>>> +    .if \element == 3
>>>>>>>      ld3             { v16.16b, v17.16b, v18.16b }, [\src]
>>>>>>> +    .else
>>>>>>> +        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [\src]
>>>>>>> +    .endif
>>>>>>>      uxtl            v19.8h, v16.8b             // v19: r
>>>>>>>      uxtl            v20.8h, v17.8b             // v20: g
>>>>>>>      uxtl            v21.8h, v18.8b             // v21: b
>>>>>>> @@ -43,7 +47,7 @@
>>>>>>>      sqshrn2         \dst\().8h, \dst2\().4s, \right_shift   // dst_higher_half = dst2 >> right_shift
>>>>>>> .endm
>>>>>>> -.macro rgbToY bgr
>>>>>>> +.macro rgbToY bgr, element=3
>>>>>> AFAICT, you don't need a macro parameter for component order. Just 
>>>>>> swap red and blue coefficients in the prologue and then run the 
>>>>>> bit-exact same loops for bgr/rgb, rgba/bgra and argb/abgr. This adds one 
>>>>>> branch in the prologue but that's mostly negligible compared to the loop.
>>>>> I’m not sure where to add the branch. Could you elaborate? Do you mean 
>>>>> loading the coefficients first, as in the following?
>>>>> function ff_bgr24ToUV_half_neon, export=1
>>>>>      ldr             w12, [x6, #12]
>>>>>      ldr             w11, [x6, #16]
>>>>>      ldr             w10, [x6, #20]
>>>>>      ldr             w15, [x6, #24]
>>>>>      ldr             w14, [x6, #28]
>>>>>      ldr             w13, [x6, #32]
>>>>>      rgbToUV_half
>>>>> endfunc
>>>> Hmm, no. You need to jump past the loading of red and blue coefficients. 
>>>> It might help to load green coefficients last.
>>>> By the way, I think you can use LDP instead of LDR.
>>> 
>>> Patch v2 replaces LDR with LDP, so the "jump past the loading of red and 
>>> blue coefficients" suggestion no longer applies.
>> 
>> Rémi's point is that you don't need to duplicate the whole function, when 
>> the only thing you're changing is a couple of instructions in the prologue 
>> of the function. By reusing the actual bulk of the function, you save on 
>> binary size.
>
>Thank you for the detailed examples. I missed that the key point here is to 
>save binary size.
>
>I have seen a similar example of fall-through in riscv/input_rvv.S. Is it well 
>defined to jump to a local label in another function?

Falling through is well defined so long as we don't use function-sections. 
Jumping to a label inside another function is well defined, as the assembler 
has no notion of what a function is.

`func` and `endfunc` are just FFmpeg macros for defining symbols.
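
A minimal sketch of the pattern discussed above, in plain GNU assembler syntax for AArch64 rather than with the FFmpeg macros. The symbol names, the three-entry coefficient table at x1, and the single-pixel body are all hypothetical and only stand in for the real loops. It assumes both entry points are emitted into the same section, so the numeric label resolves across the two symbols.

```
        .text

// Hypothetical BGR entry point: load the red and blue coefficients
// swapped, then jump past the corresponding loads in the RGB entry.
        .global sketch_bgr_entry
sketch_bgr_entry:
        ldr     w10, [x1, #8]       // blue coefficient for byte 0
        ldr     w12, [x1]           // red coefficient for byte 2
        b       1f                  // label is defined under the next symbol

// Hypothetical RGB entry point.
        .global sketch_rgb_entry
sketch_rgb_entry:
        ldr     w10, [x1]           // red coefficient for byte 0
        ldr     w12, [x1, #8]       // blue coefficient for byte 2
1:
        ldr     w11, [x1, #4]       // green coefficient, loaded last and shared
        // Shared body (stand-in for the real loop):
        // w0 = p[0]*w10 + p[1]*w11 + p[2]*w12, with p = x0
        ldrb    w13, [x0]
        ldrb    w14, [x0, #1]
        ldrb    w15, [x0, #2]
        mul     w13, w13, w10
        madd    w13, w14, w11, w13
        madd    w0, w15, w12, w13
        ret
```

Only the two short prologues are duplicated. Everything from the `1:` label onward exists once in the binary, which is where the size saving comes from.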
