Re: [FFmpeg-devel] [PATCH] swresample/arm: add ff_resample_common_apply_filter_{x4, x8}_{float, s16}_neon

Benoit Fouet Thu, 12 May 2016 01:02:26 -0700

Hi,

I mostly have nit remarks.


On 11/05/2016 18:39, Matthieu Bouron wrote:

From: Matthieu Bouron <matthieu.bou...@stupeflix.com>


[...]

diff --git a/libswresample/arm/resample.S b/libswresample/arm/resample.S
new file mode 100644
index 0000000..13462e3
--- /dev/null
+++ b/libswresample/arm/resample.S
@@ -0,0 +1,77 @@

[...]

+function ff_resample_common_apply_filter_x4_float_neon, export=1
+    vmov.f32            q0, #0.0                                       @ accumulator
+1:  vld1.32             {q1}, [r1]!                                    @ src
+    vld1.32             {q2}, [r2]!                                    @ filter
+    vmla.f32            q0, q1, q2                                     @ src + {0..3} * filter + {0..3}


nit: the comment could be "accu += src[0..3] . filter[0..3]"
same for the other ones below
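
To spell out that formulation, here is a rough scalar C sketch of what the x4 float loop computes, as I read the assembly. The name apply_filter_x4_float_ref and the lane[] array are made up for illustration and are not part of the patch; the sketch also assumes filter_length is a positive multiple of 4.

```c
#include <assert.h>

/* Scalar reference for the x4 float NEON loop (illustrative only;
 * apply_filter_x4_float_ref is a hypothetical name, not in the patch). */
static void apply_filter_x4_float_ref(float *acc, const float *src,
                                      const float *filter, int filter_length)
{
    float lane[4] = { 0.0f, 0.0f, 0.0f, 0.0f };  /* q0: four partial sums */
    int i, j;

    /* assumption: the caller passes a positive multiple of 4 */
    assert(filter_length > 0 && filter_length % 4 == 0);

    for (i = 0; i < filter_length; i += 4)        /* one vmla.f32 per pass */
        for (j = 0; j < 4; j++)
            lane[j] += src[i + j] * filter[i + j]; /* accu += src[i..i+3] . filter[i..i+3] */

    /* the two vpadd.f32: first (l0+l1, l2+l3), then the sum of both */
    *acc = (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

The four lanes stay separate until the end of the loop, which is why "accu += src[0..3] . filter[0..3]" describes the vmla line more accurately than the current comment.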

[...]

+    subs                r3, #4                                         @ filter_length -= 4
+    bgt                 1b                                             @ loop until filter_length
+    vpadd.f32           d0, d0, d1                                     @ pair adding of the 4x32-bit accumulated values
+    vpadd.f32           d0, d0, d0                                     @ pair adding of the 4x32-bit accumulator values
+    vst1.32             {d0[0]}, [r0]                                  @ write accumulator
+    mov pc, lr
+endfunc
+
+function ff_resample_common_apply_filter_x8_float_neon, export=1
+    vmov.f32            q0, #0.0                                       @ accumulator
+1:  vld1.32             {q1}, [r1]!                                    @ src1
+    vld1.32             {q2}, [r2]!                                    @ filter1
+    vld1.32             {q8}, [r1]!                                    @ src2
+    vld1.32             {q9}, [r2]!                                    @ filter2
+    vmla.f32            q0, q1, q2                                     @ src1 + {0..3} * filter1 + {0..3}
+    vmla.f32            q0, q8, q9                                     @ src2 + {0..3} * filter2 + {0..3}


instead of using src1 and src2, you may want to use src[0..3] and src[4..7]
so, if I reuse the formulation I proposed above:
accu += src[0..3] . filter[0..3]
accu += src[4..7] . filter[4..7]

+    subs                r3, #8                                         @ filter_length -= 4


-= 8

[...]

diff --git a/libswresample/arm/resample_init.c b/libswresample/arm/resample_init.c
new file mode 100644
index 0000000..c817d03
--- /dev/null
+++ b/libswresample/arm/resample_init.c

[...]

+static int ff_resample_common_##TYPE##_neon(ResampleContext *c, void *dest, const void *source,   \
+                                            int n, int update_ctx)                                \
+{                                                                                                 \
+    DELEM *dst = dest;                                                                            \
+    const DELEM *src = source;                                                                    \
+    int dst_index;                                                                                \
+    int index= c->index;                                                                          \
+    int frac= c->frac;                                                                            \
+    int sample_index = index >> c->phase_shift;                                                   \
+    int x4_aligned_filter_length = c->filter_length & ~3;                                         \
+    int x8_aligned_filter_length = c->filter_length & ~7;                                         \
+                                                                                                  \
+    index &= c->phase_mask;                                                                       \
+    for (dst_index = 0; dst_index < n; dst_index++) {                                             \
+        FELEM *filter = ((FELEM *) c->filter_bank) + c->filter_alloc * index;                     \
+                                                                                                  \
+        FELEM2 val=0;                                                                             \
+        int i = 0;                                                                                \
+        if (x8_aligned_filter_length >= 8) {                                                      \
+            ff_resample_common_apply_filter_x8_##TYPE##_neon(&val, &src[sample_index],            \
+                                                             filter, x8_aligned_filter_length);   \
+            i += x8_aligned_filter_length;                                                        \
+                                                                                                  \
+        } else if (x4_aligned_filter_length >= 4) {                                               \

Do you think there could be a gain in processing the remainder of the 8-aligned part through the 4-aligned part of the code? E.g. for a filter length of 15, that would make:

 - one run of the 8-aligned
 - one run of the 4-aligned
 - 3 C loops

As you stated the filter length seems to commonly be 32, I guess that wouldn't be easy to benchmark, though.
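
A rough sketch of the cascade I have in mind, in plain C. Everything here is invented for the example: apply_filter_ref is a hypothetical scalar stand-in for the x8/x4 NEON functions, and apply_filter_cascaded with its counts[] bookkeeping only exists to show which path handles which taps. As far as I can tell from the x4 float version, the NEON functions store into their first argument rather than accumulate, so the sketch keeps each partial result in a temporary.

```c
#include <assert.h>

/* Hypothetical scalar stand-in for the x8/x4 NEON kernels:
 * overwrites *acc with the dot product of len taps. */
static void apply_filter_ref(float *acc, const float *src,
                             const float *filter, int len)
{
    float sum = 0.0f;
    int i;

    for (i = 0; i < len; i++)
        sum += src[i] * filter[i];
    *acc = sum;
}

/* counts[0..2]: number of x8 passes, x4 passes and scalar iterations */
static float apply_filter_cascaded(const float *src, const float *filter,
                                   int filter_length, int counts[3])
{
    float val = 0.0f, part;
    int i = 0;
    int x8_len = filter_length & ~7;

    assert(filter_length >= 0);
    counts[0] = counts[1] = counts[2] = 0;

    if (x8_len >= 8) {                      /* 8-aligned part */
        apply_filter_ref(&part, src, filter, x8_len);
        val += part;
        i += x8_len;
        counts[0] = x8_len / 8;
    }
    if (filter_length - i >= 4) {           /* at most one 4-wide pass is left */
        apply_filter_ref(&part, src + i, filter + i, 4);
        val += part;
        i += 4;
        counts[1] = 1;
    }
    for (; i < filter_length; i++) {        /* 0 to 3 remaining taps in C */
        val += src[i] * filter[i];
        counts[2]++;
    }
    return val;
}
```

With a filter length of 15 this gives one 8-wide pass, one 4-wide pass and 3 scalar iterations, instead of one 8-wide pass and 7 scalar iterations with the else-if.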


[...]

--
Ben

