Hi,
I mostly have nits remarks.
On 11/05/2016 18:39, Matthieu Bouron wrote:
From: Matthieu Bouron <matthieu.bou...@stupeflix.com>
[...]
diff --git a/libswresample/arm/resample.S b/libswresample/arm/resample.S
new file mode 100644
index 0000000..13462e3
--- /dev/null
+++ b/libswresample/arm/resample.S
@@ -0,0 +1,77 @@
[...]
+function ff_resample_common_apply_filter_x4_float_neon, export=1
+ vmov.f32 q0, #0.0 @
accumulator
+1: vld1.32 {q1}, [r1]! @ src
+ vld1.32 {q2}, [r2]! @ filter
+ vmla.f32 q0, q1, q2 @ src +
{0..3} * filter + {0..3}
nit: the comment could be "accu += src[0..3] . filter[0..3]"
same for the other ones below
[...]
+ subs r3, #4 @
filter_length -= 4
+ bgt 1b @ loop
until filter_length
+ vpadd.f32 d0, d0, d1 @ pair
adding of the 4x32-bit accumulated values
+ vpadd.f32 d0, d0, d0 @ pair
adding of the 4x32-bit accumulator values
+ vst1.32 {d0[0]}, [r0] @ write
accumulator
+ mov pc, lr
+endfunc
+
+function ff_resample_common_apply_filter_x8_float_neon, export=1
+ vmov.f32 q0, #0.0 @
accumulator
+1: vld1.32 {q1}, [r1]! @ src1
+ vld1.32 {q2}, [r2]! @
filter1
+ vld1.32 {q8}, [r1]! @ src2
+ vld1.32 {q9}, [r2]! @
filter2
+ vmla.f32 q0, q1, q2 @ src1
+ {0..3} * filter1 + {0..3}
+ vmla.f32 q0, q8, q9 @ src2
+ {0..3} * filter2 + {0..3}
instead of using src1 and src2, you may want to use src[0..3] and src[4..7]
so, if I reuse the formulation I proposed above:
accu += src[0..3] . filter[0..3]
accu += src[4..7] . filter[4..7]
+ subs r3, #8 @
filter_length -= 4
-= 8
[...]
diff --git a/libswresample/arm/resample_init.c
b/libswresample/arm/resample_init.c
new file mode 100644
index 0000000..c817d03
--- /dev/null
+++ b/libswresample/arm/resample_init.c
[...]
+static int ff_resample_common_##TYPE##_neon(ResampleContext *c, void *dest,
const void *source, \
+ int n, int update_ctx)
\
+{
\
+ DELEM *dst = dest;
\
+ const DELEM *src = source;
\
+ int dst_index;
\
+ int index= c->index;
\
+ int frac= c->frac;
\
+ int sample_index = index >> c->phase_shift;
\
+ int x4_aligned_filter_length = c->filter_length & ~3;
\
+ int x8_aligned_filter_length = c->filter_length & ~7;
\
+
\
+ index &= c->phase_mask;
\
+ for (dst_index = 0; dst_index < n; dst_index++) {
\
+ FELEM *filter = ((FELEM *) c->filter_bank) + c->filter_alloc * index;
\
+
\
+ FELEM2 val=0;
\
+ int i = 0;
\
+ if (x8_aligned_filter_length >= 8) {
\
+ ff_resample_common_apply_filter_x8_##TYPE##_neon(&val,
&src[sample_index], \
+ filter,
x8_aligned_filter_length); \
+ i += x8_aligned_filter_length;
\
+
\
+ } else if (x4_aligned_filter_length >= 4) {
\
do you think there could be a gain processing the remainder of the
8-aligned part through the 4-aligned part of the code? e.g. for a filter
length of 15, that would make:
- one run of the 8-aligned
- one run of the 4-aligned
- 3 C loops
As you stated filter length seems to commonly be 32, I guess that
wouldn't be easy to benchmark, though.
[...]
--
Ben
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel