mips: MSA (MIPS-SIMD-Arch) optimizations for VP9 lpf functions

Shivraj Patil Mon, 20 Jul 2015 07:32:25 -0700

Hi,

On Thu, Jul 9, 2015 at 9:15 AM, 
<shivraj.pa...@imgtec.com<mailto:shivraj.pa...@imgtec.com>> wrote:
+    if (__msa_test_bz_v(flat)) {
+        p1_d = __msa_copy_u_d((v2i64) p1_out, 0);
+        p0_d = __msa_copy_u_d((v2i64) p0_out, 0);
+        q0_d = __msa_copy_u_d((v2i64) q0_out, 0);
+        q1_d = __msa_copy_u_d((v2i64) q1_out, 0);
+        SD4(p1_d, p0_d, q0_d, q1_d, (src - 2 * pitch), pitch);
+    } else {


Can you elaborate on what this does? Does it check that none of the pixels in 
the vector of cols/rows has flat=1, and takes a shortcut if that's true? Of 
something else? (If I'm right in my assumption, can you please add a comment to 
that effect?)

Your assumption is correct. I will add appropriate comments.

+static void vp9_lpf_vertical_16_dual_msa(uint8_t *src, int32_t pitch,
+                                         uint8_t *b_limit_ptr,
+                                         uint8_t *limit_ptr,
+                                         uint8_t *thresh_ptr)
+{
+    uint8_t early_exit = 0;
+    uint8_t transposed_input[16 * 24] ALLOC_ALIGNED(ALIGNMENT);
+    uint8_t *filter48 = &transposed_input[16 * 16];
+
+    vp9_transpose_16x16((src - 8), pitch, &transposed_input[0], 16);
+
+    early_exit = vp9_vt_lpf_t4_and_t8_16w((transposed_input + 16 * 8),
+                                          &filter48[0], src, pitch,
+                                          b_limit_ptr, limit_ptr, thresh_ptr);
+
+    if (0 == early_exit) {
+        early_exit = vp9_vt_lpf_t16_16w((transposed_input + 16 * 8), src, 
pitch,
+                                        &filter48[0]);
+
+        if (0 == early_exit) {
+            vp9_transpose_16x16(transposed_input, 16, (src - 8), pitch);
+        }
+    }
+}

Since no state is shared between t16 and t4/t8, it suggests you're calculating 
some of the filters twice (since part of the condition of whether to apply the 
t16 filter is whether to apply the t8 filter), is that true? If so, do you 
think it's worth modifying this so the check on whether to run t4 or t8 is not 
re-evaluated in t16?

t4 and t8 are not re-evaluated in t16 filter function “vp9_vt_lpf_t16_16w”.
t4 and t8 filters are applied in function “vp9_vt_lpf_t4_and_t8_16w” and output 
is stored at location pointed by “filter48”, which is later used in t16 
function based on early exit flag.

+void ff_loop_filter_v_84_16_msa(uint8_t *src, ptrdiff_t stride,
+                                int32_t e, int32_t i, int32_t h)
+{
+    uint8_t e1, i1, h1;
+    uint8_t e2, i2, h2;
+
+    e1 = e & 0xff;
+    i1 = i & 0xff;
+    h1 = h & 0xff;
+
+    e2 = e >> 8;
+    i2 = i >> 8;
+    h2 = h >> 8;
+
+    vp9_lpf_horizontal_8_msa(src, stride, &e1, &i1, &h1, 1);
+    vp9_lpf_horizontal_4_msa(src + 8, stride, &e2, &i2, &h2, 1);
+}

So I think you're missing the point of why this exists. The simd code for e.g. 
88_16 suggests you're capable of doing 16 pixels at once in a single iteration, 
right? The idea here is that you can use the fact that t4 is a strict subset of 
t8 to run them both in the same iteration, with simply a mask at the end to 
assure that "whether to run t8 or t4" for the t4 half of the pixels is always 
0. Look at the x86 simd code for details on how that would work exactly.

Agreed, will incorporate the same.

Shivraj
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

Re: [FFmpeg-devel] [PATCH 2/4] avcodec/mips: MSA (MIPS-SIMD-Arch) optimizations for VP9 lpf functions

Reply via email to