Andreas, On Sun, 09. Jun 13:00, Andreas Rheinhardt wrote: > Up until now, the bitmasks used to initially find out when one needs > to take a closer look and search for startcodes were rather primitive: > If a block (of four or eight bytes, depending on the system) contained a > zero, it was treated as a target for closer inspection. > This can be improved: Using the same technique of bitmasking as before, > one can test whether an arbitrary continuous range of bits consists of > zeros alone (and one can or the results of tests for non-overlapping > ranges of bits). A simple improvement would consist in simply testing > for whether every even byte does not vanish. But even better masks are > possible, in particular for big endian systems: > If all of the bits called 'a' or all of the bits called 'b' in the > uint32_t aaaa aaaa aaaa aaax bbbb bbbb bbbb bbby vanish, then this is > a candidate for closer inspection. If not, then it follows that > the three least serious bytes can't be the position of the 0x01 byte of > a 0x00 0x00 0x01 startcode. And if the most significant byte is the > position of the 0x01 of a startcode and if the same test is performed on > each block of four bytes in sequence, then all the b bits (as well as > the y bit) of the previous uint32_t would vanish and it would have been > considered a candidate for closer inspection. In other words: All > startcodes can be found with this test. The amount of candidates is > thereby reduced by a factor of about 256 (from about 4/2^8 to about > 2/2^15). For eight bytes at a time, the patterns are simply applied to > the high and low 32 bits at the same time. > Unfortunately, not so much is possible for little-endian systems because > continuous blocks of bits in the input byte array won't be continuous > blocks in the integer read. But one can nevertheless even improve upon > the "check every even/odd byte" test using the following mask: > aaaa aaaa aaaa bbbb bbbb bbbb cccc cccc > If any of these three blocks of bits vanishes, the block is a candidate > for further inspection. In the byte array, this mask corresponds to the > following situation: > cccc cccc bbbb bbbb aaaa bbbb aaaa aaaa > If a startcode's 0x01 is at the second or third position, the c block > vanishes; if it is at the fourth position, the b block vanishes and if > it is at the first position, the a block of the previous four-byte block > vanishes. (This derivation just made use of the fact that there must be > two vanishing bytes in sequence; I couldn't find a way to utilize that > the third byte also has nearly no bits set.) So all startcodes can be > found with this mask. Doubling this mask yields a mask for eight bytes > at a time. I haven't found a better mask for eight bytes than this one. > Using this, one can detect reduce the amount of false positives by about > 72% compared to the earlier check whether one of the bytes vanishes. > > Being more effective at initial filtering makes the whole process more > efficient, of course. Here are some benchmarks for a 30.2 Mb/s transport > stream A (10 iterations of 8192 runs each) and another 7.4 Mb/s transport > stream B (10 iterations of 131072 runs each) on an x64; except for > "vanilla" all versions tested didn't return false positives. > > | vanilla | aligned | aligned | > | (unaligned) | bad masks | good masks | > --|-------------|-----------|------------| > A | 411578 | 323355 | 233494 | > B | 55476 | 44147 | 31793 | > > "vanilla" is the version before this patchset, i.e. before the switch to > aligned reads. "aligned bad masks" is the version before this commit. > "aligned good masks" is the current version. > > Signed-off-by: Andreas Rheinhardt <andreas.rheinha...@gmail.com> > --- > libavcodec/startcode.c | 12 ++++++++++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > diff --git a/libavcodec/startcode.c b/libavcodec/startcode.c > index b29c1490c0..fa506a37d1 100644 > --- a/libavcodec/startcode.c > +++ b/libavcodec/startcode.c > @@ -95,10 +95,18 @@ int ff_startcode_find_candidate_c(const uint8_t *buf, int > size) > > #if HAVE_FAST_64BIT > INITIALIZATION(8); > - MAIN_LOOP(64, 0x0101010101010101ULL, 0x8080808080808080ULL); > +#if HAVE_BIGENDIAN > + MAIN_LOOP(64, 0x0002000200020002ULL, 0x8000800080008000ULL); > +#else > + MAIN_LOOP(64, 0x0010010100100101ULL, 0x8008008080080080ULL); > +#endif > #else > INITIALIZATION(4); > - MAIN_LOOP(32, 0x01010101U, 0x80808080U); > +#if HAVE_BIGENDIAN > + MAIN_LOOP(32, 0x00020002U, 0x80008000U); > +#else > + MAIN_LOOP(32, 0x00100101U, 0x80080080U); > +#endif > #endif > > /* For the end, buf is the earliest possible position > -- > 2.21.0
Below are the results of this patchset on Armv7 using OdroidXU4. OdroidXU4 has 8 cores - 4xARM Cortex A15 and 4xARM Cortex A7. Only the A15 cores were used for this test. Test sequences: Sequence C - 2560x1440, 20.4Mb/s, 8105 frames Source: https://www.youtube.com/watch?v=1La4QzGeaaQ Sequence D - 1920x1080, 4.8Mb/s, 18808 frames Source: https://www.youtube.com/watch\?v\=LXb3EKWsInQ Averaged results: Sequence C - 10 iterations with 16384 runs Sequence D - 10 iterations with ~32768 runs | before | first patch | full patchset | | patchset | only - 1/5 | 5/5 | | (decicyles) | (decicycles)| (decicyles) | --|-------------|-------------|---------------| C | 910181 | 864827 | 861376 | D | 74798 | 70529 | 69574 | I also have a Jetson Nano that uses Armv8. Let me know if you need these simulations on 64bit Arm. Andriy _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".