On Sat, Aug 15, 2015 at 12:17:27AM -0300, Pedro Arthur wrote:
> Hi,
> Since the last patch I was trying to improve the performance regression.
> First I tried to process horizontal lines in batches, processing
> (horizontal_filter_size + n) lines at a time. I also tried to remove
> branch code from the processing function, for example:
> int process(...) {
>     if (c->hcscale_fast) {
>         do_x()
>     } else {
>         do_y()
>     }
> }
> changed to:
> int process_fast(...) {do_x()}
> int process_(...) {do_y()}
>
> But these changes more or less didn't improve the performance at all.

yes, a single if() more or less per line is unlikely to make much of a
difference; lines normally have hundreds of pixels, so per-line overhead,
compared to the per-pixel work, would only have a comparably small impact

> As the most significant difference between the old and new code is that
> the color conversion is separated from the horizontal scaling I merged
> back the color conversion with the horizontal scaling and the performance
> seemed to be on par with the original code again.
>
> One point I would like to comment is the performance measurement method. I
> used 3 methods
> 1 - using the scaling code, scale each line n times and measure the total
>     scaling time
>     this method was the most reliable as the measured time deviation
>     between different runs was > 0.1%.
> 2 - Call the scaling function n times, this method was not much reliable
>     with total time deviation of 0.1% to 20%.
> 3 - Run the program n times, measured time as not reliable deviation of
>     10%-30%.
> For all the 3 methods the time measurement as done for only the horizontal
> scaling code.
>
> I think method 2 and 3 would be more close to real world usage but its
> deviation is to high to get any conclusion from its results.
>
>
> Using method 1 with merge color conversion + horizontal scaling
> performance seems to be on par with the original code.
>
> Some numbers. Performance penalty %.
> (< 0 means gain)
>

these are not git patches

> A - New code

doesn't compile (but that doesn't matter, as you say this is slower anyway)

libswscale/swscale.c: In function ‘swscale’:
libswscale/swscale.c:529:18: error: ‘i’ undeclared (first use in this function)

> B - New code with merged color conversion and horizontal scaling

time ./ffmpeg -i matrixbench_mpeg2.mpg -an -vf scale=1920:1080,scale=720:480 -f null -

old code:
real 0m20.730s
real 0m20.763s
real 0m20.765s

new code:
real 0m20.929s
real 0m20.892s
real 0m20.893s

> C - B + line batches

new code:
real 0m20.730s
real 0m20.690s
real 0m20.683s

also this seems to work well, except:

make -j4 libswscale/swscale-test
gdb --args libswscale/swscale-test
r
bt
#0  ff_rgbaToY_avx.loop () at libswscale/x86/input.asm:524
#1  0x000000000044cc17 in lum_h_scale1 (c=0x6d7100, desc=0x6e29a0, sliceY=6, sliceH=5) at libswscale/hscale.c:115
#2  0x00000000004059e9 in swscale (c=0x6d7100, src=0x7fffffffe120, srcStride=0x7fffffffe160, srcSliceY=0, srcSliceH=96, dst=0x7fffffffe140, dstStride=0x7fffffffe170) at libswscale/swscale.c:558
#3  0x00000000004082d0 in sws_scale (c=0x6d7100, srcSlice=0x7fffffffe330, srcStride=0x7fffffffe370, srcSliceY=0, srcSliceH=96, dst=0x7fffffffe350, dstStride=0x7fffffffe380) at libswscale/swscale.c:1205
#4  0x00000000004032c6 in main (argc=1, argv=0x7fffffffe4c8) at libswscale/swscale-test.c:402
(gdb) up
#1  0x000000000044cc17 in lum_h_scale1 (c=0x6d7100, desc=0x6e29a0, sliceY=6, sliceH=5) at libswscale/hscale.c:115
115         c->lumToYV12(lBuf, src[0], src[1], src[2], srcW, pal);
(gdb) print lBuf
$1 = (uint8_t *) 0x6e0460 ""
(gdb) print src[0]
$2 = (const uint8_t *) 0x0

[...]

--
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Asymptotically faster algorithms should always be preferred if you have
asymptotical amounts of data
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel