On Tue, Jan 5, 2016 at 10:46 AM, Ganesh Ajjanagadde <gajja...@mit.edu> wrote: > On Tue, Jan 5, 2016 at 10:10 AM, Daniel Serpell <dserp...@gmail.com> wrote: >> Hi!, >> >> El Tue, Jan 05, 2016 at 08:08:35AM -0800, Ganesh Ajjanagadde escribio: >>> On Tue, Jan 5, 2016 at 7:44 AM, Daniel Serpell <dserp...@gmail.com> wrote: >>> > Hi!, >>> > >>> > El Mon, Jan 04, 2016 at 06:33:59PM -0800, Ganesh Ajjanagadde escribio: >>> >> This exploits an approach based on the sieve of Eratosthenes, a popular >>> >> method for generating prime numbers. >>> >> >>> >> Tables are identical to previous ones. >>> >> >>> >> Tested with FATE with/without --enable-hardcoded-tables. >>> >> >>> >> Sample benchmark (Haswell, GNU/Linux+gcc): >>> >> prev: >>> >> 7860100 decicycles in cbrt_tableinit, 1 runs, 0 skips >>> >> 7777490 decicycles in cbrt_tableinit, 2 runs, 0 skips >>> >> [...] >>> >> 7582339 decicycles in cbrt_tableinit, 256 runs, 0 skips >>> >> 7563556 decicycles in cbrt_tableinit, 512 runs, 0 skips >>> >> >>> >> new: >>> >> 2099480 decicycles in cbrt_tableinit, 1 runs, 0 skips >>> >> 2044470 decicycles in cbrt_tableinit, 2 runs, 0 skips >>> >> [...] >>> >> 1796544 decicycles in cbrt_tableinit, 256 runs, 0 skips >>> >> 1791631 decicycles in cbrt_tableinit, 512 runs, 0 skips >>> >> >>> > >>> > See attached code, function "test1", based on an approximation of: >>> > >>> > (i+1)^(1/3) ~= i^(1/3) * ( 1 + 1/(3i) - 1/(9i) + 5/(81i) - .... ) [...]
So I looked up Mr. Fog's manuals, and it seems like divides are considerably slower than multiplies and adds. This made me somewhat skeptical of your idea, and so I benched it. Seems like your patch is actually significantly slower on my setup (gcc 5.3.0, GNU libm): 3349080 decicycles in cbrt_tableinit, 1 runs, 0 skips 3466605 decicycles in cbrt_tableinit, 2 runs, 0 skips [...] 3425939 decicycles in cbrt_tableinit, 256 runs, 0 skips 3414528 decicycles in cbrt_tableinit, 512 runs, 0 skips (clang 3.7.0): 3337590 decicycles in cbrt_tableinit, 1 runs, 0 skips 3276225 decicycles in cbrt_tableinit, 2 runs, 0 skips [...] 2871065 decicycles in cbrt_tableinit, 256 runs, 0 skips 2832137 decicycles in cbrt_tableinit, 512 runs, 0 skips The divides seem to be what is killing your approach. Times (just for the divisions, clang): 1085430 decicycles in cbrt_tableinit, 1 runs, 0 skips 1005165 decicycles in cbrt_tableinit, 2 runs, 0 skips [...] 863732 decicycles in cbrt_tableinit, 256 runs, 0 skips 864907 decicycles in cbrt_tableinit, 512 runs, 0 skips Another illustration, even if I change the 1.0/(3*i) to simply i to get rid of the multiplication as well, obviously not possible but really wishful thinking, you still only get: (clang) 2013390 decicycles in cbrt_tableinit, 1 runs, 0 skips 1950135 decicycles in cbrt_tableinit, 2 runs, 0 skips [...] 1668963 decicycles in cbrt_tableinit, 256 runs, 0 skips 1653179 decicycles in cbrt_tableinit, 512 runs, 0 skips To complete my curiousity, (gcc) 1485240 decicycles in cbrt_tableinit, 1 runs, 0 skips 1522115 decicycles in cbrt_tableinit, 2 runs, 0 skips [...] 1324325 decicycles in cbrt_tableinit, 256 runs, 0 skips 1322110 decicycles in cbrt_tableinit, 512 runs, 0 skips i.e not a whole lot faster than the benchmark I posted. If you feel this can't be right, I can do give a higher level justification, which shows that for a reasonable libm cbrt, and standard architectural assumptions, the approach I have is faster. > >> >> Daniel. >> >> _______________________________________________ >> ffmpeg-devel mailing list >> ffmpeg-devel@ffmpeg.org >> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel