>> Using 128-bit broadcasts is preferable over duplicating the constants
>> to 256-bit unless there's a good reason for doing so since it wastes
>> less cache and is faster on AMD CPUs.
>
> What would that reason be? AFAIK broadcasts are expensive, since they
> both load from memory then splat data across lanes. Using them inside
> loops doesn't sound like a good idea. But I guess you have more
> experience testing with more varied chips than I do.

128-bit broadcasts from memory are done in the load unit for free on all
AVX2-capable CPUs.

> Also, by AMD cpus you mean Ryzen? Because on Bulldozer-based CPUs we
> purposely disabled functions using ymm regs.

Yes. 128-bit broadcasts have twice the throughput of 256-bit loads on
Ryzen, since it only has 128-bit load units.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
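
For illustration only, here is a rough C sketch of the pattern being recommended: keep the constant at 16 bytes in memory and splat it into both lanes at load time. The names `pb_example`, `splat128_avx2` and `splat128` are made up for this example and are not FFmpeg symbols, and the `target` attribute and `__builtin_cpu_supports` assume GCC or Clang on x86. With optimization enabled, compilers will usually fold the load and the intrinsic into a single `vbroadcasti128 ymm, m128`, but that is not guaranteed.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical 16-byte constant, stored once instead of duplicated to 32 bytes. */
static const uint8_t pb_example[16] __attribute__((aligned(16))) = {
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
};

/* Load 16 bytes and replicate them into both 128-bit lanes of a ymm register. */
__attribute__((target("avx2")))
static void splat128_avx2(uint8_t out[32])
{
    __m128i lane = _mm_load_si128((const __m128i *)pb_example);
    __m256i both = _mm256_broadcastsi128_si256(lane);
    _mm256_storeu_si256((__m256i *)out, both);
}

/* Returns 1 and fills out[] if the CPU supports AVX2, otherwise returns 0. */
static int splat128(uint8_t out[32])
{
    if (!__builtin_cpu_supports("avx2"))
        return 0;
    splat128_avx2(out);
    return 1;
}
```

Compared with a 32-byte copy of the same constant, this touches half as much data in cache, which is the saving mentioned above.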