On Tue, Apr 6, 2021 at 2:51 AM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> > > Do you know which of the three changes (preferring rep movsb/stosb,
> > > the CLEAR_RATIO change and the algorithm choice change) cause the
> > > two speedups on EEMBC?
> >
> > An extracted testcase from nnet_test is at https://godbolt.org/z/c8KdsohTP
> >
> > This loop is transformed to builtin_memcpy and builtin_memset with size 280.
> >
> > The current strategy for Skylake is {512, unrolled_loop, false} for such
> > a size, so it generates unrolled loops with mov, while the patch
> > generates a memcpy/memset libcall and uses vector moves.
>
> This is good - I originally set the table based on this
> micro-benchmarking script, and apparently the glibc used at that time
> had a more expensive memcpy for small blocks.
>
> One thing to consider, however, is that calling an external memcpy also
> has the additional cost of clobbering all caller-saved registers.
> Especially for code that uses SSE this is painful, since everything then
> needs to go to the stack. So I am not completely sure how representative
> the micro-benchmark is in this respect, since it does not use any SSE
> and register pressure is generally small.
>
> So with the current glibc it seems a libcall is a win for blocks larger
> than 64 or 128 bytes, at least if the register pressure is not big.
> In this respect your change looks good.
>
> > > My patch generates "rep movsb" only in very limited cases:
> > >
> > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> > >    load and store for up to 16 * 16 (256) bytes when the data size is
> > >    fixed and known.
> > > 2. Inline only if the data size is known to be <= 256.
> > >    a. Use "rep movsb/stosb" with a simple code sequence if the data
> > >       size is a constant.
> > >    b. Use a loop if the data size is not a constant.
>
> Aha, this is very hard to read from the algorithm descriptor. So we
> still have the check that maxsize == minsize and use rep movsb only for
> constant-sized blocks when the corresponding TARGET macro is defined.
>
> I think it would be more readable if we introduced rep_1_byte_constant.
> The descriptor is supposed to read as a sequence of rules where the
> first match applies. It is not obvious that we have another TARGET_*
> macro that makes rep_1_byte be ignored in some cases.
> (The TARGET macro will also interfere with the micro-benchmarking script.)
>
> Still, I do not understand why a compile-time constant makes rep
> movsb/stosb better than a loop. Is it the CPU special-casing it at
> decode time and requiring an explicit mov instruction? Or is it only
> because rep movsb is not good for blocks smaller than 128 bits?

Non-constant "rep movsb" triggers more machine clear events in hot loops
of some workloads:

https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/mo-machine-clear-overhead.html
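
For concreteness, here is a minimal stand-in with the same shape as the
extracted nnet_test loop quoted above (an assumed simplification - the
real testcase is in the godbolt link, and the names here are made up):
a fixed trip count of 35 doubles, i.e. 280 bytes known at compile time.

  /* Stand-in for the extracted nnet_test loop; not the actual EEMBC
     source.  35 doubles = 280 bytes, a compile-time constant.  */
  #define NUM_VALUES 35

  double sum_out[NUM_VALUES], values[NUM_VALUES];

  void
  nnet_pass (void)
  {
    for (int i = 0; i < NUM_VALUES; i++)
      {
        sum_out[i] = values[i]; /* becomes __builtin_memcpy (sum_out, values, 280) */
        values[i] = 0.0;        /* becomes __builtin_memset (values, 0, 280) */
      }
  }

Where -ftree-loop-distribute-patterns is in effect, the loop is
distributed into the two builtins, and the strategy table then has to
pick between unrolled_loop, rep movsb and a libcall for the constant
280-byte size.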
> > > As a result, "rep stosb" is generated only when 128 < data size < 256
> > > with -mno-sse.
>
> > > > Do you have some data showing blocks of size 8...256 to be faster
> > > > with rep1 compared to an unrolled loop, for perhaps more real-world
> > > > benchmarks?
> > >
> > > "rep movsb" isn't generated with my patch in this case since
> > > MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with
> > > XMM registers.
>
> OK, so I guess:
>
> {libcall,
>  {{256, rep_1_byte, true},
>   {256, unrolled_loop, false},
>   {-1, libcall, false}}},
> {libcall,
>  {{256, rep_1_loop, true},
>   {256, unrolled_loop, false},
>   {-1, libcall, false}}}};
>
> may still perform better, but the difference between loop and unrolled
> loop is within a 10% margin.
>
> So I guess the patch is OK, and we should look into cleaning up the
> descriptors. I can make a patch for that once I understand the logic
> above.

I am checking in my patch.  We can improve it for GCC 12.  We will also
revisit:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90773

for GCC 12.

Thanks.

--
H.J.
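
(For reference: the descriptor shorthand sketched above corresponds to
the stringop_algs type in the i386 backend, where the one-byte rep
algorithm is spelled rep_prefix_1_byte.  A sketch in the style of
gcc/config/i386/x86-tune-costs.h follows; the entry values are
illustrative, not the committed ones.)

  /* From gcc/config/i386/i386.h: for a known size, the SIZE array is
     walked in order until MAX is greater than the size estimate (-1
     means infinity), and the corresponding ALG is used; UNKNOWN_SIZE
     is used otherwise.  When NOALIGN is true, the code that aligns
     the block first is skipped.  */
  struct stringop_algs
  {
    const enum stringop_alg unknown_size;
    const struct stringop_strategy {
      const int max;
      const enum stringop_alg alg;
      int noalign;
    } size [MAX_STRINGOP_ALGS];
  };

  /* An entry of the kind discussed above, with illustrative values:
     up to 256 bytes prefer "rep movsb" (subject to the TARGET macro
     that restricts it to constant sizes), fall back to a loop, and
     call the library beyond that.  */
  static stringop_algs example_memcpy[2] = {
    {libcall, {{256, rep_prefix_1_byte, true},
               {256, loop, false},
               {-1, libcall, false}}},
    {libcall, {{256, rep_prefix_1_byte, true},
               {256, loop, false},
               {-1, libcall, false}}}};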