Hi,

When examining the performance of some test cases on s390, I realized that we could do better for constructs like 2-byte memcpys or 2-byte/4-byte memsets. Due to some s390-specific architectural properties, we could be faster by, e.g., avoiding excessive unrolling and using dedicated memory instructions (or similar).
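For illustration, the constructs in question are loops of roughly the following shape. This is only a minimal sketch: the function names (fill16, fill32, copy16, example_usage) are hypothetical and the code is not taken from the actual test cases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* NOTE: all names below are hypothetical, chosen for illustration only. */

/* 2-byte "memset": store the same 16-bit value n times. */
void fill16(uint16_t *dst, uint16_t val, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = val;
}

/* 4-byte "memset": store the same 32-bit value n times. */
void fill32(uint32_t *dst, uint32_t val, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = val;
}

/* 2-byte "memcpy": element-wise copy of n 16-bit values.
   The restrict qualifiers state that dst and src do not overlap. */
void copy16(uint16_t *restrict dst, const uint16_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Small usage check of the three loops above. */
void example_usage(void)
{
    uint16_t a[4], b[4];
    uint32_t c[2];

    fill16(a, 0xABCD, 4);
    copy16(b, a, 4);
    fill32(c, 0xDEADBEEFu, 2);

    assert(b[0] == 0xABCD && b[3] == 0xABCD);
    assert(c[0] == 0xDEADBEEFu && c[1] == 0xDEADBEEFu);
}
```

The loops are written as plain element-wise stores on purpose, since the question is how the compiler should recognize and lower exactly this kind of source.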
For 1-byte memset/memcpy, the builtin functions provide a straightforward way to achieve this. At first sight, it seemed possible to extend tree-loop-distribution.c to include the additional variants we need. However, multibyte memsets/memcpys are not covered by the C standard, so I'm unsure whether such an approach is preferable or whether there are more idiomatic ways or places to add the functionality.

The same question goes for 2-byte strlen. I didn't see a recognition pattern for strlen (apart from optimizations based on known string lengths in tree-ssa-strlen.c). Would it make sense to add strlen recognition and, subsequently, handling for 2-byte strlen? The situation might of course be more complicated than for memset because of encodings etc. My snippet in question used a fixed-length encoding of 2 bytes, however.

Another simple idea to tackle this would be a peephole optimization, but I'm not sure whether this is really feasible for something like memset. Wouldn't the peephole have to be recursive then?

Regards
 Robin