On Thu, Jun 04, 2015 at 09:06:40PM +0100, Richard Earnshaw wrote:
> On 04/06/15 20:57, Jakub Jelinek wrote:
> > On Thu, Jun 04, 2015 at 06:36:33PM +0200, Ondřej Bílka wrote:
> >> On Thu, Jun 04, 2015 at 04:01:50PM +0000, Joseph Myers wrote:
> >>> On Thu, 4 Jun 2015, Richard Earnshaw wrote:
> >>>
> >>>>> Change that into
> >>>>>
> >>>>> int foo(char *s)
> >>>>> {
> >>>>>   int l = strlen (s);
> >>>>>   char *p = memchr (s, 'a', l);
> >>>>>   return p+l;
> >>>>> }
> >
> >> And Joseph you shouldn't restrict yourself only to values that are
> >> present in variables to cover case where its implicit one from strcpy
> >> converted to stpcpy.
> >
> > memchr isn't handled in that pass right now at all, of course it could be
> > added, shouldn't be really hard.  Feel free to file a PR and/or write
> > a patch.
> >
> > As for e.g. the inlining of the first (or a few more) iterations of strcmp
> > etc., that is certainly something that can be done in the compiler too and
> > the compiler should have much better information whether to do it or not,
> > as it shouldn't be done for -Os, or for basic blocks or functions predicted
> > cold, because it enlarges the code size quite a lot.
>
> You should also be wary of making the strings passed to the library
> functions not naturally aligned.  That can result in the code in the
> library having to take a much slower path to regain any alignment done
> by peeling the initial iteration(s).  And if you're going to pass the
> full string(s) anyway, then you'd better be pretty sure that doing the
> check before the call is really likely to succeed.
>
You can be, for these functions. If that causes a problem, it's because
your implementation is already slow. Peeling is the wrong approach. With
unaligned loads, just check whether either argument would cross a page
boundary and, if neither does, do an unaligned load; no peeling is
necessary. For aligned loads it's a bit more tricky: you could emulate
the initial unaligned load with shifts, again with no peeling necessary.
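To make the page-boundary check concrete, here is a minimal C sketch of a strcmp-like loop that uses unaligned 8-byte loads and never peels to reach alignment. Everything in it is an assumption for illustration: the 4 KiB page size, the names SKETCH_PAGE_SIZE, load8_stays_in_page, load8, has_zero_byte and sketch_strcmp, and the 8-byte window. It is not code from any libc. It also reads up to 7 bytes past the terminating NUL within the same page, which real implementations do in assembly but which ISO C does not guarantee to be valid.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Nonzero if an 8-byte load starting at p stays inside p's page. */
static int load8_stays_in_page(const void *p)
{
    return ((uintptr_t)p & (SKETCH_PAGE_SIZE - 1)) <= SKETCH_PAGE_SIZE - 8;
}

/* Unaligned 8-byte load expressed portably through memcpy. */
static uint64_t load8(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Nonzero if any of the 8 bytes in v is zero. */
static int has_zero_byte(uint64_t v)
{
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

/* strcmp-like comparison: word steps where safe, byte steps otherwise. */
int sketch_strcmp(const char *s1, const char *s2)
{
    const unsigned char *a = (const unsigned char *)s1;
    const unsigned char *b = (const unsigned char *)s2;

    for (;;) {
        /* Fast path: both loads stay in their pages, so they cannot fault. */
        if (load8_stays_in_page(a) && load8_stays_in_page(b)) {
            uint64_t x = load8(a);
            uint64_t y = load8(b);
            if (x == y && !has_zero_byte(x)) {
                a += 8;
                b += 8;
                continue;
            }
        }
        /* Slow path: resolve this 8-byte window one byte at a time.
           It stops at the first difference or NUL, so it never reads
           past the end of either string. */
        for (int i = 0; i < 8; i++) {
            if (a[i] != b[i] || a[i] == 0)
                return (a[i] > b[i]) - (a[i] < b[i]);
        }
        a += 8;
        b += 8;
    }
}
```

The byte loop only runs for the window that contains the answer and for windows that touch the end of a page, so the alignment of s1 and s2 does not affect which path the common case takes.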
As for strcmp, the profile below shows that you are optimizing cold code.
As the large majority, about 90%, of strings are misaligned, both
absolutely and relative to each other, that is the case you need to
optimize. The overall statistics follow; they are a bit skewed because
make causes the majority of strcmp calls, but alignment is rare for most
other programs as well. A more detailed strcmp profile is in my previous
mail.

average size 21.8 calls 71742523 succeed 99.2% latencies -14.2 -14.0
s1    aligned to 4 bytes 26.6% aligned to 8 bytes 14.5% aligned to 16 bytes  8.1%
s2    aligned to 4 bytes 49.9% aligned to 8 bytes 41.9% aligned to 16 bytes 37.2%
s1-s2 aligned to 4 bytes 23.0% aligned to 8 bytes 11.2% aligned to 16 bytes  5.5%
n <=  0:  32.4%
n <=  1:  34.7%
n <=  2:  35.6%
n <=  3:  36.2%
n <=  4:  36.2%
n <=  8:  36.5%
n <= 16:  38.9%
n <= 32:  51.3%
n <= 64: 100.0%

As on aarch you use a byte-by-byte loop for 94.5% of inputs (all but the
5.5% where s1-s2 is aligned to 16 bytes), it's clear that you have a slow
implementation.
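For readers unfamiliar with the "s1-s2 aligned" rows, here is a small C sketch of how such a classification could be computed. It assumes that "s1-s2 aligned to N bytes" means the two pointers have the same offset modulo N, so that aligning one also aligns the other. The helper names is_aligned and mutually_aligned are hypothetical and are not taken from the profiler that produced the numbers above.

```c
#include <stdint.h>

/* Nonzero if p is a multiple of n; n must be a power of two. */
static int is_aligned(const void *p, uintptr_t n)
{
    return ((uintptr_t)p & (n - 1)) == 0;
}

/* Nonzero if s1 and s2 have the same offset modulo n (n a power of two),
   i.e. the difference s1 - s2 is a multiple of n. Unsigned wraparound
   keeps the low bits correct even when s1 < s2. */
static int mutually_aligned(const void *s1, const void *s2, uintptr_t n)
{
    return (((uintptr_t)s1 - (uintptr_t)s2) & (n - 1)) == 0;
}
```

Under that reading, an implementation that only takes its word-at-a-time path when mutually_aligned(s1, s2, 16) holds would fall back to its slow path for every other input.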