"Nathan Sidwell" <nat...@acm.org> wrote: > On 8/14/20 12:43 PM, Stefan Kanthak wrote: >> Hi @ll, >> >> in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>, >> Matt Godbolt used the function >> >> | bool isWhitespace(char c) >> | { >> | return c == ' ' >> | || c == '\r' >> | || c == '\n' >> | || c == '\t'; >> | } >> >> as an example, for which GCC 9.1 emits the following assembly for AMD64 >> processors (see <https://godbolt.org/z/acm19_conds>): >> >> | xor eax, eax ; result = false >> | cmp dil, 32 ; is c > 32 >> | ja .L4 ; if so, exit with false >> | movabs rax, 4294977024 ; rax = 0x100002600 >> | shrx rax, rax, rdi ; rax >>= c >> | and eax, 1 ; result = rax & 1 >> |.L4: >> | ret >> >> This code is but not optimal! > > What evidence do you have that your alternative sequence performs > better?
45+ years experience in writing assembly code! > Have you benchmarked it? Of course! Did you? I didn't include the numbers in my initial post since I don't have a processor which supports BMI2 and thus can't run the original code. I benchmarked the following equivalent code (input character is in ECX instead of EDI): GCC: x64: x86: xor eax, eax movabs rax, 100002600h mov eax, 2600h cmp cl, 32 test cl, cl ja L4 setnz al movabs rax, 100002600h shr rax, cl shr eax, cl shr rax, cl xor edx, edx xor edx, edx and eax, 1 cmp ecx, 33 cmp edx, 33 setb dl setb dl L4: and eax, edx and eax, edx ret ret ret Left column 1 billion sequential characters for (int i=1000000000; i; --i) ...(i); right column 1 billion random characters, in cycles per character: GCC: 2.4 3.4 x64: 3.0 2.5 x86: 4.3 4.5 > (I tried, but your code doesn't assemble) Wrong: YOUR code doesn't assemble, mine does! And it works well too. > It is more instructions Because I dared to show code for the old(er) i386 alias x86 processor, not for the AMD64 alias x86_64. I expect people commenting on assembly to understand (just a little bit) of it and to get the idea of the code I posted first, then eventually be able to come up with the following x86_64 code (without a fancy SHRX, i.e. for ALL AMD64 processors, not just those supporting BMI2): mov ecx, edi movabs rax, 100002600h ; rax = (1 << ' ') | (1 << '\r') | (1 << '\n') | (1 << '\t') shr rax, cl ; rax >>= c xor edi, edi cmp ecx, 33 ; CF = c <= ' ' .if 0 adc edi, edi ; edi = (c <= ' ') .else setb dil .endif and eax, edi ; result = rax & (c <= ' ') ret If someone (or GCC) but really insists to (ab)use SHRX: xor ecx, ecx cmp dil, 33 ; is !(c > 32) setb cl movabs rax, 4294977024 ; rax = 0x100002600 shrx rax, rax, rdi ; rax >>= c and eax, ecx ; result = rax & (c <= ' ') ret Now start counting: first the instructions, then the cycles lost due to wrong branch prediction (hint: 0 for my code). > and cannot speculate past the setnz OUCH! Unfortunately, AMD64 and i386 processors are not THAT retarded. > (As I understand it, x86_64 speculates branch instructions, but > doesn't speculate cmov -- so perversely branches are faster!) There's no CMOV in my code. What are you trying to demonstrate? JFTR: how much speculative executed instructions are discarded if the branch prediction is wrong? >> The following equivalent and branchless code works on i386 too, >> it needs neither an AMD64 processor nor the SHRX instruction, >> which is not available on older processors: > >> >> mov ecx, edi >> mov eax, 2600h ; eax = (1 << '\r') | (1 << '\n') | (1 << >> '\t') >> test cl, cl >> setnz al ; eax |= (c != '\0') >> shr eax, cl ; eax >>= (c % ' ') > > ^^ operand type mismatch on this instruction I recommend to fix YOUR typo! I bet you wrote "shr eax, c" instead of "shr eax, cl" there. >> xor edx, edx >> cmp ecx, 33 ; CF = c <= ' ' >> adc edx, edx ; edx = (c <= ' ') >> and eax, edx >> ret Stefan