https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117266
--- Comment #6 from H. Peter Anvin <hpa at zytor dot com> --- On 10/22/24 13:53, pinskia at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117266 > > --- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> --- > Also you are mixing two different issues together. > First is the multiply N*N->2N which is fixed with _BitInt by casting; though > maybe there should be a way added to get what the precission of the _BitInt > type is. Yes that is true. The abstraction problem is this: "how do I know what type to use for the type that is twice the size of foo_t, and how do I generate that type?" If that is possible with _BitInt (does something like _BitInt(sizeof(foo_t)*CHAR_BIT) work?) then that's *awesome*... > Second is the division which is solved by _BitInt by casting too. > By doing: > (((larger_type)low) | (((larger_type)high)<<shift))/divisor . > > But most of the targets don't support a full division here and your > __builtin_div2 won't actually be using the instruction; that is even for > x86_64, you only get 128/64->64 rather than 128/64->128 And THAT is exactly the point: *the two aren't equivalent.* Only the programmer knows when this instruction is usable, and for performance reasons, you *really, really* want to be able to use it when you as the programmer know, a priori, that you can. The closest equivalent for an *unsigned* value (signed is more complex!) would probably we something like: uint64_t div2(uint64_t hi, uint64_t lo, uint64_t divisor) { unsigned __int128 dividend = ((unsigned __int128)hi << 64) + lo; unsigned __int128 qq = dividend / divisor; if (qq >> 64) raise(SIGFPE); /* Is this even portable? */ return qq; } gcc produces: 0000000000000000 <div2>: 0: 53 push %rbx 1: 48 89 d1 mov %rdx,%rcx 4: 31 db xor %ebx,%ebx 6: 48 89 fa mov %rdi,%rdx 9: 48 89 f7 mov %rsi,%rdi c: 48 89 d6 mov %rdx,%rsi f: 48 89 ca mov %rcx,%rdx 12: 48 89 d9 mov %rbx,%rcx 15: e8 00 00 00 00 call 1a <div2+0x1a> 16: R_X86_64_PLT32 __udivti3-0x4 1a: 48 89 c3 mov %rax,%rbx 1d: 48 85 d2 test %rdx,%rdx 20: 75 0e jne 30 <div2+0x30> 22: 48 89 d8 mov %rbx,%rax 25: 5b pop %rbx 26: c3 ret 27: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 2e: 00 00 30: bf 08 00 00 00 mov $0x8,%edi 35: e8 00 00 00 00 call 3a <div2+0x3a> 36: R_X86_64_PLT32 raise-0x4 3a: 48 89 d8 mov %rbx,%rax 3d: 5b pop %rbx 3e: c3 ret ... as opposed to ... 0000000000000040 <div2_asm>: 40: 48 89 d1 mov %rdx,%rcx 43: 48 89 f0 mov %rsi,%rax 46: 48 89 fa mov %rdi,%rdx 49: 48 f7 f0 div %rax 4c: c3 ret Does this make more sense? If you can find a way to generate "div" here rather than a call to __udivti3 I would very much like to see it... -hpa