https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449
          Bug ID: 114449
         Summary: bswap64 not optimized
         Product: gcc
         Version: 13.2.0
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: middle-end
        Assignee: unassigned at gcc dot gnu.org
        Reporter: pali at kernel dot org
Target Milestone: ---

https://godbolt.org/z/dc3br9dYT

gcc 13.2 with -O3 does not recognize a straightforward loop implementation of
bswap64 and generates unoptimized code for it.

uint64_t bswap64_1(uint64_t num) {
    uint64_t ret = 0;
    for (size_t i = 0; i < sizeof(num); i++) {
        ret |= ((num >> (8*(sizeof(num)-1-i))) & 0xff) << (8*i);
    }
    return ret;
}

Manually unrolling the loop makes gcc produce optimized code with a single
"bswap" instruction on x86-64.

uint64_t bswap64_2(uint64_t num) {
    uint64_t ret = 0;
    ret |= (((num >> 56) & 0xff) << 0);
    ret |= (((num >> 48) & 0xff) << 8);
    ret |= (((num >> 40) & 0xff) << 16);
    ret |= (((num >> 32) & 0xff) << 24);
    ret |= (((num >> 24) & 0xff) << 32);
    ret |= (((num >> 16) & 0xff) << 40);
    ret |= (((num >> 8) & 0xff) << 48);
    ret |= (((num >> 0) & 0xff) << 56);
    return ret;
}

Adding -funroll-all-loops when compiling the first example does not help; gcc
still produces unoptimized code.
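
For anyone reproducing this, the following is a minimal sketch (not part of the original report) that checks the loop form against the GCC builtin __builtin_bswap64, which may serve as a workaround until the loop pattern is recognized. The names bswap64_loop, bswap64_builtin and check_bswap64_variants are hypothetical and chosen only for illustration; the sample values are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Loop form from the report: moves byte (7 - i) of num into byte i of ret. */
uint64_t bswap64_loop(uint64_t num)
{
    uint64_t ret = 0;
    for (size_t i = 0; i < sizeof(num); i++) {
        ret |= ((num >> (8 * (sizeof(num) - 1 - i))) & 0xff) << (8 * i);
    }
    return ret;
}

/* Possible workaround: ask for the byte swap directly via the GCC builtin. */
uint64_t bswap64_builtin(uint64_t num)
{
    return __builtin_bswap64(num);
}

/* Hypothetical helper: asserts that both variants agree on a few inputs. */
void check_bswap64_variants(void)
{
    static const uint64_t samples[] = {
        0x0000000000000000ULL,
        0x0102030405060708ULL,
        0xffffffffffffffffULL,
        0x00000000deadbeefULL,
    };
    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        assert(bswap64_loop(samples[i]) == bswap64_builtin(samples[i]));
    }
}
```

This only confirms that the two forms compute the same value; whether a given gcc version emits a single "bswap" for either one still has to be checked in the generated assembly.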