On Thu, Apr 03, 2025 at 12:14:04PM -0400, Yury Norov wrote: > On Thu, Apr 03, 2025 at 10:39:03PM +0800, Kuan-Wei Chiu wrote: > > On Tue, Mar 25, 2025 at 12:43:25PM -0700, H. Peter Anvin wrote: > > > On 3/23/25 08:16, Kuan-Wei Chiu wrote: > > > > > > > > Interface 3: Multiple Functions > > > > Description: bool parity_odd8/16/32/64() > > > > Pros: No need for explicit casting; easy to integrate > > > > architecture-specific optimizations; except for parity8(), all > > > > functions are one-liners with no significant code duplication > > > > Cons: More functions may increase maintenance burden > > > > Opinions: Only I support this approach > > > > > > > > > > OK, so I responded to this but I can't find my reply or any of the > > > followups, so let me go again: > > > > > > I prefer this option, because: > > > > > > a. Virtually all uses of parity is done in contexts where the sizes of the > > > items for which parity is to be taken are well-defined, but it is *really* > > > easy for integer promotion to cause a value to be extended to 32 bits > > > unnecessarily (sign or zero extend, although for parity it doesn't make > > > any > > > difference -- if the compiler realizes it.) > > > > > > b. It makes it easier to add arch-specific implementations, notably using > > > __builtin_parity on architectures where that is known to generate good > > > code. > > > > > > c. For architectures where only *some* parity implementations are > > > fast/practical, the generic fallbacks will either naturally synthesize > > > them > > > from components via shift-xor, or they can be defined to use a larger > > > version; the function prototype acts like a cast. > > > > > > d. If there is a reason in the future to add a generic version, it is > > > really > > > easy to do using the size-specific functions as components; this is > > > something we do literally all over the place, using a pattern so common > > > that > > > it, itself, probably should be macroized: > > > > > > #define parity(x) \ > > > ({ \ > > > typeof(x) __x = (x); \ > > > bool __y; \ > > > switch (sizeof(__x)) { \ > > > case 1: \ > > > __y = parity8(__x); \ > > > break; \ > > > case 2: \ > > > __y = parity16(__x); \ > > > break; \ > > > case 4: \ > > > __y = parity32(__x); \ > > > break; \ > > > case 8: \ > > > __y = parity64(__x); \ > > > break; \ > > > default: \ > > > BUILD_BUG(); \ > > > break; \ > > > } \ > > > __y; \ > > > }) > > > > > Thank you for your detailed response and for explaining the rationale > > behind your preference. The points you outlined in (a)–(d) all seem > > quite reasonable to me. > > > > Yury, > > do you have any feedback on this? > > Thank you. > > My feedback to you: > > I asked you to share any numbers about each approach. Asm listings, > performance tests, bloat-o-meter. But you did nothing or very little > in that department. You move this series, and it means you should be > very well aware of alternative solutions, their pros and cons. > It seems the concern is that I didn't provide assembly results and performance numbers. While I believe that listing these numbers alone cannot prove which users really care about parity efficiency, I have included the assembly results and my initial observations below. Some differences, like mov vs movzh, are likely difficult to measure.
Compilation on x86-64 using GCC 14.2 with O2 Optimization: Link to Godbolt: https://godbolt.org/z/EsqPMz8cq For u8 Input: - #2 and #3 generate exactly the same assembly code, while #1 replaces one `mov` instruction with `movzh`, which may slightly slow down the performance due to zero extension. - Efficiency: #2 = #3 > #1 For u16 Input: - As with u8 input, #1 performs an unnecessary zero extension, while #3 replaces one of the `shr` instructions in #2 with a `mov`, making it slightly faster. - Efficiency: #3 > #2 > #1 For u32 Input: - #1 has an additional `mov` instruction compared to #2, and #2 has an extra `shr` instruction compared to #3. - Efficiency: #3 > #2 > #1 For u64 Input: - #1 and #2 generate the same code, but #3 has one less `shr` instruction compared to the others. - Efficiency: #3 > #1 = #2 --- Adding -m32 Flag to View Assembly for 32-bit Machine: Link to Godbolt: https://godbolt.org/z/GrPa86Eq5 For u8 Input: - #2 and #3 generate identical assembly code, whereas #1 has additional `mov`, `shr`, and `push/pop` instructions. - Efficiency: #2 = #3 > #1 For u16 Input: - #1 uses a lot of `xmm` register operations, making it slower than #2 and #3. Additionally, #2 has an extra `shr` instruction compared to #3. - Efficiency: #3 > #2 > #1 For u32 Input: - #1 again uses a lot of `xmm` register operations, so it is slower than #2 and #3, and #2 has an additional `shr` instruction compared to #3. - Efficiency: #3 > #2 > #1 For u64 Input: - Both #1 and #2 use `xmm` register operations, but #1 has a few extra `movdqa` instructions. #3 is more concise, using a few `shr`, `xor`, and `mov` instructions to complete the operation. - Efficiency: #3 > #2 > #1 Regards, Kuan-Wei