Re: Sourceware @ Conservancy - Year One
Hi Maxim,

On Thu, May 30, 2024 at 12:18:38PM +0400, Maxim Kuvyrkov via Overseers wrote:
> On May 29, 2024, at 23:02, Mark Wielaard wrote:
> > And a special thanks to ARM who have been using
> > https://patchwork.sourceware.org/ to provide a pre-commit testing
> > service for various projects.
>
> Thanks for the great update!
>
> Minor nitpick: pre-commit testing for AArch64 and AArch32
> architectures is provided by Linaro Toolchain Working Group (Linaro
> TCWG).

Sorry for getting the credit wrong. Proper credit is important. And in
this case I really should have known. All pre-commit emails start with
[Linaro-TCWG-CI].

I did think about just mentioning the individuals who made things
happen. But then getting individual names wrong is even worse than
getting corporation names wrong...

Thanks Maxim for making the Linaro Toolchain Working Group pre-commit
testing for AArch64 and AArch32 happen!

Cheers,

Mark
Re: Is fcommon related with performance optimization logic?
On 30/05/2024 04:26, Andrew Pinski via Gcc wrote:
> On Wed, May 29, 2024 at 7:13 PM 赵海峰 via Gcc wrote:
> > Dear Sir/Madam,
> >
> > We found that running on intel SPR UnixBench compiled with gcc 10.3
> > performs worse than with gcc 8.5 for dhry2reg benchmark.
> >
> > I found it related with -fcommon option which is disabled in 10.3 by
> > default. Fcommon will make global variables addresses in special
> > order in bss section(watching by nm -n) whatever they are defined in
> > source code.
> >
> > We are wondering if fcommon has some special performance
> > optimization process?
> >
> > (I also post the subject to gcc-help. Hope to get some suggestion in
> > this mail list. Sorry for bothering.)
>
> This was already filed as
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114532 .
> But someone needs to go in and do more analysis of what is going
> wrong. The biggest difference for x86_64 is how the variables are
> laid out and by who (the compiler or the linker). There is some
> notion that -fno-common increases the number of L1-dcache-load-misses
> and that points to the layout of the variable differences causing the
> difference. But nobody has gone and seen which variables are laid out
> differently and why. I am suspecting that small changes in the
> code/variables would cause layout differences which will cause the
> cache misses which can cause the performance which is almost all by
> accident.
> I suspect adding -fdata-sections will cause another performance
> difference here too. And there is not much GCC can do about this
> since data layout is "hard" to do to get the best performance always.

(I am most familiar with embedded systems with static linking, rather
than dealing with GOT and other aspects of linking on big systems.)

I think -fno-common should allow -fsection-anchors to do a much better
job. If symbols are put in the common section, the compiler does not
know their relative position until link time. But if they are in bss or
data sections (with or without -fdata-sections), it can at least use
anchors to access data in the translation unit that defines the data
objects.

David

> Thanks,
> Andrew Pinski
>
> > Best regards.
> > Clark Zhao
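
To make the layout question above concrete, here is a minimal C sketch of how one might compare where uninitialized globals end up under the two options. It is not taken from UnixBench or from the bug report; the three global names and both helper functions are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Tentative definitions: file-scope objects with no initializer.
   With -fcommon they are emitted as common symbols and the linker
   assigns their addresses; with -fno-common (the default since GCC 10)
   the compiler places them in .bss of this translation unit.
   All three names are hypothetical, chosen only for this sketch. */
int  hypothetical_counter;
char hypothetical_buffer[64];
long hypothetical_total;

/* Signed byte distance from `from` to `to`, computed on integer
   addresses so that unrelated objects can be compared. */
ptrdiff_t byte_distance(const void *from, const void *to)
{
    assert(from != NULL && to != NULL);
    return (ptrdiff_t)((intptr_t)to - (intptr_t)from);
}

/* Print each global's address and its offset from the first one. */
void print_layout(void)
{
    const void *base = &hypothetical_counter;

    printf("counter %p %td\n", (void *)&hypothetical_counter,
           byte_distance(base, &hypothetical_counter));
    printf("buffer  %p %td\n", (void *)hypothetical_buffer,
           byte_distance(base, hypothetical_buffer));
    printf("total   %p %td\n", (void *)&hypothetical_total,
           byte_distance(base, &hypothetical_total));
}
```

Calling print_layout() from main in one build made with -fcommon and one made with -fno-common, then diffing the output, shows whether the order and spacing of the globals changed. Running nm on the object file produced by gcc -c gives the same information without executing anything: common symbols are marked C, symbols already placed in .bss are marked B.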
gcc-12-20240530 is now available
Snapshot gcc-12-20240530 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/12-20240530/
and on various mirrors, see https://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 12 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch releases/gcc-12 revision e26f16424f6279662efb210bc87c77148e956fed

You'll find:

 gcc-12-20240530.tar.xz               Complete GCC

  SHA256=e4b060b7f3684cee039d7aed953f57ac6b4c07b077aac1547cd790b503d145fe
  SHA1=5291fdf96726bb19f99aec4fe83abca2cbaa0096

Diffs from 12-20240523 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-12
link is updated and a message is sent to the gcc list. Please do not use
a snapshot before it has been announced that way.
Question about generating vpmovzxbd instruction without using the interfaces in immintrin.h
Hi,

I've recently been trying to hand-write code to trigger automatic
vectorization optimizations in GCC on Intel x86 machines (without
using the interfaces in immintrin.h), but I'm running into a problem
where I can't seem to get the concise `vpmovzxbd` or similar
instructions.

My requirement is to convert 8 `uint8_t` elements to `int32_t` type
and print the output. If I use the interface (_mm256_cvtepu8_epi32) in
immintrin.h, the code is as follows:

int immintrin () {
    int size = 1, offset = 3;
    uint8_t* a = malloc(sizeof(char) * size);

    __v8si b = (__v8si)_mm256_cvtepu8_epi32(*(__m128i *)(a + offset));

    for (int i = 0; i < 8; i++) {
        printf("%d\n", b[i]);
    }
}

After compiling with -mavx2 -O3, you can get concise and efficient
instructions. (You can see it here: https://godbolt.org/z/8ojzdav47)

But if I do not use this interface and instead use a for-loop or the
`__builtin_convertvector` interface provided by GCC, I cannot achieve
the above effect. The code is as follows:

typedef uint8_t v8qiu __attribute__ ((__vector_size__ (8)));
int forloop () {
    int size = 1, offset = 3;
    uint8_t* a = malloc(sizeof(char) * size);

    v8qiu av = *(v8qiu *)(a + offset);
    __v8si b = {};
    for (int i = 0; i < 8; i++) {
        b[i] = (a + offset)[i];
    }

    for (int i = 0; i < 8; i++) {
        printf("%d\n", b[i]);
    }
}

int builtin_cvt () {
    int size = 1, offset = 3;
    uint8_t* a = malloc(sizeof(char) * size);

    v8qiu av = *(v8qiu *)(a + offset);
    __v8si b = __builtin_convertvector(av, __v8si);

    for (int i = 0; i < 8; i++) {
        printf("%d\n", b[i]);
    }
}

The instructions generated by both functions are redundant and
complex, and are quite difficult to read compared to calling
`_mm256_cvtepu8_epi32` directly. (You can see it here as well:
https://godbolt.org/z/8ojzdav47)

What I want to ask is: How should I write the source code to get
assembly instructions similar to directly calling
_mm256_cvtepu8_epi32?

Or would it be easier if I modified the GIMPLE directly? But it seems
that there is no relevant expression or interface directly
corresponding to `vpmovzxbd` in GIMPLE.

Thanks
Hanke Zhang
Re: Question about generating vpmovzxbd instruction without using the interfaces in immintrin.h
On Fri, May 31, 2024 at 10:58 AM Hanke Zhang via Gcc wrote:
>
> Hi,
> I've recently been trying to hand-write code to trigger automatic
> vectorization optimizations in GCC on Intel x86 machines (without
> using the interfaces in immintrin.h), but I'm running into a problem
> where I can't seem to get the concise `vpmovzxbd` or similar
> instructions.
>
> My requirement is to convert 8 `uint8_t` elements to `int32_t` type
> and print the output. If I use the interface (_mm256_cvtepu8_epi32) in
> immintrin.h, the code is as follows:
>
> int immintrin () {
>     int size = 1, offset = 3;
>     uint8_t* a = malloc(sizeof(char) * size);
>
>     __v8si b = (__v8si)_mm256_cvtepu8_epi32(*(__m128i *)(a + offset));
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
> }
>
> After compiling with -mavx2 -O3, you can get concise and efficient
> instructions. (You can see it here: https://godbolt.org/z/8ojzdav47)
>
> But if I do not use this interface and instead use a for-loop or the
> `__builtin_convertvector` interface provided by GCC, I cannot achieve
> the above effect. The code is as follows:
>
> typedef uint8_t v8qiu __attribute__ ((__vector_size__ (8)));
> int forloop () {
>     int size = 1, offset = 3;
>     uint8_t* a = malloc(sizeof(char) * size);
>
>     v8qiu av = *(v8qiu *)(a + offset);
>     __v8si b = {};
>     for (int i = 0; i < 8; i++) {
>         b[i] = (a + offset)[i];
>     }
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
> }
>
> int builtin_cvt () {
>     int size = 1, offset = 3;
>     uint8_t* a = malloc(sizeof(char) * size);
>
>     v8qiu av = *(v8qiu *)(a + offset);
>     __v8si b = __builtin_convertvector(av, __v8si);
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
> }
>
> The instructions generated by both functions are redundant and
> complex, and are quite difficult to read compared to calling
> `_mm256_cvtepu8_epi32` directly. (You can see it here as well:
> https://godbolt.org/z/8ojzdav47)
>
> What I want to ask is: How should I write the source code to get
> assembly instructions similar to directly calling
> _mm256_cvtepu8_epi32?
>
> Or would it be easier if I modified the GIMPLE directly? But it seems
> that there is no relevant expression or interface directly
> corresponding to `vpmovzxbd` in GIMPLE.

https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652484.html

We're working on the patch to optimize __builtin_convertvector, after
that it can be as optimal as intel intrinsic.

>
> Thanks
> Hanke Zhang

--
BR,
Hongtao
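
For reference, a minimal sketch of the __builtin_convertvector approach discussed in this thread, written so that the widening can be checked in isolation. The type names u8x8 and i32x8 and the function widen_u8x8 are hypothetical, not from the original post. The sketch assumes the caller supplies 8 readable bytes (the posted code allocates 1 byte and reads at offset 3), and it loads through memcpy rather than a pointer cast to sidestep alignment and aliasing concerns. It only shows the semantics; whether a given GCC version turns it into a single `vpmovzxbd` with -mavx2 -O3 is exactly what the patch linked above is about, so check the generated assembly for the compiler in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* GCC/Clang generic vector types; both names are hypothetical. */
typedef uint8_t u8x8  __attribute__((vector_size(8)));
typedef int32_t i32x8 __attribute__((vector_size(32)));

/* Zero-extend 8 bytes at `src` into 8 int32_t values at `dst`.
   The source elements are unsigned, so the conversion zero-extends. */
void widen_u8x8(const uint8_t *src, int32_t *dst)
{
    u8x8 narrow;
    i32x8 wide;

    assert(src != NULL && dst != NULL);

    /* memcpy avoids dereferencing a possibly misaligned vector pointer. */
    memcpy(&narrow, src, sizeof narrow);

    /* Element-wise conversion uint8_t -> int32_t. */
    wide = __builtin_convertvector(narrow, i32x8);

    memcpy(dst, &wide, sizeof wide);
}
```

Passing the result out through a pointer instead of returning the 32-byte vector keeps the function's ABI independent of whether AVX is enabled, which makes it easier to compare builds with and without -mavx2.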