Re: Sourceware @ Conservancy - Year One

2024-05-30 Thread Mark Wielaard
Hi Maxim,

On Thu, May 30, 2024 at 12:18:38PM +0400, Maxim Kuvyrkov via Overseers wrote:
> > On May 29, 2024, at 23:02, Mark Wielaard wrote:
> > And a special thanks to ARM who have been using
> > https://patchwork.sourceware.org/ to provide a pre-commit testing
> > service for various projects.
> 
> Thanks for the great update!
> 
> Minor nitpick: pre-commit testing for AArch64 and AArch32
> architectures is provided by Linaro Toolchain Working Group (Linaro
> TCWG).

Sorry for getting the credit wrong. Proper credit is important. And in
this case I really should have known. All pre-commit emails start with
[Linaro-TCWG-CI]. I did think about just mentioning the individuals
who made things happen. But then getting individual names wrong is
even worse than getting corporation names wrong...

Thanks Maxim for making the Linaro Toolchain Working Group pre-commit
testing for AArch64 and AArch32 happen!

Cheers,

Mark


Re: Is fcommon related with performance optimization logic?

2024-05-30 Thread David Brown via Gcc

On 30/05/2024 04:26, Andrew Pinski via Gcc wrote:
> On Wed, May 29, 2024 at 7:13 PM 赵海峰 via Gcc wrote:
> > Dear Sir/Madam,
> >
> > We found that UnixBench, running on Intel SPR, performs worse on the
> > dhry2reg benchmark when compiled with gcc 10.3 than with gcc 8.5.
> >
> > I found that this is related to the -fcommon option, which is disabled
> > by default in 10.3. With -fcommon, global variables get their addresses
> > in the bss section in a particular order (as seen with nm -n),
> > regardless of where they are defined in the source code.
> >
> > We are wondering whether -fcommon involves some special performance
> > optimization?
> >
> > (I have also posted this question to gcc-help. I hope to get some
> > suggestions on this mailing list. Sorry for bothering you.)


> This was already filed as
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114532 . But someone
> needs to go in and do more analysis of what is going wrong. The
> biggest difference for x86_64 is how the variables are laid out and by
> whom (the compiler or the linker). There is some indication that
> -fno-common increases the number of L1-dcache-load-misses, which
> points to the difference in variable layout as the cause of the
> performance difference. But nobody has gone and checked which
> variables are laid out differently and why. I suspect that small
> changes in the code/variables would cause layout differences, which
> cause the cache misses, which in turn cause the performance
> difference; that is almost entirely accidental.
> I suspect adding -fdata-sections will cause another performance
> difference here too. And there is not much GCC can do about this, since
> it is "hard" to lay out data so that it always gets the best performance.



(I am most familiar with embedded systems with static linking, rather 
than dealing with GOT and other aspects of linking on big systems.)


I think -fno-common should allow -fsection-anchors to do a much better 
job.  If symbols are put in the common section, the compiler does not 
know their relative position until link time.  But if they are in bss or 
data sections (with or without -fdata-sections), it can at least use 
anchors to access data in the translation unit that defines the data 
objects.
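
A minimal sketch of the kind of translation unit this applies to, assuming nothing about the actual UnixBench sources: the variable and function names below are made up for illustration. The three globals are tentative definitions, so with -fcommon they become common symbols whose placement is left to the linker, while with -fno-common the compiler places them in .bss itself and (on targets that support -fsection-anchors) knows their relative offsets. The two helpers are only there to compare layouts between builds; the 64-byte cache line size is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tentative definitions (no initializer). With -fcommon these are emitted
   as common symbols and laid out by the linker; with -fno-common the
   compiler places them in .bss. Names are hypothetical. */
int hits;
int misses;
long total_bytes;

/* Touches all three objects in one function, which is where knowing
   their relative positions at compile time can help. */
long update_stats(int hit, long nbytes)
{
    assert(nbytes >= 0);
    if (hit)
        hits++;
    else
        misses++;
    total_bytes += nbytes;
    return total_bytes;
}

/* Byte distance from a to b, for comparing layouts across builds. */
ptrdiff_t byte_distance(const void *a, const void *b)
{
    return (ptrdiff_t)((intptr_t)b - (intptr_t)a);
}

/* Nonzero if both addresses fall in the same cache line,
   assuming 64-byte lines. */
int same_cache_line(const void *a, const void *b)
{
    return ((uintptr_t)a / 64) == ((uintptr_t)b / 64);
}
```

Building this once with -fcommon and once with -fno-common, and printing `byte_distance(&hits, &total_bytes)` or checking `nm -n` on the result, is one way to see whether the layout actually differs between the two builds.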


David



> Thanks,
> Andrew Pinski
>
> > Best regards.
> >
> > Clark Zhao




gcc-12-20240530 is now available

2024-05-30 Thread GCC Administrator via Gcc
Snapshot gcc-12-20240530 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/12-20240530/
and on various mirrors, see https://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 12 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch 
releases/gcc-12 revision e26f16424f6279662efb210bc87c77148e956fed

You'll find:

 gcc-12-20240530.tar.xz   Complete GCC

  SHA256=e4b060b7f3684cee039d7aed953f57ac6b4c07b077aac1547cd790b503d145fe
  SHA1=5291fdf96726bb19f99aec4fe83abca2cbaa0096

Diffs from 12-20240523 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-12
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Question about generating vpmovzxbd instruction without using the interfaces in immintrin.h

2024-05-30 Thread Hanke Zhang via Gcc
Hi,
I've recently been trying to hand-write code to trigger automatic
vectorization optimizations in GCC on Intel x86 machines (without
using the interfaces in immintrin.h), but I'm running into a problem
where I can't seem to get the concise `vpmovzxbd` or similar
instructions.

My requirement is to convert 8 `uint8_t` elements to `int32_t` type
and print the output. If I use the interface (_mm256_cvtepu8_epi32) in
immintrin.h, the code is as follows:

int immintrin () {
    int size = 1, offset = 3;
    uint8_t* a = malloc(sizeof(char) * size);

    __v8si b = (__v8si)_mm256_cvtepu8_epi32(*(__m128i *)(a + offset));

    for (int i = 0; i < 8; i++) {
        printf("%d\n", b[i]);
    }
}

After compiling with -mavx2 -O3, you can get concise and efficient
instructions. (You can see it here: https://godbolt.org/z/8ojzdav47)

But if I do not use this interface and instead use a for-loop or the
`__builtin_convertvector` interface provided by GCC, I cannot achieve
the above effect. The code is as follows:

typedef uint8_t v8qiu __attribute__ ((__vector_size__ (8)));
int forloop () {
    int size = 1, offset = 3;
    uint8_t* a = malloc(sizeof(char) * size);

    v8qiu av = *(v8qiu *)(a + offset);
    __v8si b = {};
    for (int i = 0; i < 8; i++) {
        b[i] = (a + offset)[i];
    }

    for (int i = 0; i < 8; i++) {
        printf("%d\n", b[i]);
    }
}

int builtin_cvt () {
    int size = 1, offset = 3;
    uint8_t* a = malloc(sizeof(char) * size);

    v8qiu av = *(v8qiu *)(a + offset);
    __v8si b = __builtin_convertvector(av, __v8si);

    for (int i = 0; i < 8; i++) {
        printf("%d\n", b[i]);
    }
}

The instructions generated by both functions are redundant and
complex, and are quite difficult to read compared to calling
`_mm256_cvtepu8_epi32` directly. (You can see it here as well:
https://godbolt.org/z/8ojzdav47)

What I want to ask is: How should I write the source code to get
assembly instructions similar to directly calling
_mm256_cvtepu8_epi32?

Or would it be easier if I modified the GIMPLE directly? But it seems
that there is no relevant expression or interface directly
corresponding to `vpmovzxbd` in GIMPLE.

Thanks
Hanke Zhang


Re: Question about generating vpmovzxbd instruction without using the interfaces in immintrin.h

2024-05-30 Thread Hongtao Liu via Gcc
On Fri, May 31, 2024 at 10:58 AM Hanke Zhang via Gcc wrote:
>
> Hi,
> I've recently been trying to hand-write code to trigger automatic
> vectorization optimizations in GCC on Intel x86 machines (without
> using the interfaces in immintrin.h), but I'm running into a problem
> where I can't seem to get the concise `vpmovzxbd` or similar
> instructions.
>
> My requirement is to convert 8 `uint8_t` elements to `int32_t` type
> and print the output. If I use the interface (_mm256_cvtepu8_epi32) in
> immintrin.h, the code is as follows:
>
> int immintrin () {
>     int size = 1, offset = 3;
>     uint8_t* a = malloc(sizeof(char) * size);
>
>     __v8si b = (__v8si)_mm256_cvtepu8_epi32(*(__m128i *)(a + offset));
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
> }
>
> After compiling with -mavx2 -O3, you can get concise and efficient
> instructions. (You can see it here: https://godbolt.org/z/8ojzdav47)
>
> But if I do not use this interface and instead use a for-loop or the
> `__builtin_convertvector` interface provided by GCC, I cannot achieve
> the above effect. The code is as follows:
>
> typedef uint8_t v8qiu __attribute__ ((__vector_size__ (8)));
> int forloop () {
>     int size = 1, offset = 3;
>     uint8_t* a = malloc(sizeof(char) * size);
>
>     v8qiu av = *(v8qiu *)(a + offset);
>     __v8si b = {};
>     for (int i = 0; i < 8; i++) {
>         b[i] = (a + offset)[i];
>     }
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
> }
>
> int builtin_cvt () {
>     int size = 1, offset = 3;
>     uint8_t* a = malloc(sizeof(char) * size);
>
>     v8qiu av = *(v8qiu *)(a + offset);
>     __v8si b = __builtin_convertvector(av, __v8si);
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
> }
>
> The instructions generated by both functions are redundant and
> complex, and are quite difficult to read compared to calling
> `_mm256_cvtepu8_epi32` directly. (You can see it here as well:
> https://godbolt.org/z/8ojzdav47)
>
> What I want to ask is: How should I write the source code to get
> assembly instructions similar to directly calling
> _mm256_cvtepu8_epi32?
>
> Or would it be easier if I modified the GIMPLE directly? But it seems
> that there is no relevant expression or interface directly
> corresponding to `vpmovzxbd` in GIMPLE.
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652484.html
We're working on the patch to optimize __builtin_convertvector, after
that it can be as optimal as intel intrinsic.
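
For comparison, a minimal sketch of the `__builtin_convertvector` form written so that it does not read past its input (the snippets in the question allocate 1 byte and then load from offset 3). The helper name `widen_u8x8_to_i32` and the typedef names are made up for illustration, and `memcpy` is used for the loads and stores only to avoid alignment assumptions. Whether GCC turns this into a single `vpmovzxbd` with -mavx2 -O3 depends on the compiler version and on the patch linked above; the sketch makes no claim about the generated code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* GCC vector extension types; the names are illustrative only. */
typedef uint8_t u8x8 __attribute__((vector_size(8)));
typedef int32_t i32x8 __attribute__((vector_size(32)));

/* Zero-extend the 8 bytes at src into 8 int32_t values at dst.
   The caller must provide at least 8 readable bytes at src and
   room for 8 int32_t at dst. */
void widen_u8x8_to_i32(const uint8_t *src, int32_t *dst)
{
    assert(src != NULL && dst != NULL);

    /* Unaligned-safe load of 8 bytes into the narrow vector. */
    u8x8 narrow;
    memcpy(&narrow, src, sizeof narrow);

    /* Element-wise conversion uint8_t -> int32_t (zero extension). */
    i32x8 wide = __builtin_convertvector(narrow, i32x8);

    /* Unaligned-safe store of the 8 widened lanes. */
    memcpy(dst, &wide, sizeof wide);
}
```

Calling it as `widen_u8x8_to_i32(a + offset, out)` with `a` pointing to at least `offset + 8` initialized bytes and `out` an `int32_t[8]` gives the same values as the for-loop version in the question, which makes it convenient for comparing the assembly of the two forms.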
>
> Thanks
> Hanke Zhang



-- 
BR,
Hongtao