Re: 22% degradation seen in embench:matmult-int

Visda.Vokhshoori--- via Gcc Thu, 13 Feb 2025 12:31:38 -0800

“the interchanged loop might for example no longer vectorize.”

The loops are not vectorized, which is OK because this device doesn't support
vectorization.
I just don't think one pass could single-handedly make the code that much slower.

Loop interchange is supposed to swap an inner loop of the nest with an outer
one to improve cache locality.  This should help: on the next iteration, the
data we need will already be in the cache.

The benchmark source is linked below; the loop that gets interchanged is at line 143.

Source: 
https://github.com/embench/embench-iot/blob/master/src/matmult-int/matmult-int.c#L143

This loop is where most of the time is spent.  It would help to have hardware
tracing, both to check whether the interchanged loop actually reduces cache
misses and to see what is making it run so much slower.
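To make the locality argument concrete, here is a sketch of a generic 20 x 20
integer matrix multiply in its original i-j-k order and in the i-k-j order
you get by swapping the two inner loops.  It is not copied from matmult-int.c,
and GCC's pass may transform the real nest differently; the names `N`,
`multiply_ijk`, `multiply_ikj` and `check_orders_agree` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical size matching the 20 x 20 matrices in this thread. */
#define N 20

/* Original i-j-k order: the innermost loop walks B down a column, so
   consecutive iterations touch addresses N * sizeof(int) bytes apart. */
void multiply_ijk(int A[N][N], int B[N][N], int Res[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            Res[i][j] = 0;
            for (size_t k = 0; k < N; k++)
                Res[i][j] += A[i][k] * B[k][j];
        }
}

/* Interchanged i-k-j order: the zeroing of Res is hoisted into its own
   loop so the j and k loops can be swapped; the innermost loop now walks
   B and Res along a row, touching adjacent ints. */
void multiply_ikj(int A[N][N], int B[N][N], int Res[N][N])
{
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++)
            Res[i][j] = 0;
        for (size_t k = 0; k < N; k++)
            for (size_t j = 0; j < N; j++)
                Res[i][j] += A[i][k] * B[k][j];
    }
}

/* Runs both loop orders on the same inputs and asserts they agree. */
void check_orders_agree(int A[N][N], int B[N][N])
{
    int r1[N][N], r2[N][N];
    multiply_ijk(A, B, r1);
    multiply_ikj(A, B, r2);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            assert(r1[i][j] == r2[i][j]);
}
```

With 4-byte `int`, each step of `k` in the original order moves
20 * 4 = 80 bytes through `B`, while each step of `j` in the interchanged order
moves 4 bytes.  Each matrix is 20 * 20 * 4 = 1600 bytes, so whether the shorter
stride buys anything depends on whether the three matrices (4800 bytes) already
fit in the data cache.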

Thanks for your reply!

From: Richard Biener <richard.guent...@gmail.com>
Date: Thursday, February 13, 2025 at 2:57 AM
To: Visda Vokhshoori - C51841 <visda.vokhsho...@microchip.com>
Cc: gcc@gcc.gnu.org <gcc@gcc.gnu.org>
Subject: Re: 22% degradation seen in embench:matmult-int

On Wed, Feb 12, 2025 at 4:38 PM Visda.Vokhshoori--- via Gcc
<gcc@gcc.gnu.org> wrote:
>
> Embench is used for benchmarking on embedded devices.
> One project, matmult-int, has a function Multiply that multiplies
> 20 x 20 matrices.
> The device is an ATSAME70Q21B, which is a Cortex-M7.
> The compiler is an Arm branch based on GCC 13.
> We are compiling with -O3, which enables the loop-interchange pass by default.
>
> When we compile with -fno-loop-interchange we get all 22% back, plus a 5%
> speedup.
>
> When we do the loop interchange on the one loop nest that gets interchanged,
> it is slightly (0.7%) faster.
>
> Has anyone else seen large degradation as a result of loop interchange?

I would suggest comparing the -fopt-info diagnostic output with and
without -fno-loop-interchange;
the interchanged loop might for example no longer vectorize.  Other
than that, no: loop interchange
isn't applied very often and it has a very conservative cost model.

Are you able to share a testcase?

Richard.

>
> Thanks
