On Tue, Aug 5, 2025 at 12:39 PM Tomasz Kaminski <tkami...@redhat.com> wrote:

>
>
> On Tue, Aug 5, 2025 at 12:33 PM Luc Grosheintz <luc.groshei...@gmail.com>
> wrote:
>
>>
>>
>> On 8/5/25 09:36, Tomasz Kaminski wrote:
>> > On Sun, Aug 3, 2025 at 11:07 PM Luc Grosheintz <
>> luc.groshei...@gmail.com>
>> > wrote:
>> >
>> >> The methods layout_{left,right}::mapping::stride are defined
>> >> as
>> >>
>> >>    \prod_{i = 0}^r E[i]
>> >>    \prod_{i = r+1}^n E[i]
>> >>
>> >> This is computed as the product of a precomputed static product and the
>> >> product of the required dynamic extents.
>> >>
>> >> Disassembly shows that even for low-rank extents, i.e. rank == 1 and
>> >> rank == 2, with at least one dynamic extent, the generated code loads
>> >> two values; and then runs the loop over at most one element, e.g. for
>> >> stride_left_d5 defined below the generated code is:
>> >>
>> >>   220:  48 8b 04 f5 00 00 00   mov    rax,QWORD PTR [rsi*8+0x0]
>> >>   227:  00
>> >>   228:  31 d2                  xor    edx,edx
>> >>   22a:  48 85 c0               test   rax,rax
>> >>   22d:  74 23                  je     252 <stride_left_d5+0x32>
>> >>   22f:  48 8b 0c f5 00 00 00   mov    rcx,QWORD PTR [rsi*8+0x0]
>> >>   236:  00
>> >>   237:  48 c1 e1 02            shl    rcx,0x2
>> >>   23b:  74 13                  je     250 <stride_left_d5+0x30>
>> >>   23d:  48 01 f9               add    rcx,rdi
>> >>   240:  48 63 17               movsxd rdx,DWORD PTR [rdi]
>> >>   243:  48 83 c7 04            add    rdi,0x4
>> >>   247:  48 0f af c2            imul   rax,rdx
>> >>   24b:  48 39 f9               cmp    rcx,rdi
>> >>   24e:  75 f0                  jne    240 <stride_left_d5+0x20>
>> >>   250:  89 c2                  mov    edx,eax
>> >>   252:  89 d0                  mov    eax,edx
>> >>   254:  c3                     ret
>> >>
>> >> If there's no dynamic extents, it simply loads the precomputed product
>> >> of static extents.
>> >>
>> >> For rank == 1 the answer is the constant `1`; for rank == 2 it's
>> either 1
>> >> or
>> >> extents.extent(k), with k == 0 for layout_left and k == 1 for
>> >> layout_right.
>> >>
>> >> Consider,
>> >>
>> >>    using Ed = std::extents<int, dyn>;
>> >>    int stride_left_d(const std::layout_left::mapping<Ed>& m, size_t r)
>> >>    { return m.stride(r); }
>> >>
>> >>    using E3d = std::extents<int, 3, dyn>;
>> >>    int stride_left_3d(const std::layout_left::mapping<E3d>& m, size_t
>> r)
>> >>    { return m.stride(r); }
>> >>
>> >>    using Ed5 = std::extents<int, dyn, 5>;
>> >>    int stride_left_d5(const std::layout_left::mapping<Ed5>& m, size_t
>> r)
>> >>    { return m.stride(r); }
>> >>
>> >> The optimized code for these three cases is:
>> >>
>> >>    0000000000000060 <stride_left_d>:
>> >>    60:  b8 01 00 00 00         mov    eax,0x1
>> >>    65:  c3                     ret
>> >>
>> >>    0000000000000090 <stride_left_3d>:
>> >>    90:  48 83 fe 01            cmp    rsi,0x1
>> >>    94:  19 c0                  sbb    eax,eax
>> >>    96:  83 e0 fe               and    eax,0xfffffffe
>> >>    99:  83 c0 03               add    eax,0x3
>> >>    9c:  c3                     ret
>> >>
>> >>    00000000000000a0 <stride_left_d5>:
>> >>    a0:  b8 01 00 00 00         mov    eax,0x1
>> >>    a5:  48 85 f6               test   rsi,rsi
>> >>    a8:  74 02                  je     ac <stride_left_d5+0xc>
>> >>    aa:  8b 07                  mov    eax,DWORD PTR [rdi]
>> >>    ac:  c3                     ret
>> >>
>> >> For rank == 1 it simply returns 1 (as expected). For rank == 2, it
>> >> either implements a branchless formula, or conditionally loads one
>> >> value. In all cases involving a dynamic extent this seems like it's
>> >> always doing clearly less work, both in terms of computation and loads.
>> >> In cases not involving a dynamic extent, it replaces loading one value
>> >> with a branchless sequence of four instructions.
>> >>
>> >> This commit also refactors __size to no use any of the precomputed
>> >> arrays. This prevents instantiating __{fwd,rev}_partial_prods for
>> >> low-rank extents. This results in a further size reduction of a
>> >> reference object file (described two commits prior) by 9% from 46.0kB
>> to
>> >> 41.9kB.
>> >>
>> >> In a prior commit we optimized __size to produce better object code by
>> >> precomputing the static products. This refactor enables the optimizer
>> to
>> >> generate the same optimized code.
>> >>
>> >> libstdc++-v3/ChangeLog:
>> >>
>> >>          * include/std/mdspan (__mdspan::__fwd_prod): Optimize
>> >>          for rank <= 2.
>> >>          (__mdspan::__rev_prod): Ditto.
>> >>          (__mdspan::__size): Refactor to use a pre-computed product,
>> not
>> >>          a partial product.
>> >>
>> >> Signed-off-by: Luc Grosheintz <luc.groshei...@gmail.com>
>> >> ---
>> >>   libstdc++-v3/include/std/mdspan | 32 ++++++++++++++++++++++++++------
>> >>   1 file changed, 26 insertions(+), 6 deletions(-)
>> >>
>> >> diff --git a/libstdc++-v3/include/std/mdspan
>> >> b/libstdc++-v3/include/std/mdspan
>> >> index dc1b44ee35c..11a5063f60d 100644
>> >> --- a/libstdc++-v3/include/std/mdspan
>> >> +++ b/libstdc++-v3/include/std/mdspan
>> >> @@ -423,25 +423,45 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >>         constexpr typename _Extents::index_type
>> >>         __fwd_prod(const _Extents& __exts, size_t __r) noexcept
>> >>         {
>> >> +       constexpr size_t __rank = _Extents::rank();
>> >>          constexpr auto __sta_exts = __static_extents<_Extents>();
>> >> -       size_t __sta_prod = __fwd_partial_prods<__sta_exts>[__r];
>> >> -       return __extents_prod(__exts, __sta_prod, 0, __r);
>> >> +       if constexpr (__rank == 1)
>> >> +         return 1;
>> >> +       else if constexpr (__rank == 2)
>> >> +         return __r == 0 ? 1 : __exts.extent(0);
>> >> +       else
>> >> +         {
>> >> +           size_t __sta_prod = __fwd_partial_prods<__sta_exts>[__r];
>> >> +           return __extents_prod(__exts, __sta_prod, 0, __r);
>> >> +         }
>> >>         }
>> >>
>> >>       template<typename _Extents>
>> >>         constexpr typename _Extents::index_type
>> >>         __rev_prod(const _Extents& __exts, size_t __r) noexcept
>> >>         {
>> >> -       constexpr auto __sta_exts = __static_extents<_Extents>();
>> >>          constexpr size_t __rank = _Extents::rank();
>> >> -       size_t __sta_prod = __rev_partial_prods<__sta_exts>[__r];
>> >> -       return __extents_prod(__exts, __sta_prod, __r + 1, __rank);
>> >> +       constexpr auto __sta_exts = __static_extents<_Extents>();
>> >> +       if constexpr (__rank == 1)
>> >> +         return 1;
>> >> +       else if constexpr (__rank == 2)
>> >> +         return __r == 0 ? __exts.extent(1) : 1;
>> >> +       else
>> >> +         {
>> >> +           size_t __sta_prod = __rev_partial_prods<__sta_exts>[__r];
>> >> +           return __extents_prod(__exts, __sta_prod, __r + 1, __rank);
>> >> +         }
>> >>         }
>> >>
>> >>       template<typename _Extents>
>> >>         constexpr typename _Extents::index_type
>> >>         __size(const _Extents& __exts) noexcept
>> >> -      { return __fwd_prod(__exts, __exts.rank()); }
>> >> +      {
>> >> +       constexpr auto __sta_exts = __static_extents<_Extents>();
>> >>
>> > I am also changing this to constexpr auto&, there is no need to make a
>> copy
>> > of aray.
>>
>> The optimizer agrees: there's no need to make a copy and then correctly
>> chooses to not make a copy.
>>
> You cannot make a constexpr span covering it, unless it is static:
> #include <array>
> #include <span>
>
> int main() {
> constexpr std::array<int, 5> x{};
> constexpr std::span<const int> s = x; // ill-formed
> }
> Link here: https://godbolt.org/z/4MYoY5noq
>
C++26 symbolic addressing may change that, but I am not 100% certain.
Using a reference, makes making a span not an issue.

>
>> I can see why it makes sense. However, having disassembled mdspan related
>> code several times, I got the feeling that this is an easy optimization
>> for
>> GCC. Therefore, it didn't register as a concern. Please let me know if
>> there's something I missed (it's quite likely).
>>
>> The dilemma is: either it doesn't matter (then why change it); or it does
>> matter which implies that it could change the generated code. If it's the
>> latter I need to recheck everything, because I superstitiously believe I'm
>> always "unlucky" if I "wing it" in these situations.
>>
>> You've mentioned a separate commit, would it make sense to put these
>> changes
>> into a [PATCH 9/8]? That way the source code will soon have the preferred
>> form, and whatever mistakes are in the commit messages will have been my
>> own
>> doing.
>>
>> If we're doing this to optimize -O0, then I think it should be a different
>> task, because there will be a lot more to fix. Personally, I've given up
>> on -O0 in scientific codes, it's easily 10x slower and that then ended up
>> hovering on the threshold to unacceptably slow.
>>
>> >
>> >> +       constexpr size_t __rank = _Extents::rank();
>> >> +       constexpr size_t __sta_prod = __static_prod<__sta_exts>(0,
>> __rank);
>> >> +       return __extents_prod(__exts, __sta_prod, 0, __rank);
>> >> +      }
>> >>
>> >>       template<typename _IndexType, size_t... _Counts>
>> >>         auto __build_dextents_type(integer_sequence<size_t,
>> _Counts...>)
>> >> --
>> >> 2.50.0
>> >>
>> >>
>> >
>>
>>

Reply via email to