On 8/5/25 12:44, Tomasz Kaminski wrote:
On Tue, Aug 5, 2025 at 12:39 PM Tomasz Kaminski <tkami...@redhat.com> wrote:
On Tue, Aug 5, 2025 at 12:33 PM Luc Grosheintz <luc.groshei...@gmail.com>
wrote:
On 8/5/25 09:36, Tomasz Kaminski wrote:
On Sun, Aug 3, 2025 at 11:07 PM Luc Grosheintz <
luc.groshei...@gmail.com>
wrote:
The methods layout_{left,right}::mapping::stride are defined
as
\prod_{i = 0}^r E[i]
\prod_{i = r+1}^n E[i]
This is computed as the product of a precomputed static product and the
product of the required dynamic extents.
Disassembly shows that even for low-rank extents, i.e. rank == 1 and
rank == 2, with at least one dynamic extent, the generated code loads
two values; and then runs the loop over at most one element, e.g. for
stride_left_d5 defined below the generated code is:
220: 48 8b 04 f5 00 00 00 mov rax,QWORD PTR [rsi*8+0x0]
227: 00
228: 31 d2 xor edx,edx
22a: 48 85 c0 test rax,rax
22d: 74 23 je 252 <stride_left_d5+0x32>
22f: 48 8b 0c f5 00 00 00 mov rcx,QWORD PTR [rsi*8+0x0]
236: 00
237: 48 c1 e1 02 shl rcx,0x2
23b: 74 13 je 250 <stride_left_d5+0x30>
23d: 48 01 f9 add rcx,rdi
240: 48 63 17 movsxd rdx,DWORD PTR [rdi]
243: 48 83 c7 04 add rdi,0x4
247: 48 0f af c2 imul rax,rdx
24b: 48 39 f9 cmp rcx,rdi
24e: 75 f0 jne 240 <stride_left_d5+0x20>
250: 89 c2 mov edx,eax
252: 89 d0 mov eax,edx
254: c3 ret
If there's no dynamic extents, it simply loads the precomputed product
of static extents.
For rank == 1 the answer is the constant `1`; for rank == 2 it's
either 1
or
extents.extent(k), with k == 0 for layout_left and k == 1 for
layout_right.
Consider,
using Ed = std::extents<int, dyn>;
int stride_left_d(const std::layout_left::mapping<Ed>& m, size_t r)
{ return m.stride(r); }
using E3d = std::extents<int, 3, dyn>;
int stride_left_3d(const std::layout_left::mapping<E3d>& m, size_t
r)
{ return m.stride(r); }
using Ed5 = std::extents<int, dyn, 5>;
int stride_left_d5(const std::layout_left::mapping<Ed5>& m, size_t
r)
{ return m.stride(r); }
The optimized code for these three cases is:
0000000000000060 <stride_left_d>:
60: b8 01 00 00 00 mov eax,0x1
65: c3 ret
0000000000000090 <stride_left_3d>:
90: 48 83 fe 01 cmp rsi,0x1
94: 19 c0 sbb eax,eax
96: 83 e0 fe and eax,0xfffffffe
99: 83 c0 03 add eax,0x3
9c: c3 ret
00000000000000a0 <stride_left_d5>:
a0: b8 01 00 00 00 mov eax,0x1
a5: 48 85 f6 test rsi,rsi
a8: 74 02 je ac <stride_left_d5+0xc>
aa: 8b 07 mov eax,DWORD PTR [rdi]
ac: c3 ret
For rank == 1 it simply returns 1 (as expected). For rank == 2, it
either implements a branchless formula, or conditionally loads one
value. In all cases involving a dynamic extent this seems like it's
always doing clearly less work, both in terms of computation and loads.
In cases not involving a dynamic extent, it replaces loading one value
with a branchless sequence of four instructions.
This commit also refactors __size to no use any of the precomputed
arrays. This prevents instantiating __{fwd,rev}_partial_prods for
low-rank extents. This results in a further size reduction of a
reference object file (described two commits prior) by 9% from 46.0kB
to
41.9kB.
In a prior commit we optimized __size to produce better object code by
precomputing the static products. This refactor enables the optimizer
to
generate the same optimized code.
libstdc++-v3/ChangeLog:
* include/std/mdspan (__mdspan::__fwd_prod): Optimize
for rank <= 2.
(__mdspan::__rev_prod): Ditto.
(__mdspan::__size): Refactor to use a pre-computed product,
not
a partial product.
Signed-off-by: Luc Grosheintz <luc.groshei...@gmail.com>
---
libstdc++-v3/include/std/mdspan | 32 ++++++++++++++++++++++++++------
1 file changed, 26 insertions(+), 6 deletions(-)
diff --git a/libstdc++-v3/include/std/mdspan
b/libstdc++-v3/include/std/mdspan
index dc1b44ee35c..11a5063f60d 100644
--- a/libstdc++-v3/include/std/mdspan
+++ b/libstdc++-v3/include/std/mdspan
@@ -423,25 +423,45 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
constexpr typename _Extents::index_type
__fwd_prod(const _Extents& __exts, size_t __r) noexcept
{
+ constexpr size_t __rank = _Extents::rank();
constexpr auto __sta_exts = __static_extents<_Extents>();
- size_t __sta_prod = __fwd_partial_prods<__sta_exts>[__r];
- return __extents_prod(__exts, __sta_prod, 0, __r);
+ if constexpr (__rank == 1)
+ return 1;
+ else if constexpr (__rank == 2)
+ return __r == 0 ? 1 : __exts.extent(0);
+ else
+ {
+ size_t __sta_prod = __fwd_partial_prods<__sta_exts>[__r];
+ return __extents_prod(__exts, __sta_prod, 0, __r);
+ }
}
template<typename _Extents>
constexpr typename _Extents::index_type
__rev_prod(const _Extents& __exts, size_t __r) noexcept
{
- constexpr auto __sta_exts = __static_extents<_Extents>();
constexpr size_t __rank = _Extents::rank();
- size_t __sta_prod = __rev_partial_prods<__sta_exts>[__r];
- return __extents_prod(__exts, __sta_prod, __r + 1, __rank);
+ constexpr auto __sta_exts = __static_extents<_Extents>();
+ if constexpr (__rank == 1)
+ return 1;
+ else if constexpr (__rank == 2)
+ return __r == 0 ? __exts.extent(1) : 1;
+ else
+ {
+ size_t __sta_prod = __rev_partial_prods<__sta_exts>[__r];
+ return __extents_prod(__exts, __sta_prod, __r + 1, __rank);
+ }
}
template<typename _Extents>
constexpr typename _Extents::index_type
__size(const _Extents& __exts) noexcept
- { return __fwd_prod(__exts, __exts.rank()); }
+ {
+ constexpr auto __sta_exts = __static_extents<_Extents>();
I am also changing this to constexpr auto&, there is no need to make a
copy
of aray.
The optimizer agrees: there's no need to make a copy and then correctly
chooses to not make a copy.
You cannot make a constexpr span covering it, unless it is static:
#include <array>
#include <span>
int main() {
constexpr std::array<int, 5> x{};
constexpr std::span<const int> s = x; // ill-formed
}
Link here: https://godbolt.org/z/4MYoY5noq
And if we made it static, we will emit additional symbol for the array,
this is why I have
changed this to references.
Thank you! I didn't realize it was a consequence of needing to
create a span. I got stuck on the "it creates a copy" part.
I can see why it makes sense. However, having disassembled mdspan related
code several times, I got the feeling that this is an easy optimization
for
GCC. Therefore, it didn't register as a concern. Please let me know if
there's something I missed (it's quite likely).
The dilemma is: either it doesn't matter (then why change it); or it does
matter which implies that it could change the generated code. If it's the
latter I need to recheck everything, because I superstitiously believe I'm
always "unlucky" if I "wing it" in these situations.
You've mentioned a separate commit, would it make sense to put these
changes
into a [PATCH 9/8]? That way the source code will soon have the preferred
form, and whatever mistakes are in the commit messages will have been my
own
doing.
I have already updated the local patches with the changes, because
otherwise I couldn't
use the span as an argument for __all_static/__all_dynamic.
If we're doing this to optimize -O0, then I think it should be a different
task, because there will be a lot more to fix. Personally, I've given up
on -O0 in scientific codes, it's easily 10x slower and that then ended up
hovering on the threshold to unacceptably slow.
+ constexpr size_t __rank = _Extents::rank();
+ constexpr size_t __sta_prod = __static_prod<__sta_exts>(0,
__rank);
+ return __extents_prod(__exts, __sta_prod, 0, __rank);
+ }
template<typename _IndexType, size_t... _Counts>
auto __build_dextents_type(integer_sequence<size_t,
_Counts...>)
--
2.50.0