On Tue, Aug 5, 2025 at 12:39 PM Tomasz Kaminski <tkami...@redhat.com> wrote:
> > > On Tue, Aug 5, 2025 at 12:33 PM Luc Grosheintz <luc.groshei...@gmail.com> > wrote: > >> >> >> On 8/5/25 09:36, Tomasz Kaminski wrote: >> > On Sun, Aug 3, 2025 at 11:07 PM Luc Grosheintz < >> luc.groshei...@gmail.com> >> > wrote: >> > >> >> The methods layout_{left,right}::mapping::stride are defined >> >> as >> >> >> >> \prod_{i = 0}^r E[i] >> >> \prod_{i = r+1}^n E[i] >> >> >> >> This is computed as the product of a precomputed static product and the >> >> product of the required dynamic extents. >> >> >> >> Disassembly shows that even for low-rank extents, i.e. rank == 1 and >> >> rank == 2, with at least one dynamic extent, the generated code loads >> >> two values; and then runs the loop over at most one element, e.g. for >> >> stride_left_d5 defined below the generated code is: >> >> >> >> 220: 48 8b 04 f5 00 00 00 mov rax,QWORD PTR [rsi*8+0x0] >> >> 227: 00 >> >> 228: 31 d2 xor edx,edx >> >> 22a: 48 85 c0 test rax,rax >> >> 22d: 74 23 je 252 <stride_left_d5+0x32> >> >> 22f: 48 8b 0c f5 00 00 00 mov rcx,QWORD PTR [rsi*8+0x0] >> >> 236: 00 >> >> 237: 48 c1 e1 02 shl rcx,0x2 >> >> 23b: 74 13 je 250 <stride_left_d5+0x30> >> >> 23d: 48 01 f9 add rcx,rdi >> >> 240: 48 63 17 movsxd rdx,DWORD PTR [rdi] >> >> 243: 48 83 c7 04 add rdi,0x4 >> >> 247: 48 0f af c2 imul rax,rdx >> >> 24b: 48 39 f9 cmp rcx,rdi >> >> 24e: 75 f0 jne 240 <stride_left_d5+0x20> >> >> 250: 89 c2 mov edx,eax >> >> 252: 89 d0 mov eax,edx >> >> 254: c3 ret >> >> >> >> If there's no dynamic extents, it simply loads the precomputed product >> >> of static extents. >> >> >> >> For rank == 1 the answer is the constant `1`; for rank == 2 it's >> either 1 >> >> or >> >> extents.extent(k), with k == 0 for layout_left and k == 1 for >> >> layout_right. >> >> >> >> Consider, >> >> >> >> using Ed = std::extents<int, dyn>; >> >> int stride_left_d(const std::layout_left::mapping<Ed>& m, size_t r) >> >> { return m.stride(r); } >> >> >> >> using E3d = std::extents<int, 3, dyn>; >> >> int stride_left_3d(const std::layout_left::mapping<E3d>& m, size_t >> r) >> >> { return m.stride(r); } >> >> >> >> using Ed5 = std::extents<int, dyn, 5>; >> >> int stride_left_d5(const std::layout_left::mapping<Ed5>& m, size_t >> r) >> >> { return m.stride(r); } >> >> >> >> The optimized code for these three cases is: >> >> >> >> 0000000000000060 <stride_left_d>: >> >> 60: b8 01 00 00 00 mov eax,0x1 >> >> 65: c3 ret >> >> >> >> 0000000000000090 <stride_left_3d>: >> >> 90: 48 83 fe 01 cmp rsi,0x1 >> >> 94: 19 c0 sbb eax,eax >> >> 96: 83 e0 fe and eax,0xfffffffe >> >> 99: 83 c0 03 add eax,0x3 >> >> 9c: c3 ret >> >> >> >> 00000000000000a0 <stride_left_d5>: >> >> a0: b8 01 00 00 00 mov eax,0x1 >> >> a5: 48 85 f6 test rsi,rsi >> >> a8: 74 02 je ac <stride_left_d5+0xc> >> >> aa: 8b 07 mov eax,DWORD PTR [rdi] >> >> ac: c3 ret >> >> >> >> For rank == 1 it simply returns 1 (as expected). For rank == 2, it >> >> either implements a branchless formula, or conditionally loads one >> >> value. In all cases involving a dynamic extent this seems like it's >> >> always doing clearly less work, both in terms of computation and loads. >> >> In cases not involving a dynamic extent, it replaces loading one value >> >> with a branchless sequence of four instructions. >> >> >> >> This commit also refactors __size to no use any of the precomputed >> >> arrays. This prevents instantiating __{fwd,rev}_partial_prods for >> >> low-rank extents. This results in a further size reduction of a >> >> reference object file (described two commits prior) by 9% from 46.0kB >> to >> >> 41.9kB. >> >> >> >> In a prior commit we optimized __size to produce better object code by >> >> precomputing the static products. This refactor enables the optimizer >> to >> >> generate the same optimized code. >> >> >> >> libstdc++-v3/ChangeLog: >> >> >> >> * include/std/mdspan (__mdspan::__fwd_prod): Optimize >> >> for rank <= 2. >> >> (__mdspan::__rev_prod): Ditto. >> >> (__mdspan::__size): Refactor to use a pre-computed product, >> not >> >> a partial product. >> >> >> >> Signed-off-by: Luc Grosheintz <luc.groshei...@gmail.com> >> >> --- >> >> libstdc++-v3/include/std/mdspan | 32 ++++++++++++++++++++++++++------ >> >> 1 file changed, 26 insertions(+), 6 deletions(-) >> >> >> >> diff --git a/libstdc++-v3/include/std/mdspan >> >> b/libstdc++-v3/include/std/mdspan >> >> index dc1b44ee35c..11a5063f60d 100644 >> >> --- a/libstdc++-v3/include/std/mdspan >> >> +++ b/libstdc++-v3/include/std/mdspan >> >> @@ -423,25 +423,45 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION >> >> constexpr typename _Extents::index_type >> >> __fwd_prod(const _Extents& __exts, size_t __r) noexcept >> >> { >> >> + constexpr size_t __rank = _Extents::rank(); >> >> constexpr auto __sta_exts = __static_extents<_Extents>(); >> >> - size_t __sta_prod = __fwd_partial_prods<__sta_exts>[__r]; >> >> - return __extents_prod(__exts, __sta_prod, 0, __r); >> >> + if constexpr (__rank == 1) >> >> + return 1; >> >> + else if constexpr (__rank == 2) >> >> + return __r == 0 ? 1 : __exts.extent(0); >> >> + else >> >> + { >> >> + size_t __sta_prod = __fwd_partial_prods<__sta_exts>[__r]; >> >> + return __extents_prod(__exts, __sta_prod, 0, __r); >> >> + } >> >> } >> >> >> >> template<typename _Extents> >> >> constexpr typename _Extents::index_type >> >> __rev_prod(const _Extents& __exts, size_t __r) noexcept >> >> { >> >> - constexpr auto __sta_exts = __static_extents<_Extents>(); >> >> constexpr size_t __rank = _Extents::rank(); >> >> - size_t __sta_prod = __rev_partial_prods<__sta_exts>[__r]; >> >> - return __extents_prod(__exts, __sta_prod, __r + 1, __rank); >> >> + constexpr auto __sta_exts = __static_extents<_Extents>(); >> >> + if constexpr (__rank == 1) >> >> + return 1; >> >> + else if constexpr (__rank == 2) >> >> + return __r == 0 ? __exts.extent(1) : 1; >> >> + else >> >> + { >> >> + size_t __sta_prod = __rev_partial_prods<__sta_exts>[__r]; >> >> + return __extents_prod(__exts, __sta_prod, __r + 1, __rank); >> >> + } >> >> } >> >> >> >> template<typename _Extents> >> >> constexpr typename _Extents::index_type >> >> __size(const _Extents& __exts) noexcept >> >> - { return __fwd_prod(__exts, __exts.rank()); } >> >> + { >> >> + constexpr auto __sta_exts = __static_extents<_Extents>(); >> >> >> > I am also changing this to constexpr auto&, there is no need to make a >> copy >> > of aray. >> >> The optimizer agrees: there's no need to make a copy and then correctly >> chooses to not make a copy. >> > You cannot make a constexpr span covering it, unless it is static: > #include <array> > #include <span> > > int main() { > constexpr std::array<int, 5> x{}; > constexpr std::span<const int> s = x; // ill-formed > } > Link here: https://godbolt.org/z/4MYoY5noq > C++26 symbolic addressing may change that, but I am not 100% certain. Using a reference, makes making a span not an issue. > >> I can see why it makes sense. However, having disassembled mdspan related >> code several times, I got the feeling that this is an easy optimization >> for >> GCC. Therefore, it didn't register as a concern. Please let me know if >> there's something I missed (it's quite likely). >> >> The dilemma is: either it doesn't matter (then why change it); or it does >> matter which implies that it could change the generated code. If it's the >> latter I need to recheck everything, because I superstitiously believe I'm >> always "unlucky" if I "wing it" in these situations. >> >> You've mentioned a separate commit, would it make sense to put these >> changes >> into a [PATCH 9/8]? That way the source code will soon have the preferred >> form, and whatever mistakes are in the commit messages will have been my >> own >> doing. >> >> If we're doing this to optimize -O0, then I think it should be a different >> task, because there will be a lot more to fix. Personally, I've given up >> on -O0 in scientific codes, it's easily 10x slower and that then ended up >> hovering on the threshold to unacceptably slow. >> >> > >> >> + constexpr size_t __rank = _Extents::rank(); >> >> + constexpr size_t __sta_prod = __static_prod<__sta_exts>(0, >> __rank); >> >> + return __extents_prod(__exts, __sta_prod, 0, __rank); >> >> + } >> >> >> >> template<typename _IndexType, size_t... _Counts> >> >> auto __build_dextents_type(integer_sequence<size_t, >> _Counts...>) >> >> -- >> >> 2.50.0 >> >> >> >> >> > >> >>