Re: [PATCH v2 0/8] libstdc++: Sequence of improvements to .

Tomasz Kaminski Tue, 05 Aug 2025 05:50:21 -0700

On Tue, Aug 5, 2025 at 2:14 PM Luc Grosheintz <luc.groshei...@gmail.com>
wrote:


>
>
> On 8/5/25 13:25, Tomasz Kaminski wrote:
> > On Tue, Aug 5, 2025 at 1:14 PM Luc Grosheintz <luc.groshei...@gmail.com>
> > wrote:
> >
> >>
> >>
> >> On 8/5/25 10:16, Tomasz Kaminski wrote:
> >>> Hi,
> >>>
> >>> I have posted v3 patches with changes I have made locally for first 6
> >>> patches, and I think this series
> >>> is ready to land, in addition to
> >>> https://gcc.gnu.org/pipermail/libstdc++/2025-July/062727.html, that I
> >>> already reviewed.
> >>> I will keep aligned accessor on top of it.
> >>>
> >>> As the separate commit, we should update the __fwd_partial_prod and
> >>> __rev_partial_prod computation
> >>> to use fact that this is partial prod, and we have previous partial
> >>> product computed:
> >>>       template<array _Extents>
> >>>         constexpr auto __fwd_partial_prods = [] consteval
> >>>           {
> >>>             constexpr size_t __rank = _Extents.size();
> >>>             std::array<size_t, __rank + 1> __ret;
> >>>             __ret[0] = 1;
> >>>             for(size_t __r = 1; __r < __rank + 1; ++__r)
> >>>               if (_Extents[__r] != dynamic_extents)
> >>>                 __ret[__r] = __ret[0] * _Extents[__r];
> >>>             return __ret;
> >>>           }();
> >>> We are doing this at compile time, but this should help composition
> >> speed.
> >>> I would then inline __static_prod loop into the __mdspan::__size
> function
> >>> and remove that function
> >>> entirely.
> >>
> >> I checked both approaches, and it didn't affect the compile time in
> >> the slightest, so I decided it wasn't worth it; and went with the
> >> less error prone solution. OTOH, you're right, knowingly leaving in
> >> something that's accidentally quadratic is begging for some obscure
> >> problem to arise. Moreover, measuring these can be deceptive, because
> >> they have a tendency to be fast enough, right up to the point where it
> >> becomes prohibitively expensive.
> >>
> > Note that we are talking about the compile-time computation, that  is
> done
> > by an interpreter. It will never be costfull at runtime. And remove the
> > unnecessary
> > element from __fwd_partial_prod that should be the size of __rank instead
> > of __rank + 1.
>
> Yes, but leaving in code that could bring the compiler / compilation
> to a grinding halt doesn't seems smart either (in hindsight) :)
>
> >
> >>
> >> If you want I can fix this up in a separate patch.
> >>
> >>>
> >>>
> >>> I would be interested if we could reduce the __fwd/rev_partial_prod
> >> sizes,
> >>> but not keep the outermost dimensions there,
> >>> i.e. adding a check for __r == 0 early in the __fwd_prod funciton.
> >>>           constexpr size_t __rank = _Extents::rank();
> >>>           constexpr auto& __sta_exts = __static_extents<_Extents>();
> >>>           if constexpr (__rank == 1)
> >>>             return 1;
> >>>          // new if here, that is extracted from __rank == 2
> >>>           else if (__r == 0)
> >>>             return 1;
> >>>           else if constexpr (__rank == 2)
> >>>             return _exts.extent(0);
> >>>           else if constexpr
> >>> (__all_dynamic(std::span(__sta_exts).first(__rank-1)))
> >>>             return __extents_prod(__exts, 1, 0, __r);
> >>>           else
> >>>             {
> >>>               size_t __sta_prod =
> __fwd_partial_prods<__sta_exts>[__r-1];
> >> //
> >>> size reduce by one here
> >>>               return __extents_prod(__exts, __sta_prod, 0, __r);
> >>>             }
> >>
> >> To me it's not obvious how this one will go. Intuitively, this adds a
> >> comparison and branching to avoid loading 8 bytes. Compared to the other
> >> size reductions, this has a lower upper limit as to how effective it's
> >> going to be.
> >>
> > My idea here is that we will avoid storing 8 bytes per possible
> combination
> > of extents values, to store a constant that always has a value of 1.
> > This seems wasteful to me. So, I would go for it if:
> > * produces the same code for rank == 0, as before
> > * does not prevent using vectorization for all dynamic extents with, just
> > with
> >    addition of single if.
>
> Since you really don't like wasting space: in many cases the 7 high
> bytes of size_t will be zero (or 0xff...ff). What's stopping us from
> storing [3, 5, 7] in an array<uint8_t>? We need to be very careful
> handling dynamic_extent; but I think it's only used in checks, e.g.
>
>    _Extent[i] == dynamic_extent
>
> and that can be translated to
>
>    _Extent[i] == decltype(_Extent[i])(-1)
>
> and we've saved 8x .rodata space =)
>
Actually, using the product of all static extents to determine the element
type
__fwd_partial_prod/__rev_partial_prod sounds like something we could explore
if someone complains about the sizes. Same for the _S_dynamic_index_data.

>
> >
> >
> >> I don't think I'll work on this, since I don't have the required test
> >> cases to properly measure this to justify the change; and I don't trust
> >> myself to make the right call without very exhaustive measurements. I
> >> also don't have access to a suitably large variety of hardware.
> >>
> >>>
> >>> Similarly for __rev_prod:
> >>>           constexpr size_t __rank = _Extents::rank();
> >>>           constexpr auto& __sta_exts = __static_extents<_Extents>();
> >>>           if constexpr (__rank == 1)
> >>>             return 1;
> >>>          else if (__r == __rank - 1)
> >>>             return 1;
> >>>           else if constexpr (__rank == 2)
> >>>             return __exts.extent(1);
> >>>           else if constexpr
> >>> (__all_dynamic(std::span(__sta_exts).last(__rank-1)))
> >>>             return __extents_prod(__exts, 1, __r + 1, __rank);
> >>>           else
> >>>             {
> >>>               size_t __sta_prod =
> __rev_partial_prods<__sta_exts>[__r-1];
> >> //
> >>> size reduced by one here.
> >>>               return __extents_prod(__exts, __sta_prod, __r + 1,
> __rank);
> >>>             }
> >>>
> >>> Regards,
> >>> Tomasz
> >>>
> >>>
> >>> On Mon, Aug 4, 2025 at 7:51 PM Luc Grosheintz <
> luc.groshei...@gmail.com>
> >>> wrote:
> >>>
> >>>>
> >>>>
> >>>> On 8/4/25 17:42, Tomasz Kaminski wrote:
> >>>>> On Mon, Aug 4, 2025 at 1:14 PM Tomasz Kaminski <tkami...@redhat.com>
> >>>> wrote:
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Aug 4, 2025 at 1:08 PM Luc Grosheintz <
> >> luc.groshei...@gmail.com
> >>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Tomasz,
> >>>>>>>
> >>>>>>> Thank you for the review! Sorry about the missing parens, even
> after
> >>>>>>> "Ctrl+F"ing each of the emails I can't find the requires clause
> (the
> >>>>>>> two I found both need the parens).
> >>>>>>>
> >>>>>> I was mentioning this one, but it is possible that I am wrong on
> this
> >>>> one,
> >>>>>> I haven't yet checked it, as I was planning to do it during full
> >> review.
> >>>>>>        template<array _Extents>
> >>>>>>          requires (__all_dynamic<_Extents>())
> >>>>>>          class _StaticExtents<_Extents>;
> >>>>>> But regardless, I will make that change locally if it does work, so
> >> you
> >>>> do
> >>>>>> not
> >>>>>> need to update patch series for it.
> >>>>>>
> >>>>> Parenthesis seem to be indeed required here, so my comment was a red
> >>>>> herring.
> >>>>
> >>>> I'm pretty sure that this is the reason why I wrongly concluded
> >>>> that the pattern is `requires(cond)`, i.e. the parens are always
> >>>> needed.
> >>>>
> >>>>>
> >>>>>>
> >>>>>>> Do you think it's possible to merge this series before the other
> two
> >>>>>>> outstanding patch series? There's some risk of collision; making
> >>>> changes
> >>>>>>> to this patch series is much more time consuming than rebasing (and
> >>>>>>> retesting the other two patch series).
> >>>>>>>
> >>>>>> This sounds very reasonable. I will prepare a stack of patches in
> your
> >>>>>> suggested
> >>>>>> order locally, and push them once they are approved. But it still
> make
> >>>>>>
> >>>>>>>
> >>>>>>> Thank you,
> >>>>>>>
> >>>>>> Thank you for your continued contributions.
> >>>>>>
> >>>>>>> Luc
> >>>>>>>
> >>>>>>> On 8/4/25 12:43, Tomasz Kaminski wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I this time I made a quick pass through all changes, before
> >> commenting
> >>>>>>> on
> >>>>>>>> first commits,
> >>>>>>>> and they look solid to me, and I haven't noticed anything I would
> >> like
> >>>>>>> to
> >>>>>>>> change (except parentheses
> >>>>>>>> around requires, but I will handle that locally). I will try to
> do a
> >>>>>>> full
> >>>>>>>> review during this week.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Tomasz
> >>>>>>>>
> >>>>>>>> On Sun, Aug 3, 2025 at 10:59 PM Luc Grosheintz <
> >>>>>>> luc.groshei...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> The combined effect of this sequence of change is:
> >>>>>>>>>
> >>>>>>>>>       * a reduction in the number of template instantiations, by
> >>>>>>>>>         - avoiding needless dependency of IndexType,
> >>>>>>>>>         - special formulas for low-rank extents,
> >>>>>>>>>         - special formulas for (nearly) fully dynamic extents.
> >>>>>>>>>
> >>>>>>>>>       * improved code quality, by
> >>>>>>>>>         - precomputing partial products of the static extents,
> >>>>>>>>>         - special cases for low-rank extents,
> >>>>>>>>>         - rewriting the condition E[i] == dynamic_extent in a
> more
> >>>>>>>>>           optimizer friendly manner.
> >>>>>>>>>         - effectively loop-unrolling extents::operator==.
> >>>>>>>>>
> >>>>>>>>> While simplistic micro-benchmarking shows the effectiveness of
> >> these
> >>>>>>>>> changes, likely the stronger argument is presented in each
> commit:
> >>>>>>>>>       a) each change removes needless complexity,
> >>>>>>>>>       b) before/after examples of generated code show the
> >>>> effectiveness.
> >>>>>>>>>
> >>>>>>>>> Luc Grosheintz (8):
> >>>>>>>>>       libstdc++: Reduce template instantiations in <mdspan>.
> >>>>>>>>>       libstdc++: Precompute products of static extents.
> >>>>>>>>>       libstdc++: Improve low-rank layout_{left,right}::stride.
> >>>>>>>>>       libstdc++: Improve fully dynamic extents in mdspan.
> >>>>>>>>>       libstdc++: Improve nearly fully dynamic extents in mdspan.
> >>>>>>>>>       libstdc++: Reduce indirection in extents::extent.
> >>>>>>>>>       libstdc++: Improve extents::operator==.
> >>>>>>>>>       libstdc++: Replace numeric_limit with __int_traits in
> mdspan.
> >>>>>>>>>
> >>>>>>>>>      libstdc++-v3/include/std/mdspan               | 282
> >>>>>>> +++++++++++++-----
> >>>>>>>>>      .../mdspan/extents/class_mandates_neg.cc      |   3 +
> >>>>>>>>>      2 files changed, 208 insertions(+), 77 deletions(-)
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> 2.50.0
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
>
>

Re: [PATCH v2 0/8] libstdc++: Sequence of improvements to .

Reply via email to