Re: [PATCH v2 0/8] libstdc++: Sequence of improvements to .

Tomasz Kaminski Tue, 05 Aug 2025 05:56:21 -0700

On Tue, Aug 5, 2025 at 2:49 PM Tomasz Kaminski <tkami...@redhat.com> wrote:


>
>
> On Tue, Aug 5, 2025 at 2:14 PM Luc Grosheintz <luc.groshei...@gmail.com>
> wrote:
>
>>
>>
>> On 8/5/25 13:25, Tomasz Kaminski wrote:
>> > On Tue, Aug 5, 2025 at 1:14 PM Luc Grosheintz <luc.groshei...@gmail.com
>> >
>> > wrote:
>> >
>> >>
>> >>
>> >> On 8/5/25 10:16, Tomasz Kaminski wrote:
>> >>> Hi,
>> >>>
>> >>> I have posted v3 patches with changes I have made locally for first 6
>> >>> patches, and I think this series
>> >>> is ready to land, in addition to
>> >>> https://gcc.gnu.org/pipermail/libstdc++/2025-July/062727.html, that I
>> >>> already reviewed.
>> >>> I will keep aligned accessor on top of it.
>> >>>
>> >>> As the separate commit, we should update the __fwd_partial_prod and
>> >>> __rev_partial_prod computation
>> >>> to use fact that this is partial prod, and we have previous partial
>> >>> product computed:
>> >>>       template<array _Extents>
>> >>>         constexpr auto __fwd_partial_prods = [] consteval
>> >>>           {
>> >>>             constexpr size_t __rank = _Extents.size();
>> >>>             std::array<size_t, __rank + 1> __ret;
>> >>>             __ret[0] = 1;
>> >>>             for(size_t __r = 1; __r < __rank + 1; ++__r)
>> >>>               if (_Extents[__r] != dynamic_extents)
>> >>>                 __ret[__r] = __ret[0] * _Extents[__r];
>> >>>             return __ret;
>> >>>           }();
>> >>> We are doing this at compile time, but this should help composition
>> >> speed.
>> >>> I would then inline __static_prod loop into the __mdspan::__size
>> function
>> >>> and remove that function
>> >>> entirely.
>> >>
>> >> I checked both approaches, and it didn't affect the compile time in
>> >> the slightest, so I decided it wasn't worth it; and went with the
>> >> less error prone solution. OTOH, you're right, knowingly leaving in
>> >> something that's accidentally quadratic is begging for some obscure
>> >> problem to arise. Moreover, measuring these can be deceptive, because
>> >> they have a tendency to be fast enough, right up to the point where it
>> >> becomes prohibitively expensive.
>> >>
>> > Note that we are talking about the compile-time computation, that  is
>> done
>> > by an interpreter. It will never be costfull at runtime. And remove the
>> > unnecessary
>> > element from __fwd_partial_prod that should be the size of __rank
>> instead
>> > of __rank + 1.
>>
>> Yes, but leaving in code that could bring the compiler / compilation
>> to a grinding halt doesn't seems smart either (in hindsight) :)
>>
>> >
>> >>
>> >> If you want I can fix this up in a separate patch.
>> >>
>> >>>
>> >>>
>> >>> I would be interested if we could reduce the __fwd/rev_partial_prod
>> >> sizes,
>> >>> but not keep the outermost dimensions there,
>> >>> i.e. adding a check for __r == 0 early in the __fwd_prod funciton.
>> >>>           constexpr size_t __rank = _Extents::rank();
>> >>>           constexpr auto& __sta_exts = __static_extents<_Extents>();
>> >>>           if constexpr (__rank == 1)
>> >>>             return 1;
>> >>>          // new if here, that is extracted from __rank == 2
>> >>>           else if (__r == 0)
>> >>>             return 1;
>> >>>           else if constexpr (__rank == 2)
>> >>>             return _exts.extent(0);
>> >>>           else if constexpr
>> >>> (__all_dynamic(std::span(__sta_exts).first(__rank-1)))
>> >>>             return __extents_prod(__exts, 1, 0, __r);
>> >>>           else
>> >>>             {
>> >>>               size_t __sta_prod =
>> __fwd_partial_prods<__sta_exts>[__r-1];
>> >> //
>> >>> size reduce by one here
>> >>>               return __extents_prod(__exts, __sta_prod, 0, __r);
>> >>>             }
>> >>
>> >> To me it's not obvious how this one will go. Intuitively, this adds a
>> >> comparison and branching to avoid loading 8 bytes. Compared to the
>> other
>> >> size reductions, this has a lower upper limit as to how effective it's
>> >> going to be.
>> >>
>> > My idea here is that we will avoid storing 8 bytes per possible
>> combination
>> > of extents values, to store a constant that always has a value of 1.
>> > This seems wasteful to me. So, I would go for it if:
>> > * produces the same code for rank == 0, as before
>> > * does not prevent using vectorization for all dynamic extents with,
>> just
>> > with
>> >    addition of single if.
>>
>> Since you really don't like wasting space: in many cases the 7 high
>> bytes of size_t will be zero (or 0xff...ff). What's stopping us from
>> storing [3, 5, 7] in an array<uint8_t>? We need to be very careful
>> handling dynamic_extent; but I think it's only used in checks, e.g.
>>
>>    _Extent[i] == dynamic_extent
>>
>> and that can be translated to
>>
>>    _Extent[i] == decltype(_Extent[i])(-1)
>>
>> and we've saved 8x .rodata space =)
>>
> Actually, using the product of all static extents to determine the element
> type
> __fwd_partial_prod/__rev_partial_prod sounds like something we could
> explore
> if someone complains about the sizes. Same for the _S_dynamic_index_data.
>
As I was unclear, I agree that changing the code to remove 1 extra element
of the array
is premature, and we may have better avenues to reduce that.
So let's just clear up the __fwd_prod computations and remove unused
__rank+1
element.

> >
>> >
>> >> I don't think I'll work on this, since I don't have the required test
>> >> cases to properly measure this to justify the change; and I don't trust
>> >> myself to make the right call without very exhaustive measurements. I
>> >> also don't have access to a suitably large variety of hardware.
>> >>
>> >>>
>> >>> Similarly for __rev_prod:
>> >>>           constexpr size_t __rank = _Extents::rank();
>> >>>           constexpr auto& __sta_exts = __static_extents<_Extents>();
>> >>>           if constexpr (__rank == 1)
>> >>>             return 1;
>> >>>          else if (__r == __rank - 1)
>> >>>             return 1;
>> >>>           else if constexpr (__rank == 2)
>> >>>             return __exts.extent(1);
>> >>>           else if constexpr
>> >>> (__all_dynamic(std::span(__sta_exts).last(__rank-1)))
>> >>>             return __extents_prod(__exts, 1, __r + 1, __rank);
>> >>>           else
>> >>>             {
>> >>>               size_t __sta_prod =
>> __rev_partial_prods<__sta_exts>[__r-1];
>> >> //
>> >>> size reduced by one here.
>> >>>               return __extents_prod(__exts, __sta_prod, __r + 1,
>> __rank);
>> >>>             }
>> >>>
>> >>> Regards,
>> >>> Tomasz
>> >>>
>> >>>
>> >>> On Mon, Aug 4, 2025 at 7:51 PM Luc Grosheintz <
>> luc.groshei...@gmail.com>
>> >>> wrote:
>> >>>
>> >>>>
>> >>>>
>> >>>> On 8/4/25 17:42, Tomasz Kaminski wrote:
>> >>>>> On Mon, Aug 4, 2025 at 1:14 PM Tomasz Kaminski <tkami...@redhat.com
>> >
>> >>>> wrote:
>> >>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Mon, Aug 4, 2025 at 1:08 PM Luc Grosheintz <
>> >> luc.groshei...@gmail.com
>> >>>>>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> Hi Tomasz,
>> >>>>>>>
>> >>>>>>> Thank you for the review! Sorry about the missing parens, even
>> after
>> >>>>>>> "Ctrl+F"ing each of the emails I can't find the requires clause
>> (the
>> >>>>>>> two I found both need the parens).
>> >>>>>>>
>> >>>>>> I was mentioning this one, but it is possible that I am wrong on
>> this
>> >>>> one,
>> >>>>>> I haven't yet checked it, as I was planning to do it during full
>> >> review.
>> >>>>>>        template<array _Extents>
>> >>>>>>          requires (__all_dynamic<_Extents>())
>> >>>>>>          class _StaticExtents<_Extents>;
>> >>>>>> But regardless, I will make that change locally if it does work, so
>> >> you
>> >>>> do
>> >>>>>> not
>> >>>>>> need to update patch series for it.
>> >>>>>>
>> >>>>> Parenthesis seem to be indeed required here, so my comment was a red
>> >>>>> herring.
>> >>>>
>> >>>> I'm pretty sure that this is the reason why I wrongly concluded
>> >>>> that the pattern is `requires(cond)`, i.e. the parens are always
>> >>>> needed.
>> >>>>
>> >>>>>
>> >>>>>>
>> >>>>>>> Do you think it's possible to merge this series before the other
>> two
>> >>>>>>> outstanding patch series? There's some risk of collision; making
>> >>>> changes
>> >>>>>>> to this patch series is much more time consuming than rebasing
>> (and
>> >>>>>>> retesting the other two patch series).
>> >>>>>>>
>> >>>>>> This sounds very reasonable. I will prepare a stack of patches in
>> your
>> >>>>>> suggested
>> >>>>>> order locally, and push them once they are approved. But it still
>> make
>> >>>>>>
>> >>>>>>>
>> >>>>>>> Thank you,
>> >>>>>>>
>> >>>>>> Thank you for your continued contributions.
>> >>>>>>
>> >>>>>>> Luc
>> >>>>>>>
>> >>>>>>> On 8/4/25 12:43, Tomasz Kaminski wrote:
>> >>>>>>>> Hi,
>> >>>>>>>>
>> >>>>>>>> I this time I made a quick pass through all changes, before
>> >> commenting
>> >>>>>>> on
>> >>>>>>>> first commits,
>> >>>>>>>> and they look solid to me, and I haven't noticed anything I would
>> >> like
>> >>>>>>> to
>> >>>>>>>> change (except parentheses
>> >>>>>>>> around requires, but I will handle that locally). I will try to
>> do a
>> >>>>>>> full
>> >>>>>>>> review during this week.
>> >>>>>>>>
>> >>>>>>>> Regards,
>> >>>>>>>> Tomasz
>> >>>>>>>>
>> >>>>>>>> On Sun, Aug 3, 2025 at 10:59 PM Luc Grosheintz <
>> >>>>>>> luc.groshei...@gmail.com>
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> The combined effect of this sequence of change is:
>> >>>>>>>>>
>> >>>>>>>>>       * a reduction in the number of template instantiations, by
>> >>>>>>>>>         - avoiding needless dependency of IndexType,
>> >>>>>>>>>         - special formulas for low-rank extents,
>> >>>>>>>>>         - special formulas for (nearly) fully dynamic extents.
>> >>>>>>>>>
>> >>>>>>>>>       * improved code quality, by
>> >>>>>>>>>         - precomputing partial products of the static extents,
>> >>>>>>>>>         - special cases for low-rank extents,
>> >>>>>>>>>         - rewriting the condition E[i] == dynamic_extent in a
>> more
>> >>>>>>>>>           optimizer friendly manner.
>> >>>>>>>>>         - effectively loop-unrolling extents::operator==.
>> >>>>>>>>>
>> >>>>>>>>> While simplistic micro-benchmarking shows the effectiveness of
>> >> these
>> >>>>>>>>> changes, likely the stronger argument is presented in each
>> commit:
>> >>>>>>>>>       a) each change removes needless complexity,
>> >>>>>>>>>       b) before/after examples of generated code show the
>> >>>> effectiveness.
>> >>>>>>>>>
>> >>>>>>>>> Luc Grosheintz (8):
>> >>>>>>>>>       libstdc++: Reduce template instantiations in <mdspan>.
>> >>>>>>>>>       libstdc++: Precompute products of static extents.
>> >>>>>>>>>       libstdc++: Improve low-rank layout_{left,right}::stride.
>> >>>>>>>>>       libstdc++: Improve fully dynamic extents in mdspan.
>> >>>>>>>>>       libstdc++: Improve nearly fully dynamic extents in mdspan.
>> >>>>>>>>>       libstdc++: Reduce indirection in extents::extent.
>> >>>>>>>>>       libstdc++: Improve extents::operator==.
>> >>>>>>>>>       libstdc++: Replace numeric_limit with __int_traits in
>> mdspan.
>> >>>>>>>>>
>> >>>>>>>>>      libstdc++-v3/include/std/mdspan               | 282
>> >>>>>>> +++++++++++++-----
>> >>>>>>>>>      .../mdspan/extents/class_mandates_neg.cc      |   3 +
>> >>>>>>>>>      2 files changed, 208 insertions(+), 77 deletions(-)
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> 2.50.0
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>
>> >>
>> >>
>> >
>>
>>

Re: [PATCH v2 0/8] libstdc++: Sequence of improvements to .

Reply via email to