On Tue, Aug 5, 2025 at 2:49 PM Tomasz Kaminski <tkami...@redhat.com> wrote:
> > > On Tue, Aug 5, 2025 at 2:14 PM Luc Grosheintz <luc.groshei...@gmail.com> > wrote: > >> >> >> On 8/5/25 13:25, Tomasz Kaminski wrote: >> > On Tue, Aug 5, 2025 at 1:14 PM Luc Grosheintz <luc.groshei...@gmail.com >> > >> > wrote: >> > >> >> >> >> >> >> On 8/5/25 10:16, Tomasz Kaminski wrote: >> >>> Hi, >> >>> >> >>> I have posted v3 patches with changes I have made locally for first 6 >> >>> patches, and I think this series >> >>> is ready to land, in addition to >> >>> https://gcc.gnu.org/pipermail/libstdc++/2025-July/062727.html, that I >> >>> already reviewed. >> >>> I will keep aligned accessor on top of it. >> >>> >> >>> As the separate commit, we should update the __fwd_partial_prod and >> >>> __rev_partial_prod computation >> >>> to use fact that this is partial prod, and we have previous partial >> >>> product computed: >> >>> template<array _Extents> >> >>> constexpr auto __fwd_partial_prods = [] consteval >> >>> { >> >>> constexpr size_t __rank = _Extents.size(); >> >>> std::array<size_t, __rank + 1> __ret; >> >>> __ret[0] = 1; >> >>> for(size_t __r = 1; __r < __rank + 1; ++__r) >> >>> if (_Extents[__r] != dynamic_extents) >> >>> __ret[__r] = __ret[0] * _Extents[__r]; >> >>> return __ret; >> >>> }(); >> >>> We are doing this at compile time, but this should help composition >> >> speed. >> >>> I would then inline __static_prod loop into the __mdspan::__size >> function >> >>> and remove that function >> >>> entirely. >> >> >> >> I checked both approaches, and it didn't affect the compile time in >> >> the slightest, so I decided it wasn't worth it; and went with the >> >> less error prone solution. OTOH, you're right, knowingly leaving in >> >> something that's accidentally quadratic is begging for some obscure >> >> problem to arise. Moreover, measuring these can be deceptive, because >> >> they have a tendency to be fast enough, right up to the point where it >> >> becomes prohibitively expensive. >> >> >> > Note that we are talking about the compile-time computation, that is >> done >> > by an interpreter. It will never be costfull at runtime. And remove the >> > unnecessary >> > element from __fwd_partial_prod that should be the size of __rank >> instead >> > of __rank + 1. >> >> Yes, but leaving in code that could bring the compiler / compilation >> to a grinding halt doesn't seems smart either (in hindsight) :) >> >> > >> >> >> >> If you want I can fix this up in a separate patch. >> >> >> >>> >> >>> >> >>> I would be interested if we could reduce the __fwd/rev_partial_prod >> >> sizes, >> >>> but not keep the outermost dimensions there, >> >>> i.e. adding a check for __r == 0 early in the __fwd_prod funciton. >> >>> constexpr size_t __rank = _Extents::rank(); >> >>> constexpr auto& __sta_exts = __static_extents<_Extents>(); >> >>> if constexpr (__rank == 1) >> >>> return 1; >> >>> // new if here, that is extracted from __rank == 2 >> >>> else if (__r == 0) >> >>> return 1; >> >>> else if constexpr (__rank == 2) >> >>> return _exts.extent(0); >> >>> else if constexpr >> >>> (__all_dynamic(std::span(__sta_exts).first(__rank-1))) >> >>> return __extents_prod(__exts, 1, 0, __r); >> >>> else >> >>> { >> >>> size_t __sta_prod = >> __fwd_partial_prods<__sta_exts>[__r-1]; >> >> // >> >>> size reduce by one here >> >>> return __extents_prod(__exts, __sta_prod, 0, __r); >> >>> } >> >> >> >> To me it's not obvious how this one will go. Intuitively, this adds a >> >> comparison and branching to avoid loading 8 bytes. Compared to the >> other >> >> size reductions, this has a lower upper limit as to how effective it's >> >> going to be. >> >> >> > My idea here is that we will avoid storing 8 bytes per possible >> combination >> > of extents values, to store a constant that always has a value of 1. >> > This seems wasteful to me. So, I would go for it if: >> > * produces the same code for rank == 0, as before >> > * does not prevent using vectorization for all dynamic extents with, >> just >> > with >> > addition of single if. >> >> Since you really don't like wasting space: in many cases the 7 high >> bytes of size_t will be zero (or 0xff...ff). What's stopping us from >> storing [3, 5, 7] in an array<uint8_t>? We need to be very careful >> handling dynamic_extent; but I think it's only used in checks, e.g. >> >> _Extent[i] == dynamic_extent >> >> and that can be translated to >> >> _Extent[i] == decltype(_Extent[i])(-1) >> >> and we've saved 8x .rodata space =) >> > Actually, using the product of all static extents to determine the element > type > __fwd_partial_prod/__rev_partial_prod sounds like something we could > explore > if someone complains about the sizes. Same for the _S_dynamic_index_data. > As I was unclear, I agree that changing the code to remove 1 extra element of the array is premature, and we may have better avenues to reduce that. So let's just clear up the __fwd_prod computations and remove unused __rank+1 element. > > >> > >> >> I don't think I'll work on this, since I don't have the required test >> >> cases to properly measure this to justify the change; and I don't trust >> >> myself to make the right call without very exhaustive measurements. I >> >> also don't have access to a suitably large variety of hardware. >> >> >> >>> >> >>> Similarly for __rev_prod: >> >>> constexpr size_t __rank = _Extents::rank(); >> >>> constexpr auto& __sta_exts = __static_extents<_Extents>(); >> >>> if constexpr (__rank == 1) >> >>> return 1; >> >>> else if (__r == __rank - 1) >> >>> return 1; >> >>> else if constexpr (__rank == 2) >> >>> return __exts.extent(1); >> >>> else if constexpr >> >>> (__all_dynamic(std::span(__sta_exts).last(__rank-1))) >> >>> return __extents_prod(__exts, 1, __r + 1, __rank); >> >>> else >> >>> { >> >>> size_t __sta_prod = >> __rev_partial_prods<__sta_exts>[__r-1]; >> >> // >> >>> size reduced by one here. >> >>> return __extents_prod(__exts, __sta_prod, __r + 1, >> __rank); >> >>> } >> >>> >> >>> Regards, >> >>> Tomasz >> >>> >> >>> >> >>> On Mon, Aug 4, 2025 at 7:51 PM Luc Grosheintz < >> luc.groshei...@gmail.com> >> >>> wrote: >> >>> >> >>>> >> >>>> >> >>>> On 8/4/25 17:42, Tomasz Kaminski wrote: >> >>>>> On Mon, Aug 4, 2025 at 1:14 PM Tomasz Kaminski <tkami...@redhat.com >> > >> >>>> wrote: >> >>>>> >> >>>>>> >> >>>>>> >> >>>>>> On Mon, Aug 4, 2025 at 1:08 PM Luc Grosheintz < >> >> luc.groshei...@gmail.com >> >>>>> >> >>>>>> wrote: >> >>>>>> >> >>>>>>> Hi Tomasz, >> >>>>>>> >> >>>>>>> Thank you for the review! Sorry about the missing parens, even >> after >> >>>>>>> "Ctrl+F"ing each of the emails I can't find the requires clause >> (the >> >>>>>>> two I found both need the parens). >> >>>>>>> >> >>>>>> I was mentioning this one, but it is possible that I am wrong on >> this >> >>>> one, >> >>>>>> I haven't yet checked it, as I was planning to do it during full >> >> review. >> >>>>>> template<array _Extents> >> >>>>>> requires (__all_dynamic<_Extents>()) >> >>>>>> class _StaticExtents<_Extents>; >> >>>>>> But regardless, I will make that change locally if it does work, so >> >> you >> >>>> do >> >>>>>> not >> >>>>>> need to update patch series for it. >> >>>>>> >> >>>>> Parenthesis seem to be indeed required here, so my comment was a red >> >>>>> herring. >> >>>> >> >>>> I'm pretty sure that this is the reason why I wrongly concluded >> >>>> that the pattern is `requires(cond)`, i.e. the parens are always >> >>>> needed. >> >>>> >> >>>>> >> >>>>>> >> >>>>>>> Do you think it's possible to merge this series before the other >> two >> >>>>>>> outstanding patch series? There's some risk of collision; making >> >>>> changes >> >>>>>>> to this patch series is much more time consuming than rebasing >> (and >> >>>>>>> retesting the other two patch series). >> >>>>>>> >> >>>>>> This sounds very reasonable. I will prepare a stack of patches in >> your >> >>>>>> suggested >> >>>>>> order locally, and push them once they are approved. But it still >> make >> >>>>>> >> >>>>>>> >> >>>>>>> Thank you, >> >>>>>>> >> >>>>>> Thank you for your continued contributions. >> >>>>>> >> >>>>>>> Luc >> >>>>>>> >> >>>>>>> On 8/4/25 12:43, Tomasz Kaminski wrote: >> >>>>>>>> Hi, >> >>>>>>>> >> >>>>>>>> I this time I made a quick pass through all changes, before >> >> commenting >> >>>>>>> on >> >>>>>>>> first commits, >> >>>>>>>> and they look solid to me, and I haven't noticed anything I would >> >> like >> >>>>>>> to >> >>>>>>>> change (except parentheses >> >>>>>>>> around requires, but I will handle that locally). I will try to >> do a >> >>>>>>> full >> >>>>>>>> review during this week. >> >>>>>>>> >> >>>>>>>> Regards, >> >>>>>>>> Tomasz >> >>>>>>>> >> >>>>>>>> On Sun, Aug 3, 2025 at 10:59 PM Luc Grosheintz < >> >>>>>>> luc.groshei...@gmail.com> >> >>>>>>>> wrote: >> >>>>>>>> >> >>>>>>>>> The combined effect of this sequence of change is: >> >>>>>>>>> >> >>>>>>>>> * a reduction in the number of template instantiations, by >> >>>>>>>>> - avoiding needless dependency of IndexType, >> >>>>>>>>> - special formulas for low-rank extents, >> >>>>>>>>> - special formulas for (nearly) fully dynamic extents. >> >>>>>>>>> >> >>>>>>>>> * improved code quality, by >> >>>>>>>>> - precomputing partial products of the static extents, >> >>>>>>>>> - special cases for low-rank extents, >> >>>>>>>>> - rewriting the condition E[i] == dynamic_extent in a >> more >> >>>>>>>>> optimizer friendly manner. >> >>>>>>>>> - effectively loop-unrolling extents::operator==. >> >>>>>>>>> >> >>>>>>>>> While simplistic micro-benchmarking shows the effectiveness of >> >> these >> >>>>>>>>> changes, likely the stronger argument is presented in each >> commit: >> >>>>>>>>> a) each change removes needless complexity, >> >>>>>>>>> b) before/after examples of generated code show the >> >>>> effectiveness. >> >>>>>>>>> >> >>>>>>>>> Luc Grosheintz (8): >> >>>>>>>>> libstdc++: Reduce template instantiations in <mdspan>. >> >>>>>>>>> libstdc++: Precompute products of static extents. >> >>>>>>>>> libstdc++: Improve low-rank layout_{left,right}::stride. >> >>>>>>>>> libstdc++: Improve fully dynamic extents in mdspan. >> >>>>>>>>> libstdc++: Improve nearly fully dynamic extents in mdspan. >> >>>>>>>>> libstdc++: Reduce indirection in extents::extent. >> >>>>>>>>> libstdc++: Improve extents::operator==. >> >>>>>>>>> libstdc++: Replace numeric_limit with __int_traits in >> mdspan. >> >>>>>>>>> >> >>>>>>>>> libstdc++-v3/include/std/mdspan | 282 >> >>>>>>> +++++++++++++----- >> >>>>>>>>> .../mdspan/extents/class_mandates_neg.cc | 3 + >> >>>>>>>>> 2 files changed, 208 insertions(+), 77 deletions(-) >> >>>>>>>>> >> >>>>>>>>> -- >> >>>>>>>>> 2.50.0 >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>> >> >>>> >> >>>> >> >>> >> >> >> >> >> > >> >>