On Wed, Aug 4, 2021 at 1:19 PM Maged Michael <maged.mich...@gmail.com> wrote:
> Sorry. I totally missed the rest of your message and the patch. My fuzzy
> eyesight, which usually guesses correctly 90% of the time, mistook
> "Secondly" on a line by itself for "Sincerely" :-)
>
> On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely <jwakely....@gmail.com>
> wrote:
>
>> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
>> >
>> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
>> > >
>> > > This is the right patch. The previous one is missing noexcept. Sorry.
>> > >
>> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael <maged.mich...@gmail.com>
>> > > wrote:
>> > >>
>> > >> Please find attached an updated patch after incorporating Jonathan's
>> > >> suggestions.
>> > >>
>> > >> Changes from the last patch include:
>> > >> - Add a TSAN macro to bits/c++config.
>> > >> - Use separate constexpr bool-s for the conditions for lock-freedom,
>> > >>   double-width and alignment.
>> > >> - Move the code in the optimized path to a separate function
>> > >>   _M_release_double_width_cas.
>> >
>> > Thanks for the updated patch. At a quick glance it looks great. I'll
>> > apply it locally and test it tomorrow.
>>
>> It occurs to me now that _M_release_double_width_cas is the wrong
>> name, because we're not doing a DWCAS, just a double-width load. What
>> do you think about renaming it to _M_release_combined instead? Since
>> it does a combined load of the two counts.
>>
>> I'm no longer confident in the alignof suggestion I gave you.
>>
>> +    constexpr bool __double_word
>> +      = sizeof(long long) == 2 * sizeof(_Atomic_word);
>> +    // The ref-count members follow the vptr, so are aligned to
>> +    // alignof(void*).
>> +    constexpr bool __aligned = alignof(long long) <= alignof(void*);
>>
>> For IA32 (i.e. 32-bit x86) this constant will be true, because
>> alignof(long long) and alignof(void*) are both 4, even though
>> sizeof(long long) is 8. So in theory the _M_use_count and
>> _M_weak_count members could be in different cache lines, and the
>> atomic load will not be atomic.
>> I think we want to use __alignof(long long) here to get the
>> "natural" alignment, not the smaller 4B alignment allowed by the
>> SysV ABI. That means that we won't do this optimization on IA32,
>> but I think that's acceptable.
>>
>> Alternatively, we could keep using alignof, but with an additional
>> run-time check something like (uintptr_t)&_M_use_count / 64 ==
>> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache
>> line. I think for now we should just use __alignof and live without
>> the optimization on IA32.
>>
>> Secondly:
>>
>> +      void
>> +      __attribute__ ((noinline))
>>
>> This needs to be __noinline__ because noinline is not a reserved word,
>> so users can do:
>> #define noinline 1
>> #include <memory>
>>
>> Was it performance profiling, or code-size measurements (or something
>> else?) that showed this should be non-inline?
>> I'd like to add a comment explaining why we want it to be noinline.
>
> The noinlining was based on looking at generated code. That was for
> clang. It was inlining the _M_last_use function for every instance of
> _M_release (e.g., ~shared_ptr). This optimization with the noinline for
> _M_release_last_use ended up reducing massive binary text sizes by 1.5%
> (several megabytes), which was an extra benefit. Without the noinline we
> saw code size increase.
>
> IIUC, we can use the following. Right?
>
> __attribute__((__noinline__))
>
> I didn't understand the part about programmers doing #define noinline 1.
> I don't see code in the patch that uses noinline.
>
> How about something like this comment?
>
> // Noinline to avoid code size increase.
>
> In that function ...
>>
>> +      _M_release_last_use() noexcept
>> +      {
>> +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>> +        _M_dispose();
>> +        if (_Mutex_base<_Lp>::_S_need_barriers)
>> +          {
>> +            __atomic_thread_fence (__ATOMIC_ACQ_REL);
>> +          }
>>
>> I think we can remove this fence.
>> The _S_need_barriers constant is only true for the _S_mutex policy,
>> and that just calls _M_release_orig(), so never uses this function.
>> I'll remove it and add a comment noting that the lack of barrier is
>> intentional.
>>
>> +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>> +        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
>> +                                                   -1) == 1)
>> +          {
>> +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>> +            _M_destroy();
>> +          }
>> +      }
>>
>> Alternatively, we could keep the fence in _M_release_last_use() and
>> refactor the other _M_release functions, so that we have explicit
>> specializations for each of:
>> _Sp_counted_base<_S_single>::_M_release() (already present)
>> _Sp_counted_base<_S_mutex>::_M_release()
>> _Sp_counted_base<_S_atomic>::_M_release()
>>
>> The second and third would be new, as currently they both use the
>> definition in the primary template. The _S_mutex one would just
>> decrement _M_use_count then call _M_release_last_use() (which still
>> has the barrier needed for the _S_mutex policy). The _S_atomic one
>> would have your new optimization. See the attached patch showing what
>> I mean. I find this version a bit simpler to understand, as we just
>> have _M_release and _M_release_last_use, without
>> _M_release_double_width_cas and _M_release_orig. What do you think of
>> this version? Does it lose any important properties of your version
>> which I've failed to notice?
>
> Unless I missed something I don't think the patch is correct. IIUC it uses
> _M_release_last_use as if it is equivalent to what was _M_release_orig,
> which is not true. The logic in _M_release_last_use assumes that the use
> count has already been decremented and the return value was 1. So I don't
> think the refactor works. Please correct me if I missed something.

Actually I take back what I said. Sorry. I think the logic in your patch is
correct. I missed some of the atomic decrements. But I'd be concerned about
performance.
If we make _M_release_last_use noinline then we are adding overhead to the
fast path of the original logic (where both counts are 1). What do you
think?

> If I am not mistaken, how about I prepare a new patch that incorporates
> the following?
> - __noinline__
> - The change to __align in my last message.
> - Remove the unnecessary checks based on _S_need_barriers in the
>   combined / last_use path.
> - Going back to having _M_release_orig and _M_release_combined
> - Include all of the comments that you added in your patch.
>
> Would that be OK with you?
>
> Thanks,
> Maged