Hi Iain, I do not currently have metrics for this, but I can look into generating them, however I currently do not know of any large open-source projects using coroutines that I could use for this; I was thinking about using cppcoro unit tests, but they mostly contain very simple coroutines. I have source for ~20k LoC proprietary project relying heavily on coroutines (using boost::asio::awaitable), but here I cannot show the source along with the numbers - would this be enough or should I look for more open source projects?
thanks, Michal On May 14 2023, at 6:07 pm, Iain Sandoe <i...@sandoe.co.uk> wrote: > Hi Michal, > >> On 14 May 2023, at 16:36, Michal Jankovič >> <michal.jankovi...@gmail.com> wrote: >> >> Rebased the patch to GCC 14 trunk. Bootstrapped and regression tested >> again on x86_64-pc-linux-gnu, only difference is the new test failing >> without the patch. > > (as previously noted, I am much in favour of this optimisation) > > Do you have any metrics on the reductions in frame size for realistic > coroutines? > > thanks > Iain > >> >> On Jul 13 2022, at 2:54 pm, Michal Jankovic >> <michal.jankovi...@gmail.com> wrote: >> >>> Hi Iain, >>> >>> thanks for the info. I have some follow-up questions. >>> >>> On Jul 12 2022, at 7:11 pm, Iain Sandoe <i...@sandoe.co.uk> wrote: >>> >>>> Hi Michal, >>>> >>>>> On 12 Jul 2022, at 16:14, Michal Jankovič >>>>> <michal.jankovi...@gmail.com> wrote: >>>> >>>>> One other related thing I would like to investigate is reducing the >>>>> number of compiler generated variables in the frame, particularly >>>>> _Coro_destroy_fn and _Coro_self_handle. >>>>> >>>>> As I understand it, _Coro_destroy_fn just sets a flag in >>>>> _Coro_resume_index and calls _Coro_resume_fn; it should be >>>>> possible to >>>>> move this logic to __builtin_coro_destroy, so that only _Coro_resume_fn >>>>> is stored in the frame; >>>> >>>> That is a particular point about GCC’s implementation … (it is not >>>> neccesarily, or even >>>> likely to be the same for other implementations) - see below. >>>> >>>> I was intending to do experiment with making the ramp/resume/destroy >>>> value a parameter >>>> to the actor function so that we would have something like - >>>> >>>> ramp calls actor(frame, 0) >>>> resume calls actor(frame, 1) >>>> destroy calls actor(frame, 2) >>>> - the token values are illustrative, not intended to be a final version. >>>> >>>> I think that should allow for more inlining opportunites and possibly >>>> a way forward to >>>> frame elision (a.k.a halo). >>>> >>>>> this would however change the coroutine ABI - I don't know if that's >>>>> a problem. >>>> >>>> The external ABI for the coroutine is the >>>> resume, >>>> destroy pointers >>>> and the promise >>>> and that one can find each of these from the frame pointer. >>>> >>>> This was agreed between the interested “vendors” so that one compiler >>>> could invoke >>>> coroutines built by another. So I do not think this is so much a >>>> useful area to explore. >>>> >>> >>> I understand. I still want to try to implement a more light-weight frame >>> layout with just one function pointer; would it be possible to merge >>> such a change if it was made opt-in via a compiler flag, eg >>> `-fsmall-coroutine-frame`? My use-case for this is embedded environments >>> with very limited memory, and I do not care about interoperability with >>> other compilers there. >>> >>>> Also the intent is that an indirect call through the frame pointer is >>>> the most frequent >>>> operation so should be the most efficient. >>>> resume() might be called many times, >>>> destroy() just once thus it is a cold code path >>>> - space can be important too - but interoperability was the goal here. >>>> >>>>> The _Coro_self_handle should be constructible on-demand from the >>>>> frame address. >>>> >>>> Yes, and in the header the relevant items are all constexpr - so that >>>> should happen in the >>>> user’s code. I elected to have that value in the frame to avoid >>>> recreating it each time - I >>>> suppose that is a trade-off of one oiptimisation c.f. another … >>> >>> If the handle construction cannot be optimized out, and its thus >>> a tradeoff between frame size and number of instructions, then this >>> could also be enabled by a hypothetical `-fsmall-coroutine-frame`. >>> >>> Coming back to this: >>> >>>>>> (the other related optimisation is to eliminate frame entries for >>>>>> scopes without any suspend >>>>>> points - which has the potential to save even more space for code with >>>>>> sparse use of co_xxxx) >>> >>> This would be nice; although it could encompassed by a more general >>> >>> optimization - eliminate frame entries for all variables which are not >>> >>> accessed (directly or via pointer / reference) beyond a suspend point. >>> To be fair, I do not know how to get started on such an optimization, >>> or if it is even possible to do on the frontend. This would however be >>> immensely useful for reducing the frame size taken-up by complicated >>> co_await expressions (among other things), for example, if I have a >>> composed operation: >>> >>> co_await when_either(get_leaf_awaitable_1(), get_leaf_awaitable_2()); >>> >>> Right now, this creates space in the frame for the temporary 'leaf' >>> >>> awaitables, which were already moved into the composed awaitable. >>> If the awaitable has an operator co_await that returns the real awaiter, >>> the original awaitable is also stored in the frame, even if it >>> is not referenced by the awaiter; another unused object gets stored if >>> >>> the .await_transform() customization point was used. >>> >>> What are your thoughts on the feasibility / difficulty of implementing >>> such an optimization? >>> >>> Michal >>> >>>>> >>>>> Do you have any advice / opinions on this before I try to >>>>> implement it? >>>> >>>> Hopefully, the notes above help. >>>> >>>> I will rebase my latest code changes as soon as I have a chance and >>>> put them somewhere >>>> for you to look at - basically, these are to try and address the >>>> correctness issues we face, >>>> >>>> Iain >>>> >>>> >>>>> >>>>> Michal >>>>> >>>>> On Jul 12 2022, at 4:08 pm, Iain Sandoe <i...@sandoe.co.uk> wrote: >>>>> >>>>>> Hi Michal, >>>>>> >>>>>>> On 12 Jul 2022, at 14:35, Michal Jankovič via Gcc-patches >>>>>>> <gcc-patches@gcc.gnu.org> wrote: >>>>>>> >>>>>>> Currently, coroutine frames store all variables of a coroutine >>>>>>> separately, >>>>>>> even if their lifetime does not overlap (they are in distinct >>>>>>> scopes). This >>>>>>> patch implements overlapping distinct variable scopes in the >>>>>>> coroutine frame, >>>>>>> by storing the frame fields in nested unions of structs. This lowers >>>>>>> the size >>>>>>> of the frame for larger coroutines significantly, and makes them >>>>>>> more usable >>>>>>> on systems with limited memory. >>>>>> >>>>>> not a review (I will try to take a look at the weekend). >>>>>> >>>>>> but … this is one of the two main optimisations on my TODO - so cool >>>>>> for doing it. >>>>>> >>>>>> (the other related optimisation is to eliminate frame entries for >>>>>> scopes without any suspend >>>>>> points - which has the potential to save even more space for code with >>>>>> sparse use of co_xxxx) >>>>>> >>>>>> Iain >>>>>> >>>>>>> Bootstrapped and regression tested on x86_64-pc-linux-gnu; new >>>>>>> test fails >>>>>>> before the patch and succeeds after with no regressions. >>>>>>> >>>>>>> PR c++/105989 >>>>>>> >>>>>>> gcc/cp/ChangeLog: >>>>>>> >>>>>>> * coroutines.cc (struct local_var_info): Add field_access_path. >>>>>>> (build_local_var_frame_access_expr): New. >>>>>>> (transform_local_var_uses): Use >>>>>>> build_local_var_frame_access_expr. >>>>>>> (coro_make_frame_entry_id): New. >>>>>>> (coro_make_frame_entry): Delegate to coro_make_frame_entry_id. >>>>>>> (struct local_vars_frame_data): Add orig, field_access_path. >>>>>>> (register_local_var_uses): Generate new frame layout. Create >>>>>>> access >>>>>>> paths to vars. >>>>>>> (morph_fn_to_coro): Set new fields in local_vars_frame_data. >>>>>>> >>>>>>> >>>>>>> gcc/testsuite/ChangeLog: >>>>>>> >>>>>>> * g++.dg/coroutines/pr105989.C: New test. >>>>>>> >>>>>>> <pr105989.patch> >>>>>> >>>>>> >>>> >>>> >> <pr105989.patch> > >