On Fri, 25 Oct 2024 16:47:42 GMT, Maurizio Cimadamore <mcimadam...@openjdk.org> wrote:
>> This PR fixes an issue where passing many by-reference parameters to >> downcall results in escape analysis failures. >> The problem is that, as the parameters grow, the generated code in the >> trampoline stub we generate also grows. >> When it reaches a certain threshold, it becomes too big, and it is no longer >> inlined in the caller. >> When this happens, allocations for by-reference parameters (e.g. a segment >> constructed from `MemorySegment::ofAddress`) can no longer be eliminated. >> >> The solution is two-fold. First, we annotate the generated trampoline with >> `@ForceInline`. After all, it is rather critical, to guarantee optimal >> performance, that this code can be always inlined. >> Second, to keep the size of the generated code under control, we also put a >> limit on the max number of comparisons that are generated in order to >> "deduplicate" scope acquire/release calls. >> The deduplication logic is a bit finicky -- it was put in place because, >> when confined and shared are passed by-ref, we need to prevent them from >> being closed in the middle of a native call. >> So, we save all the seen scopes in a bunch of locals, and then we compare >> each new scope with _all_ the previous cached locals, and skip acquire if we >> can. >> >> While this strategy work it does not scale when there are many by-ref >> parameters - as a by-ref parameter N will need N - 1 comparisons - which >> means a quadratic number of comparisons is generated. >> This is fixed in this PR by putting a lid on the maximum number of >> comparisons that are generated. We also make the comparisons a bit smarter, >> by always skipping the very first by-ref argument -- the downcall address. >> It is in fact very common for the downcall address to be in a different >> scope than that of the other by-ref arguments anyway. >> >> A nice property of the new logic is that by configuring the max number of >> comparisons we can effectively select between different strategies: >> * max = 0, means no dedup >> * max = 1, means one-element cache >> * max = N, means full dedup (like before) >> >> Thanks to Ioannis (@spasi) for the report and the benchmark. I've beefed the >> benchmark up by adding a case for 10 arguments, and also adding support for >> critical downcalls, so we could also test passing by-ref heap segments. >> Benchmark result will be provided in a separate comment. > > Maurizio Cimadamore has updated the pull request incrementally with one > additional commit since the last revision: > > Fix issue with native benchmark lib Marked as reviewed by jvernee (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/21706#pullrequestreview-2396072745