On Mon, Jul 14, 2014 at 12:19 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2014-07-14 11:24:26 -0400, Robert Haas wrote:
>> On Sun, Jul 13, 2014 at 2:23 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>> > The actual if (lock != NULL) bit costs significant amounts of cycles?
>> > I'd have assumed that branch prediction takes care of that. Or is it
>> > actually the icache not keeping up? Did you measure icache vs. dcache
>> > misses?
>> > Have you played with unlikely()/likely() type of macros?
>>
>> I have not. I think it's really got more to do with how much stuff
>> needs to be saved in the stack frame, but I might be wrong about that.
>
> I don't really see why that'd play such a big role. The register
> pressure on ppc/amd64 shouldn't be high enough that the (lock != NULL)
> alone requires to push anything on the stack. Sure, the call to
> LWLockAcquire() will, but if that's done in the separate branch...
I can't tell you for sure what the cause is; I just saw that the
difference was very significant.

>> I found that it was possible to buy back most of the cost by
>> replacing (*context->methods->free_p) (context, pointer); with a
>> hard-coded AllocSetFree(context, pointer), so that gives you some idea
>> what order of magnitude we're talking about here.
>
> That's measured with a microbenchmark or actual postgres workloads?
> Because in my measurements there wasn't consistent benefit in doing so
> even when benchmarking workloads where allocation is a bottleneck.

Microbenchmark.

>> >> 6. In general, I'm worried that it's going to be hard to keep the
>> >> overhead of parallel sort from leaking into the non-parallel case.
>> >> With the no-allocator approach, every place that uses
>> >> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
>> >> DSM and non-DSM cases differently, which isn't great for either
>> >> performance or maintainability. And even with an allocator, the
>> >> SortTuple array will need to use relative pointers in a DSM; that
>> >> might burden the non-DSM case.
>> >
>> > Yes, I share that concern.
>> >
>> > I somewhat doubt that tuplesort.c really is the right layer to do the
>> > parallel part of parallel sort including the chunked storage. Why not
>> > pack the tuples outside it and teach tuplesort.c to not copy tuples for
>> > some _put* routine? Then tuplesort can entirely happen inside a single
>> > process and doesn't have to worry about indirect pointers and anything?
>> > Yes, it'll need slightly more memory, but that doesn't sound too bad.
>>
>> I'm not sure I understand what you're describing here:
>>
>> - the _put* routine has to do with how tuples get into the tuplesort;
>> but parallel sort has to do with how we get them in order once they're
>> in there
>> - having tuplesort happen inside a single process seems like the exact
>> opposite of parallel sort
>> - why would we need more memory?
>>
>> But having expressed my confusion, I'd certainly welcome further
>> comment on this point, because I think figuring out how to set up the
>> abstraction is in some ways the hardest part of this problem - and it
>> is certainly made harder by the complexity of the existing
>> abstraction.
>
> I think we're imagining quite different approaches to the problem. In my
> mind the parallelism part is essentially a layer *above* tuplesort.c,
> but I think you're thinking about a much more integrated approach.

So, I don't see how that can work. tuplesort.c is *the* entrypoint for
sorting throughout the backend. If you didn't use those same
entrypoints, you'd have to update every caller, which seems tedious and
rather pointless.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers