On Mon, Jul 14, 2014 at 12:19 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2014-07-14 11:24:26 -0400, Robert Haas wrote:
>> On Sun, Jul 13, 2014 at 2:23 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>> > The actual if (lock != NULL) bit costs significant amounts of cycles?
>> > I'd have assumed that branch prediction takes care of that. Or is it
>> > actually the icache not keeping up? Did you measure icache vs. dcache
>> > misses?
>> > Have you played with unlikely()/likely() type of macros?
>>
>> I have not. I think it's really got more to do with how much stuff
>> needs to be saved in the stack frame, but I might be wrong about that.
>
> I don't really see why that'd play such a big role. The register
> pressure on ppc/amd64 shouldn't be high enough that the (lock != NULL)
> alone requires to push anything on the stack. Sure, the call to
> LWLockAcquire() will, but if that's done in the separate branch...
I can't tell you for sure what the cause is; I just saw that the
difference was very significant.

>> I found that it was possible to buy back most of the cost by
>> replacing (*context->methods->free_p) (context, pointer); with a
>> hard-coded AllocSetFree(context, pointer), so that gives you some idea
>> what order of magnitude we're talking about here.
>
> That's measured with a microbenchmark or actual postgres workloads?
> Because in my measurements there wasn't consistent benefit in doing so
> even when benchmarking workloads where allocation is a bottleneck.

Microbenchmark.

>> >> 6. In general, I'm worried that it's going to be hard to keep the
>> >> overhead of parallel sort from leaking into the non-parallel case.
>> >> With the no-allocator approach, every place that uses
>> >> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
>> >> DSM and non-DSM cases differently, which isn't great for either
>> >> performance or maintainability. And even with an allocator, the
>> >> SortTuple array will need to use relative pointers in a DSM; that
>> >> might burden the non-DSM case.
>> >
>> > Yes, I share that concern.
>> >
>> > I somewhat doubt that tuplesort.c really is the right layer to do the
>> > parallel part of parallel sort including the chunked storage. Why not
>> > pack the tuples outside it and teach tuplesort.c to not copy tuples for
>> > some _put* routine? Then tuplesort can entirely happen inside a single
>> > process and doesn't have to worry about indirect pointers and anything?
>> > Yes, it'll need slightly more memory, but that doesn't sound too bad.
>>
>> I'm not sure I understand what you're describing here:
>>
>> - the _put* routine has to do with how tuples get into the tuplesort;
>> but parallel sort has to do with how we get them in order once they're
>> in there
>> - having tuplesort happen inside a single process seems like the exact
>> opposite of parallel sort
>> - why would we need more memory?
>>
>> But having expressed my confusion, I'd certainly welcome further
>> comment on this point, because I think figuring out how to set up the
>> abstraction is in some ways the hardest part of this problem - and it
>> is certainly made harder by the complexity of the existing
>> abstraction.
>
> I think we're imagining quite different approaches to the problem. In my
> mind the parallelism part is essentially a layer *above* tuplesort.c,
> but I think you're thinking about a much more integrated approach.

So, I don't see how that can work. tuplesort.c is *the* entrypoint for
sorting throughout the backend. If you didn't use those same
entrypoints, you'd have to update every caller, which seems tedious and
rather pointless.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers