On 12/8/21 16:51, Ronan Dunklau wrote:
> On Thursday, September 9, 2021 15:37:59 CET, Tomas Vondra wrote:
>> And now comes the funny part - if I run it in the same backend as the
>> "full" benchmark, I get roughly the same results:
>>
>>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
>> ------------+------------+---------------+----------+---------
>>       32768 |        512 |     806256640 |    37159 |   76669
>>
>> but if I reconnect and run it in the new backend, I get this:
>>
>>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
>> ------------+------------+---------------+----------+---------
>>       32768 |        512 |     806158336 |   233909 |  100785
>> (1 row)
>>
>> It does not matter if I wait a bit before running the query, if I run it
>> repeatedly, etc. The machine is not doing anything else, the CPU is set
>> to use the "performance" governor, etc.
>
> I've reproduced the behaviour you mention.
> I also noticed asm_exc_page_fault showing up in the perf report in that
> case.
>
> Running an strace on it shows that in one case we have a lot of brk calls,
> while when we run in the same process as the previous tests, we don't.
>
> My suspicion is that the previous workload makes glibc malloc change its
> trim_threshold and possibly other dynamic options, which leads to
> constantly moving the brk pointer in one case and not the other.
>
> Running your fifo test with absurd malloc options shows that this might
> indeed be the case (I needed to change several, because changing one
> disables the dynamic adjustment for every single one of them, and malloc
> would fall back to using mmap and freeing it on each iteration):
>
> mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
> mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
> mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024 * sizeof(long));
>
> I get the following results for your self-contained test. I ran the query
> twice in each case, to see the difference between the first run and the
> subsequent ones.
>
> With default malloc options:
>
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     795836416 |   300156 |  207557
>
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     795836416 |   211942 |   77207
>
> With the oversized values above:
>
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     795836416 |   219000 |   36223
>
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     795836416 |    75761 |   78082
> (1 row)
>
> I can't tell how representative your benchmark extension would be of
> real-life allocation / free patterns, but there is probably something we
> can improve here.
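For reference, a minimal standalone sketch of the tuning Ronan describes
above. The mallopt() values are the ones from his mail; the surrounding
FIFO-style allocation loop is only an illustration, not the actual fifo
benchmark extension that produced the numbers quoted in this thread.

    /*
     * malloc_tune.c - compare glibc brk/mmap traffic with and without
     * oversized malloc thresholds.
     *
     * Build: gcc -O2 malloc_tune.c -o malloc_tune
     * Run:   strace -c -e brk,mmap,munmap ./malloc_tune [tune]
     */
    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NCHUNKS     4096
    #define CHUNK_SIZE  512
    #define NROUNDS     1000

    int
    main(int argc, char **argv)
    {
        char   *chunks[NCHUNKS];
        int     i, r;

        if (argc > 1 && strcmp(argv[1], "tune") == 0)
        {
            /*
             * Values from Ronan's mail. Setting one of these disables the
             * dynamic adjustment of all of them, hence all three are set.
             */
            mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
            mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
            mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024 * sizeof(long));
        }

        /* FIFO-style churn: allocate a batch, then free it oldest-first. */
        for (r = 0; r < NROUNDS; r++)
        {
            for (i = 0; i < NCHUNKS; i++)
                chunks[i] = malloc(CHUNK_SIZE);
            for (i = 0; i < NCHUNKS; i++)
                free(chunks[i]);
        }

        printf("done\n");
        return 0;
    }

Running it with and without the "tune" argument makes the difference in
brk traffic easy to see in the strace summary.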
Thanks for looking at this. I think those allocation / free patterns are
fairly extreme, and there probably are no workloads doing exactly this. The
idea is that actual workloads are likely some combination of these extreme
cases.

> I'll try to see if I can understand more precisely what is happening.

Thanks, that'd be helpful. Maybe we can learn something about tuning malloc
parameters to get significantly better performance.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company