On Thu, Feb 27, 2020 at 01:01:08PM -0800, Jeff Davis wrote: > Attached is a patch that exports a new function from logtape.c: > > extern LogicalTapeSet *LogicalTapeSetExtend( > LogicalTapeSet *lts, int nAdditional); > > The purpose is to allow adding new tapes dynamically for the upcoming > Hash Aggregation work[0]. HashAgg doesn't know in advance how many > tapes it will need, though only a limited number are actually active at > a time. > > This was proposed and originally written by Adam Lee[1] (extract only > the changes to logtape.c/h from his posted patch). Strangely, I'm > seeing ~5% regression with his version when doing: > > -- t20m_1_int4 has 20 million random integers > select * from t20m_1_int4 order by i offset 100000000; > > Which seems to be due to using a pointer rather than a flexible array > member (I'm using gcc[2]). My version (attached) still uses a flexible > array member, which doesn't see the regression; but I repalloc the > whole struct so the caller needs to save the new pointer to the tape > set. > > That doesn't entirely make sense to me, and I'm wondering if someone > else can repro that result and/or make a suggestion, because I don't > have a good explanation. I'm fine with my version of the patch, but it > would be nice to know why there's such a big difference using a pointer > versus a flexible array member. > > Regards, > Jeff Davis
I noticed another difference, I was using palloc0(), which could be one of the reason, but not sure. Tested your hashagg-20200226.patch on my laptop(Apple clang version 11.0.0), the average time is 25.9s: ``` create table t20m_1_int4(i int); copy t20m_1_int4 from program 'shuf -i 1-4294967295 -n 20000000'; analyze t20m_1_int4; ``` ``` explain analyze select * from t20m_1_int4 order by i offset 100000000; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=3310741.20..3310741.20 rows=1 width=4) (actual time=25666.471..25666.471 rows=0 loops=1) -> Sort (cost=3260740.96..3310741.20 rows=20000096 width=4) (actual time=20663.065..24715.269 rows=20000000 loops=1) Sort Key: i Sort Method: external merge Disk: 274056kB -> Seq Scan on t20m_1_int4 (cost=0.00..288496.96 rows=20000096 width=4) (actual time=0.075..2749.385 rows=20000000 loops=1) Planning Time: 0.109 ms Execution Time: 25911.542 ms (7 rows) ``` But if use the palloc0() or do the MemSet() like: ``` lts = (LogicalTapeSet *) palloc0(offsetof(LogicalTapeSet, tapes) + ntapes * sizeof(LogicalTape)); ... MemSet(lts->tapes, 0, lts->nTapes * sizeof(LogicalTape)); <--- this line doesn't matter as I observed, which makes a little sense the compiler might know it's already zero. ``` The average time goes up to 26.6s: ``` explain analyze select * from t20m_1_int4 order by i offset 100000000; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=3310741.20..3310741.20 rows=1 width=4) (actual time=26419.712..26419.712 rows=0 loops=1) -> Sort (cost=3260740.96..3310741.20 rows=20000096 width=4) (actual time=21430.044..25468.661 rows=20000000 loops=1) Sort Key: i Sort Method: external merge Disk: 274056kB -> Seq Scan on t20m_1_int4 (cost=0.00..288496.96 rows=20000096 width=4) (actual time=0.060..2839.452 rows=20000000 loops=1) Planning Time: 0.111 ms Execution Time: 26652.255 ms (7 rows) ``` -- Adam Lee