Hi, do you mean that even without adding new compiler options the compiler
will do the loop unrolling and use SIMD instructions automatically? Besides
the function calls, cache misses, etc., I think VOPS also squeezes the call
stack (one call per tile instead of one per row), while the JIT-optimized
code still processes rows one by one.
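For concreteness, the kind of loop in question looks like this (a standalone
sketch, not the actual VOPS code). As far as I know, GCC and Clang will
unroll and vectorize such a loop at -O3 without any special options, with
one caveat: a floating-point reduction is only vectorized when the compiler
is allowed to reassociate the additions (e.g. -ffast-math or
-fassociative-math), because SIMD changes the summation order:

    #define TILE_SIZE 64

    /* Compiled with "gcc -O3 -ffast-math", this loop is typically
     * unrolled and turned into packed SIMD additions with no source
     * changes; without -ffast-math the float reduction stays scalar. */
    static double
    tile_sum(const double payload[TILE_SIZE])
    {
        double sum = 0.0;

        for (int i = 0; i < TILE_SIZE; i++)
            sum += payload[i];
        return sum;
    }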
On Thu, Nov 28, 2019 at 3:08 PM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:
>
> On 27.11.2019 19:05, Tomas Vondra wrote:
> > On Wed, Nov 27, 2019 at 06:38:45PM +0300, Konstantin Knizhnik wrote:
> >>
> >> On 25.11.2019 18:24, Merlin Moncure wrote:
> >>> On Mon, Nov 25, 2019 at 9:09 AM Konstantin Knizhnik
> >>> <k.knizh...@postgrespro.ru> wrote:
> >>>> Why was JIT not able to significantly (several times) increase the
> >>>> speed of the Q1 query?
> >>>> The experiment with VOPS shows that the aggregation algorithm itself
> >>>> is not a bottleneck.
> >>>> The profile also gives no answer to this question.
> >>>> Any ideas?
> >>>
> >>> Well, in the VOPS variant around 2/3 of the time is spent in routines
> >>> that are obviously aggregation. In the JIT version, it's around 20%.
> >>> So this suggests that the replacement execution engine is more
> >>> invasive. I would also guess (!) that the VOPS engine optimizes fewer
> >>> classes of query plan. ExecScan, for example, looks to be completely
> >>> optimized out in VOPS but is still utilized in the JIT engine.
> >>
> >> The difference in the fraction of time spent in aggregate calculation
> >> is not so large (2 times vs. 10 times).
> >> I suspected that a lot of time is spent in relation traversal code,
> >> tuple unpacking and visibility checks.
> >> To check this hypothesis I have implemented an in-memory table access
> >> method which stores tuples in unpacked form and doesn't perform any
> >> visibility checks at all.
> >> The results were not so exciting. I had to disable parallel execution
> >> (because it is not possible for tuples stored in backend-private
> >> memory).
> >> The results are the following:
> >>
> >>     lineitem:       13736 msec
> >>     inmem_lineitem: 10044 msec
> >>     vops_lineitem:   1945 msec
> >>
> >> The profile of inmem_lineitem is the following:
> >>
> >>     16.79%  postgres  postgres       [.] float4_accum
> >>     12.86%  postgres  postgres       [.] float8_accum
> >>      5.83%  postgres  postgres       [.] TupleHashTableHash.isra.8
> >>      4.44%  postgres  postgres       [.] lookup_hash_entries
> >>      3.37%  postgres  postgres       [.] check_float8_array
> >>      3.11%  postgres  postgres       [.] tuplehash_insert
> >>      2.91%  postgres  postgres       [.] hash_uint32
> >>      2.83%  postgres  postgres       [.] ExecScan
> >>      2.56%  postgres  postgres       [.] inmem_getnextslot
> >>      2.22%  postgres  postgres       [.] FunctionCall1Coll
> >>      2.14%  postgres  postgres       [.] LookupTupleHashEntry
> >>      1.95%  postgres  postgres       [.] TupleHashTableMatch.isra.9
> >>      1.76%  postgres  postgres       [.] pg_detoast_datum
> >>      1.58%  postgres  postgres       [.] AggCheckCallContext
> >>      1.57%  postgres  postgres       [.] tts_minimal_clear
> >>      1.35%  postgres  perf-3054.map  [.] 0x00007f558db60010
> >>      1.23%  postgres  postgres       [.] fetch_input_tuple
> >>      1.15%  postgres  postgres       [.] SeqNext
> >>      1.06%  postgres  postgres       [.] ExecAgg
> >>      1.00%  postgres  postgres       [.] tts_minimal_store_tuple
> >>
> >> So now the fraction of time spent in aggregation has increased to 30%
> >> (vs. 20% for lineitem and 42% for vops_lineitem).
> >> It looks like the main bottleneck now is hashagg. VOPS accesses the
> >> hash table about 10 times less often (because it accumulates values
> >> for the whole tile), and that explains the still-large difference
> >> between vops_lineitem and inmem_lineitem.
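If the grouping key is constant within a tile, a tile-at-a-time executor
only needs one hash-table probe per 64 values, while a row-at-a-time
executor probes once per row. A rough illustration of that difference
(tile_t, agg_state and hash_lookup are made-up names for this sketch, not
the VOPS or Postgres API; the toy table stands in for LookupTupleHashEntry):

    #include <stddef.h>
    #include <stdint.h>

    #define TILE_SIZE 64

    typedef struct { double sum; } agg_state;
    typedef struct { int32_t key; double val[TILE_SIZE]; } tile_t;

    /* Toy stand-in for the real hash table: direct-mapped, no collisions. */
    static agg_state groups[1024];

    static agg_state *hash_lookup(int32_t key)
    {
        return &groups[(uint32_t) key & 1023];
    }

    /* Row-at-a-time: one hash probe per input row. */
    void agg_rows(const int32_t *keys, const double *vals, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            hash_lookup(keys[i])->sum += vals[i];       /* n probes */
    }

    /* Tile-at-a-time: one hash probe per 64 values. */
    void agg_tiles(const tile_t *tiles, size_t ntiles)
    {
        for (size_t t = 0; t < ntiles; t++)
        {
            agg_state *s = hash_lookup(tiles[t].key);   /* n/64 probes */

            for (int i = 0; i < TILE_SIZE; i++)         /* vectorizable */
                s->sum += tiles[t].val[i];
        }
    }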
> >>
> >> If we remove grouping (the hash aggregation) and rewrite the Q1 query
> >> as:
> >>
> >>     select
> >>         avg(l_quantity) as sum_qty,
> >>         avg(l_extendedprice) as sum_base_price,
> >>         avg(l_extendedprice*(1-l_discount)) as sum_disc_price,
> >>         avg(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
> >>         avg(l_quantity) as avg_qty,
> >>         avg(l_extendedprice) as avg_price,
> >>         avg(l_discount) as avg_disc,
> >>         count(*) as count_order
> >>     from
> >>         inmem_lineitem
> >>     where
> >>         l_shipdate <= '1998-12-01';
> >>
> >> then the results are the following:
> >>
> >>     lineitem:        9805 msec
> >>     inmem_lineitem:  6257 msec
> >>     vops_lineitem:   1865 msec
> >>
> >> and now the profile of inmem_lineitem is:
> >>
> >>     25.27%  postgres  postgres  [.] float4_accum
> >>     21.86%  postgres  postgres  [.] float8_accum
> >>      5.49%  postgres  postgres  [.] check_float8_array
> >>      4.57%  postgres  postgres  [.] ExecScan
> >>      2.61%  postgres  postgres  [.] AggCheckCallContext
> >>      2.30%  postgres  postgres  [.] pg_detoast_datum
> >>      2.10%  postgres  postgres  [.] inmem_getnextslot
> >>      1.81%  postgres  postgres  [.] SeqNext
> >>      1.73%  postgres  postgres  [.] fetch_input_tuple
> >>      1.61%  postgres  postgres  [.] ExecAgg
> >>      1.23%  postgres  postgres  [.] MemoryContextReset
> >>
> >> But there is still a more than 3x difference with VOPS!
> >> Something is wrong here...
> >>
> >
> > I have no idea what VOPS does, but IIRC one of the bottlenecks compared
> > to various column stores is our iterative execution model, which makes
> > it difficult/impossible to vectorize operations. That's likely why the
> > accum functions are so high in the CPU profile.
> >
> > regards
>
> VOPS is doing a very simple thing: it replaces scalar types with vector
> types (tiles) and defines all the standard operations for them.
> It also provides Postgres aggregates for these types.
> So while for a normal Postgres table the query
>
>     select sum(x) from T;
>
> calls float4_accum for each row of T, the same query in VOPS calls
> vops_float4_avg_accumulate for each tile, which contains 64 elements.
> So vops_float4_avg_accumulate is called 64 times less often than
> float4_accum, and inside it contains a straightforward loop:
>
>     for (i = 0; i < TILE_SIZE; i++) {
>         sum += opd->payload[i];
>     }
>
> which can be optimized by the compiler (loop unrolling, use of SIMD
> instructions, ...).
> So it is no wonder that VOPS is faster than the Postgres executor.
> But Postgres now contains JIT, and it is used in this case, so the
> interpretation overhead of the executor should be mostly eliminated.
> In theory, perfect JIT code should process rows of a horizontal data
> model at the same speed as a vector executor processing columns of a
> vertical data model.
> The vertical model provides significant advantages when a query touches
> only a small fraction of the columns.
> But in the case of Q1 we are calculating 8 aggregates over just 4
> columns, and inmem_lineitem is actually a projection of the original
> lineitem table containing only the columns needed for this query.
> So the amount of fetched data in this case is almost the same for the
> horizontal and vertical data models.
> Effects of CPU caches should also not play a significant role in this
> case.
> That is why it is not quite clear to me why there is still a big
> difference (3 times) between VOPS and the in-memory table, and not such
> a large difference between the normal and in-memory tables.
>
> Concerning the large percentage of time spent in the accumulate
> functions - I do not agree with you. What this query is actually doing
> is just calculating aggregates: the smaller the interpretation overhead,
> the larger the percentage of time we should spend in the aggregate
> functions.
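Concretely, each call to float4_accum goes through an indirect fmgr call,
pg_detoast_datum() and check_float8_array() before updating a 3-element
float8 transition array (count, sum, and a squared-deviation term); a
tile aggregate pays those fixed per-call costs once per 64 values. A
simplified sketch of the two shapes (illustrative code, not the actual
Postgres or VOPS sources; the real float8_accum uses a more numerically
careful update):

    #define TILE_SIZE 64

    typedef struct { double n, sx, sxx; } accum_state;  /* ~ float8[3] */

    /* Row path: reached once per row through an indirect fmgr call,
     * after the state array has been detoasted and validated. */
    void accum_row(accum_state *s, double x)
    {
        s->n   += 1.0;
        s->sx  += x;
        s->sxx += x * x;
    }

    /* Tile path: the fixed per-call costs are amortized over TILE_SIZE
     * values, and the inner loop is a plain array scan the compiler can
     * unroll and vectorize. */
    void accum_tile(accum_state *s, const double payload[TILE_SIZE])
    {
        for (int i = 0; i < TILE_SIZE; i++)
        {
            s->n   += 1.0;
            s->sx  += payload[i];
            s->sxx += payload[i] * payload[i];
        }
    }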
> Maybe the whole infrastructure of Postgres aggregates adds too much
> overhead (check_float8_array, function calls, ...), and in the case of
> VOPS this overhead is divided by 64.
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company


--
Guang-Nan He