Hello.

At Mon, 18 Mar 2019 11:54:42 +0900, Masahiko Sawada <sawada.m...@gmail.com> wrote in <cad21aoc6bsm0ffepgzsv40uxofbfspe-ax095tonu5goz79...@mail.gmail.com>
> Here is the performance test results. I've setup a 500MB table with
> several indexes and made 10% of table dirty before each vacuum.
> Compared execution time of the patched postgrse with the current HEAD
> (at 'speed_up' column). In my environment,
>
>  indexes | parallel_degree |  patched   |    head    | speed_up
> ---------+-----------------+------------+------------+----------
>        0 |               0 |   238.2085 |   244.7625 |   1.0275
>        0 |               1 |   237.7050 |   244.7625 |   1.0297
>        0 |               2 |   238.0390 |   244.7625 |   1.0282
>        0 |               4 |   238.1045 |   244.7625 |   1.0280
>        0 |               8 |   237.8995 |   244.7625 |   1.0288
>        0 |              16 |   237.7775 |   244.7625 |   1.0294
>        1 |               0 |  1328.8590 |  1334.9125 |   1.0046
>        1 |               1 |  1325.9140 |  1334.9125 |   1.0068
>        1 |               2 |  1333.3665 |  1334.9125 |   1.0012
>        1 |               4 |  1329.5205 |  1334.9125 |   1.0041
>        1 |               8 |  1334.2255 |  1334.9125 |   1.0005
>        1 |              16 |  1335.1510 |  1334.9125 |   0.9998
>        2 |               0 |  2426.2905 |  2427.5165 |   1.0005
>        2 |               1 |  1416.0595 |  2427.5165 |   1.7143
>        2 |               2 |  1411.6270 |  2427.5165 |   1.7197
>        2 |               4 |  1411.6490 |  2427.5165 |   1.7196
>        2 |               8 |  1410.1750 |  2427.5165 |   1.7214
>        2 |              16 |  1413.4985 |  2427.5165 |   1.7174
>        4 |               0 |  4622.5060 |  4619.0340 |   0.9992
>        4 |               1 |  2536.8435 |  4619.0340 |   1.8208
>        4 |               2 |  2548.3615 |  4619.0340 |   1.8126
>        4 |               4 |  1467.9655 |  4619.0340 |   3.1466
>        4 |               8 |  1486.3155 |  4619.0340 |   3.1077
>        4 |              16 |  1481.7150 |  4619.0340 |   3.1174
>        8 |               0 |  9039.3810 |  8990.4735 |   0.9946
>        8 |               1 |  4807.5880 |  8990.4735 |   1.8701
>        8 |               2 |  3786.7620 |  8990.4735 |   2.3742
>        8 |               4 |  2924.2205 |  8990.4735 |   3.0745
>        8 |               8 |  2684.2545 |  8990.4735 |   3.3493
>        8 |              16 |  2672.9800 |  8990.4735 |   3.3635
>       16 |               0 | 17821.4715 | 17740.1300 |   0.9954
>       16 |               1 |  9318.3810 | 17740.1300 |   1.9038
>       16 |               2 |  7260.6315 | 17740.1300 |   2.4433
>       16 |               4 |  5538.5225 | 17740.1300 |   3.2030
>       16 |               8 |  5368.5255 | 17740.1300 |   3.3045
>       16 |              16 |  5291.8510 | 17740.1300 |   3.3523
> (36 rows)
For indexes=4,8,16, the cases with parallel_degree=4,8,16 behave almost the same. I suspect that the indexes are too small, that all the index pages were in memory, and that the CPU was saturated. Maybe you had four cores, so parallel workers beyond that number had no effect. Other normal backends would have been able to do almost nothing in the meantime. Usually the number of parallel workers is chosen so that the I/O capacity is filled up, but under such a situation this feature intermittently saturates the CPU capacity instead.

I'm not sure, but what if we did the index vacuum in a tuple-by-tuple manner? That is, the heap vacuum passes dead tuples one by one (or buffers a few of them) to the workers, and the workers process them not by bulkdelete but by a per-tuple delete (which we don't have). That could avoid the heap scan having to sleep while the index bulkdelete is running. (A rough sketch of what I mean is appended at the bottom of this mail.)

> Attached the updated version patches. The patches apply to the current
> HEAD cleanly but the 0001 patch still changes the vacuum option to a
> Node since it's under the discussion. After the direction has been
> decided, I'll update the patches.

As for the to-be-or-not-to-be-a-node problem, I don't think a node is needed, but from the point of view of consistency it seems reasonable, and it is seen elsewhere that a *Stmt node holds an options node. Still, makeVacOpt, its usage, and the subsequent operations on the node look somewhat strange. Why don't you just do "makeNode(VacuumOptions)"?

>+	/* Estimate size for dead tuples -- PARALLEL_VACUUM_KEY_DEAD_TUPLES */
>+	maxtuples = compute_max_dead_tuples(nblocks, nindexes > 0);

If I understand this correctly, nindexes is always > 1 there. At the least it should be asserted (e.g. Assert(nindexes > 0)) there.

>+	estdt = MAXALIGN(add_size(sizeof(LVDeadTuples),

I don't think the name is good. ("dt" read as "detach" to me at first glance.)

>+	if (lps->nworkers_requested > 0)
>+		appendStringInfo(&buf,
>+						 ngettext("launched %d parallel vacuum worker for index cleanup (planned: %d, requested %d)",

"planned"?

>+	/* Get the next index to vacuum */
>+	if (do_parallel)
>+		idx = pg_atomic_fetch_add_u32(&(lps->lvshared->nprocessed), 1);
>+	else
>+		idx = nprocessed++;

It seems that both of these cases could be handled through LVParallelState, and then most of the branches on lps or do_parallel could be removed. (See the second sketch at the bottom of this mail.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
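[Sketch 1] A very rough illustration of the per-tuple index deletion idea above. Nothing like this exists today, so every name here (amtupledelete_function, get_next_dead_tid_batch, vacuum_one_index_by_tuples) is made up; it is only meant to show the shape of the interface, not a working implementation.

#include "postgres.h"

#include "access/genam.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

/*
 * Hypothetical per-tuple index deletion entry point.  Today we only have
 * ambulkdelete(), which scans the whole index; the idea is that a worker
 * could remove the entries for a single dead heap TID (or a small batch)
 * as soon as the heap scan hands it over, instead of waiting for a full
 * bulk-delete pass.
 */
typedef void (*amtupledelete_function) (Relation indexRelation,
										ItemPointer deadTids,
										int ndeadtids);

/* hypothetical helper that pulls dead TIDs from a shared queue */
extern int	get_next_dead_tid_batch(ItemPointer dest, int maxitems);

/*
 * Worker side, roughly: consume small batches of dead TIDs produced by
 * the heap scan and delete the corresponding index entries right away,
 * so the heap scan never has to wait for a whole-index bulkdelete.
 */
static void
vacuum_one_index_by_tuples(Relation indexrel,
						   amtupledelete_function tupledelete)
{
	ItemPointerData batch[64];
	int			n;

	while ((n = get_next_dead_tid_batch(batch, lengthof(batch))) > 0)
		tupledelete(indexrel, batch, n);
}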
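[Sketch 2] What I mean by handling both cases through LVParallelState. The struct definitions below are abbreviated to the members used here (LVShared as the type name and lazy_get_next_index are my guesses/made up); the point is only that, if the non-parallel path also initializes an LVShared in backend-local memory, the "get the next index" step needs no branch on do_parallel.

#include "postgres.h"

#include "port/atomics.h"

/* Abbreviated versions of the patch's structures, for illustration only. */
typedef struct LVShared
{
	pg_atomic_uint32 nprocessed;	/* next index to process */
	/* ... */
} LVShared;

typedef struct LVParallelState
{
	LVShared   *lvshared;		/* points into DSM in parallel mode, or
								 * into backend-local memory otherwise */
	/* ... */
} LVParallelState;

/*
 * pg_atomic_* works fine on backend-local memory (it just has to be
 * initialized with pg_atomic_init_u32()), so with lvshared always set up
 * the caller fetches the next index the same way in both modes and the
 * if (do_parallel) branches disappear from the loop.
 */
static inline int
lazy_get_next_index(LVParallelState *lps)
{
	return (int) pg_atomic_fetch_add_u32(&lps->lvshared->nprocessed, 1);
}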