On Wed, Dec 2, 2020 at 7:11 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Wed, Dec 2, 2020 at 3:33 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
> > On 2020/12/02 12:53, Masahiko Sawada wrote:
> > > On Tue, Dec 1, 2020 at 5:31 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> > >> On Tue, Dec 1, 2020 at 4:32 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
> > >>> On 2020/12/01 16:23, Masahiko Sawada wrote:
> > >>>> On Tue, Dec 1, 2020 at 1:48 PM Kasahara Tatsuhito <kasahara.tatsuh...@gmail.com> wrote:
> > >>>>> Hi,
> > >>>>> On Mon, Nov 30, 2020 at 8:59 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
> > >>>>>> On 2020/11/30 10:43, Masahiko Sawada wrote:
> > >>>>>>> On Sun, Nov 29, 2020 at 10:34 PM Kasahara Tatsuhito <kasahara.tatsuh...@gmail.com> wrote:
> > >>>>>>>> Hi, Thanks for your comments.
> > >>>>>>>> On Fri, Nov 27, 2020 at 9:51 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
> > >>>>>>>>> On 2020/11/27 18:38, Kasahara Tatsuhito wrote:
> > >>>>>>>>>> Hi,
> > >>>>>>>>>> On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
> > >>>>>>>>>>> On 2020/11/26 10:41, Kasahara Tatsuhito wrote:
> > >>>>>>>>>>>> On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> > >>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito <kasahara.tatsuh...@gmail.com> wrote:
> > >>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> > >>>>>>>>>>>>>>> On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito <kasahara.tatsuh...@gmail.com> wrote:
> > >>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>> On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito <kasahara.tatsuh...@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>> I wonder if we could have table_recheck_autovac do two probes of the stats data. First probe the existing stats data, and if it shows the table to be already vacuumed, return immediately. If not, *then* force a stats re-read, and check a second time.
> > >>>>>>>>>>>>>>>>> Does the above mean that the second and subsequent table_recheck_autovac() will be improved to first check using the previously refreshed statistics? I think that certainly works.
> > >>>>>>>>>>>>>>>>> If that's correct, I'll try to create a patch for the PoC
> > >>>>>>>>>>>>>>>> I still don't know how to reproduce Jim's troubles, but I was able to reproduce what was probably a very similar problem.
> > >>>>>>>>>>>>>>>> This problem seems to be more likely to occur in cases where you have a large number of tables, i.e., a large amount of stats, and many small tables need VACUUM at the same time.
> > >>>>>>>>>>>>>>>> So I followed Tom's advice and created a patch for the PoC. This patch sets a flag in the table_recheck_autovac function so that the existing stats are used next time, if the check made after the stats have been refreshed finds that VACUUM (or ANALYZE) has already been done by another worker. If the tables continue to require VACUUM after the refresh, then a refresh will be required instead of using the existing statistics.
> > >>>>>>>>>>>>>>>> I did a simple test with HEAD and HEAD + this PoC patch. The tests were conducted in two cases. (I changed a few configurations; see attached scripts.)
> > >>>>>>>>>>>>>>>> 1. Normal VACUUM case
> > >>>>>>>>>>>>>>>>   - SET autovacuum = off
> > >>>>>>>>>>>>>>>>   - CREATE tables with 100 rows
> > >>>>>>>>>>>>>>>>   - DELETE 90 rows from each table
> > >>>>>>>>>>>>>>>>   - SET autovacuum = on and restart PostgreSQL
> > >>>>>>>>>>>>>>>>   - Measure the time it takes for all tables to be VACUUMed
> > >>>>>>>>>>>>>>>> 2. Anti-wraparound VACUUM case
> > >>>>>>>>>>>>>>>>   - CREATE blank tables
> > >>>>>>>>>>>>>>>>   - SELECT all of these tables (to generate stats)
> > >>>>>>>>>>>>>>>>   - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
> > >>>>>>>>>>>>>>>>   - Consume a lot of XIDs by using txid_current()
> > >>>>>>>>>>>>>>>>   - Measure the time it takes for all tables to be VACUUMed
> > >>>>>>>>>>>>>>>> For each test case, the following results were obtained by changing the autovacuum_max_workers parameter to 1, 2, 3 (default), 5 and 10, and by changing the number of tables to 1000, 5000, 10000 and 20000.
> > >>>>>>>>>>>>>>>> Due to the poor VM environment (2 vCPU / 4 GB), the results are a little unstable, but I think they are enough to see a trend.
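A minimal, self-contained C sketch of the recheck flow described above. This is not the actual PoC patch: stats_refresh() and needs_vacuum() are hypothetical stand-ins for autovac_refresh_stats() and the needs-vacuum check inside table_recheck_autovac(); only the carried-over use_existing_stats flag and the probe order are taken from the description above.

#include <stdbool.h>
#include <stdio.h>

static bool use_existing_stats = false;    /* carried over between rechecks */

/* Placeholder for the expensive re-read of the stats file
 * (autovac_refresh_stats() in the real code). */
static void
stats_refresh(void)
{
    printf("  (expensive stats refresh)\n");
}

/* Placeholder for the needs-vacuum test against the current snapshot;
 * here we simply pretend relid 42 was already vacuumed by another worker. */
static bool
needs_vacuum(int relid)
{
    return relid != 42;
}

/* Sketch of the recheck: cheap probe first (when allowed), refresh second. */
static bool
recheck_table(int relid)
{
    if (use_existing_stats)
    {
        /* First probe: reuse the snapshot we already have. */
        if (!needs_vacuum(relid))
            return false;       /* already vacuumed; skipped without a refresh */
    }

    /* Second probe: force a refresh and check against fresh stats. */
    stats_refresh();
    if (needs_vacuum(relid))
    {
        /* We will vacuum it ourselves; the snapshot gets invalidated anyway,
         * so refresh again at the next recheck. */
        use_existing_stats = false;
        return true;
    }

    /* Another worker got there even after the refresh; the freshly read stats
     * are probably still good enough for the next table, so allow the cheap
     * probe next time. */
    use_existing_stats = true;
    return false;
}

int
main(void)
{
    int relids[] = {41, 42, 43};

    for (int i = 0; i < 3; i++)
        printf("relid %d: %s\n", relids[i],
               recheck_table(relids[i]) ? "vacuum" : "skip");
    return 0;
}

The point of the carried-over flag is that the cheap probe against the already-loaded snapshot is only attempted when the previous recheck suggested other workers are ahead of us; otherwise the expensive refresh happens straight away.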
> > >>>>>>>>>>>>>>>> ===========================================================================
> > >>>>>>>>>>>>>>>> [1. Normal VACUUM case]
> > >>>>>>>>>>>>>>>> tables:1000
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 1:  (HEAD) 20 sec  VS (with patch) 20 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 2:  (HEAD) 18 sec  VS (with patch) 16 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 3:  (HEAD) 18 sec  VS (with patch) 16 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 5:  (HEAD) 19 sec  VS (with patch) 17 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 10: (HEAD) 19 sec  VS (with patch) 17 sec
> > >>>>>>>>>>>>>>>> tables:5000
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 1:  (HEAD) 77 sec  VS (with patch) 78 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 2:  (HEAD) 61 sec  VS (with patch) 43 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 3:  (HEAD) 38 sec  VS (with patch) 38 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 5:  (HEAD) 45 sec  VS (with patch) 37 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 10: (HEAD) 43 sec  VS (with patch) 35 sec
> > >>>>>>>>>>>>>>>> tables:10000
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 1:  (HEAD) 152 sec VS (with patch) 153 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 2:  (HEAD) 119 sec VS (with patch) 98 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 3:  (HEAD) 87 sec  VS (with patch) 78 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 5:  (HEAD) 100 sec VS (with patch) 66 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 10: (HEAD) 97 sec  VS (with patch) 56 sec
> > >>>>>>>>>>>>>>>> tables:20000
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 1:  (HEAD) 338 sec VS (with patch) 339 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 2:  (HEAD) 231 sec VS (with patch) 229 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 3:  (HEAD) 220 sec VS (with patch) 191 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 5:  (HEAD) 234 sec VS (with patch) 147 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 10: (HEAD) 320 sec VS (with patch) 113 sec
> > >>>>>>>>>>>>>>>> [2. Anti-wraparound VACUUM case]
> > >>>>>>>>>>>>>>>> tables:1000
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 1:  (HEAD) 19 sec  VS (with patch) 18 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 2:  (HEAD) 14 sec  VS (with patch) 15 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 3:  (HEAD) 14 sec  VS (with patch) 14 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 5:  (HEAD) 14 sec  VS (with patch) 16 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 10: (HEAD) 16 sec  VS (with patch) 14 sec
> > >>>>>>>>>>>>>>>> tables:5000
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 1:  (HEAD) 69 sec  VS (with patch) 69 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 2:  (HEAD) 66 sec  VS (with patch) 47 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 3:  (HEAD) 59 sec  VS (with patch) 37 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 5:  (HEAD) 39 sec  VS (with patch) 28 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 10: (HEAD) 39 sec  VS (with patch) 29 sec
> > >>>>>>>>>>>>>>>> tables:10000
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 1:  (HEAD) 139 sec VS (with patch) 138 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 2:  (HEAD) 130 sec VS (with patch) 86 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 3:  (HEAD) 120 sec VS (with patch) 68 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 5:  (HEAD) 96 sec  VS (with patch) 41 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 10: (HEAD) 90 sec  VS (with patch) 39 sec
> > >>>>>>>>>>>>>>>> tables:20000
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 1:  (HEAD) 313 sec VS (with patch) 331 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 2:  (HEAD) 209 sec VS (with patch) 201 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 3:  (HEAD) 227 sec VS (with patch) 141 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 5:  (HEAD) 236 sec VS (with patch) 88 sec
> > >>>>>>>>>>>>>>>>   autovacuum_max_workers 10: (HEAD) 309 sec VS (with patch) 74 sec
> > >>>>>>>>>>>>>>>> ===========================================================================
> > >>>>>>>>>>>>>>>> In the cases without the patch, worker scalability decreased as the number of tables increased. In fact, the more workers there are, the longer it takes to complete VACUUM on all tables. The cases with the patch show good scalability with respect to the number of workers.
> > >>>>>>>>>>>>>>> It seems a good performance improvement even without the patch of shared memory based stats collector.
> > >>>>>>>>>>> Sounds great!
> > >>>>>>>>>>>>>>>> Note that perf top results showed that hash_search_with_hash_value, hash_seq_search and pgstat_read_statsfiles are dominant during VACUUM in all patterns, with or without the patch.
> > >>>>>>>>>>>>>>>> Therefore, there is still a need to find ways to optimize the reading of large amounts of stats. However, this patch is effective in its own right, and since there are only a few parts to modify, I think it should be able to be applied to current (preferably pre-v13) PostgreSQL.
> > >>>>>>>>>>>>>>> +1
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> +
> > >>>>>>>>>>>>>>> +       /* We might be better to refresh stats */
> > >>>>>>>>>>>>>>> +       use_existing_stats = false;
> > >>>>>>>>>>>>>>>     }
> > >>>>>>>>>>>>>>> +   else
> > >>>>>>>>>>>>>>> +   {
> > >>>>>>>>>>>>>>> -       heap_freetuple(classTup);
> > >>>>>>>>>>>>>>> +       heap_freetuple(classTup);
> > >>>>>>>>>>>>>>> +       /* The relid has already vacuumed, so we might be better to use exiting stats */
> > >>>>>>>>>>>>>>> +       use_existing_stats = true;
> > >>>>>>>>>>>>>>> +   }
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> With that patch, the autovacuum process refreshes the stats in the next check if it finds out that the table still needs to be vacuumed. But I guess it's not necessarily true because the next table might be vacuumed already. So I think we might want to always use the existing stats for the first check. What do you think?
> > >>>>>>>>>>>>>> Thanks for your comment.
> > >>>>>>>>>>>>>> If we assume the case where some workers vacuum large tables and a single worker vacuums small tables, the processing performance of the single worker will be slightly lower if the existing statistics are checked every time.
> > >>>>>>>>>>>>>> In fact, at first I tried to check the existing stats every time, but the performance was slightly worse in cases with a small number of workers.
> > >>>>>>>>>>> Do you have this benchmark result?
> > >>>>>>>>>>>>>> (Checking the existing stats is lightweight, but at high frequency it affects processing performance.) Therefore, the decision whether autovacuum should use the existing statistics is made right after refreshing them.
> > >>>>>>>>>>>>> Yeah, since the test you used uses a lot of small tables, if there are a few workers, checking the existing stats is unlikely to return true (no need to vacuum). So the cost of the existing stats check ends up being overhead. Not sure how slow always checking the existing stats was, but given that the shared memory based stats collector patch could improve the performance of refreshing stats, it might be better not to check the existing stats frequently like the patch does. Anyway, I think it's better to evaluate the performance improvement with other cases too.
> > >>>>>>>>>>>> Yeah, I would like to see how much the performance changes in other cases. In addition, if the shared-based-stats patch is applied, we won't need to reload a huge stats file, so we will just have to check the stats on shared memory every time. Perhaps the logic of table_recheck_autovac could be simpler.
> > >>>>>>>>>>>>>> BTW, I found some typos in comments, so attached a fixed version.
> > >>>>>>>>>>> The patch adds some duplicated code into table_recheck_autovac(). It's better to make a common function performing them and make table_recheck_autovac() call that common function, to simplify the code.
> > >>>>>>>>>> Thanks for your comment. Hmm.. I've cut out the duplicate part. Attached the patch. Could you confirm that it fits your expectations?
> > >>>>>>>>> Yes, thanks for updating the patch! Here are more review comments.
> > >>>>>>>>>
> > >>>>>>>>> +       shared = pgstat_fetch_stat_dbentry(InvalidOid);
> > >>>>>>>>> +       dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
> > >>>>>>>>>
> > >>>>>>>>> When using the existing stats, ISTM that these are not necessary and we can reuse "shared" and "dbentry" obtained before. Right?
> > >>>>>>>> Yeah, but unless autovac_refresh_stats() is called, these functions read the information from the local hash table without re-reading the stats file, so the process is very light. Therefore, I think it is better to keep the current logic to keep the code simple.
> > >>>>>>>>> +       /* We might be better to refresh stats */
> > >>>>>>>>> +       use_existing_stats = false;
> > >>>>>>>>>
> > >>>>>>>>> I think that we should add more comments about why it's better to refresh the stats in this case.
> > >>>>>>>>>
> > >>>>>>>>> +       /* The relid has already vacuumed, so we might be better to use existing stats */
> > >>>>>>>>> +       use_existing_stats = true;
> > >>>>>>>>>
> > >>>>>>>>> I think that we should add more comments about why it's better to reuse the stats in this case.
> > >>>>>>>> I added comments.
> > >>>>>>>> Attached the patch.
> > >>>>>>> Thank you for updating the patch. Here are some small comments on the latest (v4) patch.
> > >>>>>>>
> > >>>>>>> + * So if the last time we checked a table that was already vacuumed after
> > >>>>>>> + * refres stats, check the current statistics before refreshing it.
> > >>>>>>> + */
> > >>>>>>>
> > >>>>>>> s/refres/refresh/
> > >>>>> Thanks! fixed.
> > >>>>> Attached the patch.
> > >>>>>>>
> > >>>>>>> -----
> > >>>>>>> +/* Counter to determine if statistics should be refreshed */
> > >>>>>>> +static bool use_existing_stats = false;
> > >>>>>>> +
> > >>>>>>>
> > >>>>>>> I think 'use_existing_stats' can be declared within table_recheck_autovac().
> > >>>>>>>
> > >>>>>>> -----
> > >>>>>>> While testing the performance, I realized that the statistics are reset every time one table is vacuumed, leading to re-reading the stats file even if 'use_existing_stats' is true. Note that vacuum() eventually calls AtEOXact_PgStat(), which calls pgstat_clear_snapshot().
> > >>>>>> Good catch!
> > >>>>>>> I believe that's why the performance of the method of always checking the existing stats wasn't as good as expected.
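An illustrative, self-contained C sketch of the snapshot behaviour being discussed (not backend code; load_stats_file() and StatsSnapshot are hypothetical stand-ins): fetches are cheap while a local snapshot exists, and clearing it, as AtEOXact_PgStat() ends up doing via pgstat_clear_snapshot() at the end of each vacuum transaction, forces the next fetch to pay the full stats-file read again.

#include <stdio.h>
#include <stddef.h>

typedef struct StatsSnapshot
{
    int dummy;                  /* the real snapshot is a set of hash tables */
} StatsSnapshot;

static StatsSnapshot snapshot_storage;
static StatsSnapshot *snapshot = NULL;

/* Placeholder for the expensive read of the on-disk stats file. */
static StatsSnapshot *
load_stats_file(void)
{
    printf("  reading stats file (expensive)\n");
    return &snapshot_storage;
}

/* Roughly the pattern pgstat_fetch_stat_dbentry() relies on: build the
 * local snapshot once, then answer later calls from it for free. */
static StatsSnapshot *
fetch_snapshot(void)
{
    if (snapshot == NULL)
        snapshot = load_stats_file();
    return snapshot;
}

/* Roughly what pgstat_clear_snapshot() does: throw the local copy away. */
static void
clear_snapshot(void)
{
    snapshot = NULL;
}

int
main(void)
{
    fetch_snapshot();           /* reads the stats file */
    fetch_snapshot();           /* cheap: local snapshot reused */
    clear_snapshot();           /* what happens at the end of each vacuum */
    fetch_snapshot();           /* has to read the stats file again */
    return 0;
}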
> > >>>>>>> So if we save the statistics somewhere and use them for rechecking, the results of the performance benchmark will differ between these two methods.
> > >>>>> Thanks for your checks.
> > >>>>> But, if a worker did vacuum(), that means this worker had determined in table_recheck_autovac() that a vacuum was needed. So use_existing_stats is set to false and the stats are refreshed next time. Therefore I think the current patch is fine, as we want to avoid unnecessary refreshing of statistics before the actual vacuum(), right?
> > >>>> Yes, you're right.
> > >>>> When I benchmarked the performance of the method of always checking existing stats, I edited your patch so that it sets use_existing_stats = true even if the second check is false (i.e., vacuum is needed). And the result I got was worse than expected, especially in the case of a few autovacuum workers. But it doesn't evaluate the performance of that method correctly, as the stats snapshot is cleared every time we vacuum. Given you had similar results, I guess you used a similar way when evaluating it, is that right? If so, it's better to fix this issue and see how the performance benchmark results will differ.
> > >>>> For example, the results of the test case with 10000 tables and 1 autovacuum worker I reported before were:
> > >>>>
> > >>>> 10000 tables:
> > >>>> autovac_workers 1 : 158s, 157s, 290s
> > >>>>
> > >>>> But after fixing that issue in the third method (always checking the existing stats), the results are:
> > >>> Could you tell me how you fixed that issue? Did you copy the stats to somewhere as you suggested, or skip pgstat_clear_snapshot() as I suggested?
> > >> I used the way you suggested in this quick test; skipped pgstat_clear_snapshot().
> > >>> Kasahara-san seems not to like the latter idea because it might cause a bad side effect. So we should use the former idea?
> > >> Not sure. I'm also concerned about the side effect but I've not checked yet.
> > >> Since there is probably no big difference between the two ways in terms of performance, I'm going to see how the performance benchmark result will change first.
> > > I've tested the performance improvement again. From the left: the execution time of the current HEAD, Kasahara-san's patch, and the method of always checking the existing stats (using the approach suggested by Fujii-san), in seconds.
> > > 1000 tables:
> > > autovac_workers 1 : 13s, 13s, 13s
> > > autovac_workers 2 : 6s, 4s, 4s
> > > autovac_workers 3 : 3s, 4s, 3s
> > > autovac_workers 5 : 3s, 3s, 2s
> > > autovac_workers 10: 2s, 3s, 2s
> > >
> > > 5000 tables:
> > > autovac_workers 1 : 71s, 71s, 72s
> > > autovac_workers 2 : 37s, 32s, 32s
> > > autovac_workers 3 : 29s, 26s, 26s
> > > autovac_workers 5 : 20s, 19s, 18s
> > > autovac_workers 10: 13s, 8s, 8s
> > >
> > > 10000 tables:
> > > autovac_workers 1 : 158s, 157s, 159s
> > > autovac_workers 2 : 80s, 53s, 78s
> > > autovac_workers 3 : 75s, 67s, 67s
> > > autovac_workers 5 : 61s, 42s, 42s
> > > autovac_workers 10: 69s, 26s, 25s
> > >
> > > 20000 tables:
> > > autovac_workers 1 : 379s, 380s, 389s
> > > autovac_workers 2 : 236s, 232s, 233s
> > > autovac_workers 3 : 222s, 181s, 182s
> > > autovac_workers 5 : 212s, 132s, 139s
> > > autovac_workers 10: 317s, 91s, 89s
> > >
> > > I don't see a big difference between Kasahara-san's patch and the method of always checking the existing stats.
> > Thanks for doing the benchmark!
> > This benchmark result makes me think that we don't need to tweak AtEOXact_PgStat() and can use Kasahara-san's approach.
> That's good news :)
> Yeah, given that all autovacuum workers have the list of tables to vacuum in the same order in most cases, the assumption in Kasahara-san's patch that if a worker needs to vacuum a table it's unlikely that it will be able to skip the next table using the current snapshot of stats makes sense to me.
> One small comment on v6 patch:
>
> + /* When we decide to do vacuum or analyze, the existing stats cannot
> + * be reused in the next cycle because it's cleared at the end of vacuum
> + * or analyze (by AtEOXact_PgStat()).
> + */
> + use_existing_stats = false;
>
> I think the comment should start on the second line (i.e., \n is needed after /*).

Oops, thanks. Fixed.
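For reference, the layout being requested is the usual multi-line comment style; a sketch using the wording from the quoted hunk (the exact text in the v7 patch may differ):

    /*
     * When we decide to do vacuum or analyze, the existing stats cannot
     * be reused in the next cycle because it's cleared at the end of vacuum
     * or analyze (by AtEOXact_PgStat()).
     */
    use_existing_stats = false;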
Best regards,

> Regards,
>
> --
> Masahiko Sawada
> EnterpriseDB: https://www.enterprisedb.com/

-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com
v7_mod_table_recheck_autovac.patch
Description: Binary data