Hi Denis!

> On 7 Dec 2020, at 18:23, Denis Smirnov <s...@arenadata.io> wrote:
> 
> I suggest refactoring the analyze AM API, as it is too heap-specific at 
> the moment. The problem was inspired by Greenplum’s analyze improvements 
> for append-optimized row and column AMs with variable-size compressed 
> blocks. Currently we do analyze in two steps.
> 
> 1. Sample fixed-size blocks with algorithm S from Knuth (the BlockSampler 
> functions).
> 2. Collect tuples into a reservoir with algorithm Z from Vitter.
> 
> So this doesn’t work, for example, for AMs using variable-sized physical 
> blocks. They need weighted random sampling (WRS) algorithms like A-Chao, 
> or logical blocks to which Knuth’s algorithm S can be applied (and they 
> have a problem with RelationGetNumberOfBlocks() estimating the physical 
> number of blocks). Another problem is columns: they are not passed to the 
> analyze begin-scan callback, so they can’t benefit from column storage at 
> ANALYZE TABLE (col).
> 
> The suggestion is to replace table_scan_analyze_next_block() and 
> table_scan_analyze_next_tuple() with a single function: 
> table_acquire_sample_rows(). The AM implementation of 
> table_acquire_sample_rows() can use the BlockSampler functions if it wants 
> to, but if the AM is not block-oriented, it could do something else. This 
> suggestion also passes VacAttrStats to table_acquire_sample_rows() for 
> column-oriented AMs and removes the PROGRESS_ANALYZE_BLOCKS_TOTAL and 
> PROGRESS_ANALYZE_BLOCKS_DONE definitions, since not all AMs are 
> block-oriented.

Just a few random notes about the idea.
Heap pages are not of a fixed size either, when measured in tuple count, and 
a comment in the code describes this:
 * Although every row has an equal chance of ending up in the final
 * sample, this sampling method is not perfect: not every possible
 * sample has an equal chance of being selected.  For large relations
 * the number of different blocks represented by the sample tends to be
 * too small.  We can live with that for now.  Improvements are welcome.

The current implementation provides a framework with shared code, though the 
framework is only suitable for block-of-tuples AMs and has statistical 
downsides when the tuple count per block varies too much.
Maybe we can just provide a richer API instead? That way we get both: no 
copied code and more flexibility.

Best regards, Andrey Borodin.
