Sergey Shelukhin created HIVE-19418:
---------------------------------------

             Summary: add background stats updater similar to compactor
                 Key: HIVE-19418
                 URL: https://issues.apache.org/jira/browse/HIVE-19418
             Project: Hive
          Issue Type: Bug
            Reporter: Sergey Shelukhin
            Assignee: Sergey Shelukhin


There is a JIRA, HIVE-19416, to add a snapshot version to stats for MM/ACID 
tables so that they can be used in a transaction without breaking ACID (for 
metadata-only optimization). However, stats for ACID tables can still become 
unusable. For example, if two inserts run in parallel, neither sees the data 
written by the other, so after both finish, the snapshot recorded with either 
set of stats won't match the current snapshot and the stats will be unusable.
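
The parallel-insert problem can be illustrated with a toy model. This is only a 
sketch: it assumes a snapshot is the set of committed write IDs a writer could 
see plus its own, which is a simplification and not necessarily how HIVE-19416 
represents it. The names record_stats and stats_usable are hypothetical.

```python
# Toy model (assumption): a stats snapshot is the set of write IDs the writer
# could see when it started, plus its own write ID.
def record_stats(visible_ids, own_id):
    return frozenset(visible_ids) | {own_id}

# Stats are usable only if their snapshot equals the current committed set.
def stats_usable(stats_snapshot, committed_ids):
    return stats_snapshot == frozenset(committed_ids)

# Sequential inserts: the second writer sees the first one's data.
committed = set()
snap_a = record_stats(committed, 1)
committed.add(1)
snap_b = record_stats(committed, 2)
committed.add(2)
sequential_ok = stats_usable(snap_b, committed)

# Parallel inserts: both writers start before either commits.
committed = set()
snap_a = record_stats(committed, 1)
snap_b = record_stats(committed, 2)
committed.update({1, 2})
parallel_ok = stats_usable(snap_a, committed) or stats_usable(snap_b, committed)
```

In this model the sequential case leaves usable stats, while in the parallel 
case neither writer's stats describe the final table state.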

Additionally, for ACID and non-ACID tables alike, most stats, with some 
exceptions like numRows, cannot be aggregated (e.g. you cannot combine NDVs 
from two inserts), and for ACID tables even fewer can be aggregated (you cannot 
derive min/max if some rows are deleted without scanning the rest of the 
dataset).
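
A small worked example of why these stats do not aggregate. It uses exact 
distinct counts and made-up values purely for illustration; column_stats is a 
hypothetical helper, not a Hive API.

```python
# Illustrative only: two inserts write overlapping values into one column.
insert_a = [1, 2, 3, 4]
insert_b = [3, 4, 5]

def column_stats(values):
    """Per-insert stats as they might be recorded after each write."""
    return {
        "numRows": len(values),
        "ndv": len(set(values)),
        "min": min(values),
        "max": max(values),
    }

stats_a = column_stats(insert_a)
stats_b = column_stats(insert_b)

# numRows is additive, so merging the per-insert values is exact.
merged_num_rows = stats_a["numRows"] + stats_b["numRows"]

# NDV is not additive: summing double-counts values present in both inserts.
naive_ndv = stats_a["ndv"] + stats_b["ndv"]
true_ndv = len(set(insert_a + insert_b))

# After a delete, the old max may no longer exist; only a scan of the
# remaining rows gives the right answer.
remaining = [v for v in insert_a + insert_b if v != 5]
stale_max = max(stats_a["max"], stats_b["max"])
true_max = max(remaining)
```

Here the merged numRows is correct, the summed NDV overcounts the values shared 
by both inserts, and the max taken from the old stats is stale once the row 
holding it is deleted.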

Therefore we will add background logic to the metastore (similar to, and 
partially inside, the ACID compactor) to update stats.
It will have 3 modes of operation:
1) Off.
2) Update only the stats that exist but are out of date. Generating stats can 
be expensive, so if the user only analyzes a subset of tables, the updater 
should be able to update only that subset. We can simply look at the existing 
stats and run analyze only for the relevant partitions and columns.
3) On: everything in mode 2, plus create stats for all tables and columns that 
are missing stats.
There will also be a table parameter to skip the stats update.
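
The three modes and the skip parameter could be sketched roughly as below. This 
works at table granularity for brevity, whereas the description above is per 
partition and column. The mode names, the Table fields, and the parameter name 
"skip.stats.update" are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum

class Mode(Enum):
    OFF = "off"            # mode 1
    EXISTING = "existing"  # mode 2: refresh only stats that already exist
    ALL = "all"            # mode 3: mode 2 plus create missing stats

# Hypothetical name for the per-table opt-out parameter.
SKIP_PARAM = "skip.stats.update"

@dataclass
class Table:
    name: str
    has_stats: bool
    stats_current: bool
    params: dict = field(default_factory=dict)

def needs_update(mode, table):
    """Decide whether the background updater should analyze this table."""
    if mode is Mode.OFF:
        return False
    if table.params.get(SKIP_PARAM) == "true":
        return False
    if table.has_stats:
        # Modes 2 and 3 both refresh stats that exist but are out of date.
        return not table.stats_current
    # Only mode 3 creates stats where none exist.
    return mode is Mode.ALL

stale = Table("stale", has_stats=True, stats_current=False)
missing = Table("missing", has_stats=False, stats_current=False)
skipped = Table("skipped", True, False, {SKIP_PARAM: "true"})
```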

In phase 1, the process will operate outside of the compactor and run the 
analyze command on the table. The analyze command will automatically save the 
stats with ACID snapshot information if needed, based on HIVE-19416, so we 
don't need to do any special state management, and this will work for all 
table types. However, it is also more expensive, since it requires a separate 
table scan.
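
As a rough sketch of what phase 1 might submit, the helper below assembles an 
ANALYZE TABLE statement for a given partition and column list. build_analyze is 
a hypothetical function, and it does no identifier or value escaping, which 
real code would need.

```python
def build_analyze(db, table, partition=None, columns=None):
    """Build an ANALYZE TABLE statement (no escaping; illustration only).

    columns=None requests basic stats only; an empty list requests column
    stats for all columns; a non-empty list restricts the columns.
    """
    stmt = f"ANALYZE TABLE {db}.{table}"
    if partition:
        spec = ", ".join(f"{col}='{val}'" for col, val in partition.items())
        stmt += f" PARTITION ({spec})"
    stmt += " COMPUTE STATISTICS"
    if columns is not None:
        stmt += " FOR COLUMNS"
        if columns:
            stmt += " " + ", ".join(columns)
    return stmt

basic = build_analyze("db", "t")
targeted = build_analyze("db", "t", {"ds": "2018-05-04"}, ["a", "b"])
```

In mode 2, the partition and column arguments would come from the existing 
stats that were found to be out of date.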

In phase 2, we can explore adding stats collection during MM compaction, which 
uses a temp table. If there are no open writers during a major compaction (so 
we overwrite all of the data), the temp table stats can simply be copied over 
to the main table with the correct snapshot information, saving us a table 
scan.
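
The phase 2 shortcut amounts to a simple guard. The sketch below is 
hypothetical: stats_after_compaction and its arguments are illustrative names, 
and the snapshot is treated as an opaque value.

```python
def stats_after_compaction(is_major, open_write_ids, temp_table_stats, snapshot):
    """Return stats to publish on the main table, or None if a scan is needed."""
    # Only a major compaction with no open writers rewrites all of the data,
    # so only then do the temp table stats describe the whole table.
    if is_major and not open_write_ids:
        return {**temp_table_stats, "snapshot": snapshot}
    return None

copied = stats_after_compaction(True, [], {"numRows": 10}, "snap-1")
fallback = stats_after_compaction(True, [7], {"numRows": 10}, "snap-1")
```

When the guard fails, the updater would fall back to the phase 1 analyze path.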

In phase 3, we can add custom stats collection logic to the full ACID 
compactor, which is not query based, the same way as we'd do for phase 2. 
Alternatively, we can wait for the ACID compactor to become query based and 
just reuse the phase 2 logic.


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
