[ https://issues.apache.org/jira/browse/HIVE-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116018#comment-13116018 ]
jirapos...@reviews.apache.org commented on HIVE-2471:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2079/
-----------------------------------------------------------

Review request for hive, Yongqiang He and Ning Zhang.

Summary
-------

I added a timestamp column, ts, to the partition statistics table; it defaults to current_timestamp. I also added code to create an index on that column, and to verify that the index exists when we check whether the table exists.

I also took the opportunity to fix another problem. Every time we change the schema of the partition statistics table we give it a slightly different name, like PARTITION_STATS, PARITION_STATISTICS, PARTITION_STAT_TBL, etc. Instead, I want to put a version number at the end of the table name; here I have PARTITION_STATS_V2. Rather than trying to come up with a new variation of the name, we can just increment the final number. This will also make it easy to identify old tables which can be dropped.

Checking whether the index exists may not be worth the time it takes. We have to run this check every time we init JDBCStatsPublisher, unless the table doesn't exist. If the index doesn't exist, it's not the end of the world: it just means any scripts which try to use the index will be slower, and the index can always be added later. Also, the chance that the program creates the table but is interrupted before it can create the index is low.

I added the check because tracking down why Hive had slowed down, only to find that a cleanup script was running slowly and therefore holding locks for a long time, sounded painful enough that the check would be worth it. I am open to debate on this.

This addresses bug HIVE-2471.
    https://issues.apache.org/jira/browse/HIVE-2471

Diffs
-----

  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1175957
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1175957
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java 1175957

Diff: https://reviews.apache.org/r/2079/diff

Testing
-------

I ran TestStatsPublisherEnhanced using both derby and MySQL, and verified all the tests succeeded. I also ran a few queries and verified that the table and index were created and that the rows, including timestamp, appeared in the table.

Thanks,

Kevin

> Add timestamp column with index to the partition stats table.
> -------------------------------------------------------------
>
>                 Key: HIVE-2471
>                 URL: https://issues.apache.org/jira/browse/HIVE-2471
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>
> Occasionally, when entries are added to the partition stats table the program
> is halted before it can delete those entries, by an exception, keyboard
> interrupt, etc. These build up to the point where the table gets very large,
> and it hurts the performance of the update statement which is often called.
> In order to fix this, I am adding a column to the table which is
> auto-populated with the current timestamp. I am also adding an index on this
> column. This will allow us to create scripts that go through periodically
> and clean out old entries from the table. The index will help to keep the
> runtime of these scripts short, and hence reduce the amount of time they need
> to lock the table/indexes for.
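
For illustration only, here is a rough sketch of how the versioned table name, the index existence check, and a periodic cleanup by timestamp might fit together over plain JDBC. It is not the code in the patch: the class StatsTableSql, its method names, and the index name suffix _TS_IDX are hypothetical, and the only details taken from the description are the table name PARTITION_STATS_V2 and the timestamp column ts.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper; not part of Hive's JDBCStatsUtils. */
final class StatsTableSql {
    static final String BASE_NAME = "PARTITION_STATS";
    static final String TS_COLUMN = "TS";

    private StatsTableSql() {}

    /** Versioned table name, e.g. PARTITION_STATS_V2; bump the number on schema changes. */
    static String tableName(int version) {
        return BASE_NAME + "_V" + version;
    }

    /** Index name derived from the table name (the suffix is an assumption). */
    static String indexName(int version) {
        return tableName(version) + "_TS_IDX";
    }

    static String createIndexSql(int version) {
        return "CREATE INDEX " + indexName(version)
                + " ON " + tableName(version) + " (" + TS_COLUMN + ")";
    }

    static String deleteOlderThanSql(int version) {
        return "DELETE FROM " + tableName(version) + " WHERE " + TS_COLUMN + " < ?";
    }

    /**
     * Returns true if the timestamp index is present on the stats table.
     * Identifier case handling differs between databases, so the table name
     * may need adjusting for a particular backend.
     */
    static boolean indexExists(Connection conn, int version) throws SQLException {
        String wanted = indexName(version);
        try (ResultSet rs = conn.getMetaData()
                .getIndexInfo(null, null, tableName(version), false, false)) {
            while (rs.next()) {
                // INDEX_NAME is null for statistics rows; equalsIgnoreCase(null) is false.
                if (wanted.equalsIgnoreCase(rs.getString("INDEX_NAME"))) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Deletes rows older than maxAge and returns the number of rows removed. */
    static int purgeOlderThan(Connection conn, int version, Duration maxAge)
            throws SQLException {
        Timestamp cutoff = Timestamp.from(Instant.now().minus(maxAge));
        try (PreparedStatement ps = conn.prepareStatement(deleteOlderThanSql(version))) {
            ps.setTimestamp(1, cutoff);
            return ps.executeUpdate();
        }
    }
}
```

A cleanup script along these lines would call purgeOlderThan with whatever retention window suits the deployment. Because the DELETE filters on the indexed ts column, it should be able to avoid a full table scan, which is the point of adding the index.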