[jira] [Commented] (HIVE-2471) Add timestamp column with index to the partition stats table.

jirapos...@reviews.apache.org (Commented) (JIRA) Tue, 27 Sep 2011 16:59:14 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116018#comment-13116018
 ]


jirapos...@reviews.apache.org commented on HIVE-2471:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2079/
-----------------------------------------------------------

Review request for hive, Yongqiang He and Ning Zhang.


Summary
-------

I added a timestamp column, ts, to the partition statistics table; it defaults 
to current_timestamp.  I also added code to create an index on that column 
and to verify that the index exists when we check whether the table exists.
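
For readers without the diff at hand, the change described above could look 
roughly like the sketch below.  It is not the code from the patch: the class 
name, the index name, and the ID and NUM_ROWS columns are made up for 
illustration.  Only the ts column and the PARTITION_STATS_V2 table name come 
from this review.  The index check uses the standard JDBC 
DatabaseMetaData.getIndexInfo call and looks for any index covering the 
timestamp column.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Illustrative sketch only; identifiers are assumptions, not the patch's code. */
final class StatsTableSketch {
    // The table and column names follow the review text; the index name is hypothetical.
    static final String TABLE = "PARTITION_STATS_V2";
    static final String TS_COLUMN = "TS";
    static final String TS_INDEX = "PARTITION_STATS_V2_TS_IDX";

    /** DDL with a timestamp column that the database fills in on insert. */
    static String createTableSql() {
        return "CREATE TABLE " + TABLE + " ("
            + "ID VARCHAR(255) NOT NULL PRIMARY KEY, " // hypothetical column
            + "NUM_ROWS BIGINT, "                      // hypothetical column
            + TS_COLUMN + " TIMESTAMP DEFAULT CURRENT_TIMESTAMP)";
    }

    /** DDL for the index that lets cleanup scripts find old rows quickly. */
    static String createIndexSql() {
        return "CREATE INDEX " + TS_INDEX + " ON " + TABLE + " (" + TS_COLUMN + ")";
    }

    /** Returns true if some index on the table covers the timestamp column. */
    static boolean indexExists(Connection conn) throws SQLException {
        DatabaseMetaData meta = conn.getMetaData();
        // unique=false lists all indexes; approximate=true allows cached metadata.
        try (ResultSet rs = meta.getIndexInfo(null, null, TABLE, false, true)) {
            while (rs.next()) {
                // COLUMN_NAME may be null for statistics rows; equalsIgnoreCase handles that.
                if (TS_COLUMN.equalsIgnoreCase(rs.getString("COLUMN_NAME"))) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Matching on the column rather than the index name means an index that was 
added later by hand, under a different name, would still be recognized.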

I also took the opportunity to fix another problem.  Every time we change the 
schema of the partition statistics table we give it a slightly different name, 
like PARTITION_STATS, PARITION_STATISTICS, PARTITION_STAT_TBL, etc.  Instead, I 
want to put a version number at the end of the table name; here I have 
PARTITION_STATS_V2.  Rather than trying to come up with a new variation of the 
name, we can just increment the final number.  This will also make it easy to 
identify old tables which can be dropped.
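
The naming scheme could be captured along the lines of the sketch below.  It 
is a hedged illustration, not code from the patch: the class and method names 
are hypothetical, and only the PARTITION_STATS_V prefix follows the name 
proposed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; class and method names are hypothetical. */
final class StatsTableNames {
    static final String PREFIX = "PARTITION_STATS_V";
    // Matches names of the form PARTITION_STATS_V<number>, capturing the number.
    private static final Pattern VERSIONED = Pattern.compile(PREFIX + "(\\d{1,9})");

    /** Builds the table name for a given schema version. */
    static String tableName(int version) {
        return PREFIX + version;
    }

    /** Picks out versioned tables older than the current schema version. */
    static List<String> obsoleteTables(List<String> existing, int currentVersion) {
        List<String> old = new ArrayList<>();
        for (String name : existing) {
            Matcher m = VERSIONED.matcher(name);
            if (m.matches() && Integer.parseInt(m.group(1)) < currentVersion) {
                old.add(name);
            }
        }
        return old;
    }
}
```

Tables created under the earlier ad hoc names carry no version suffix, so a 
helper like this would not flag them; they would still have to be found by 
hand.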

Checking whether the index exists may not be worth the time it takes.  We have 
to check this every time we init JDBCStatsPublisher, unless the table doesn't 
exist.  If the index doesn't exist, it's not the end of the world: it just means 
any scripts which try to use the index will be slower, and the index can always 
be added later.  Also, the chance that the program creates the table but is 
interrupted before it can create the index is low.  I added the check because 
tracking down why Hive had slowed down, only to find that a cleanup script was 
running slowly and hence holding the locks for a long time, sounded painful 
enough that the check would be worth it, but I am open to debate.


This addresses bug HIVE-2471.
    https://issues.apache.org/jira/browse/HIVE-2471


Diffs
-----

  
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1175957 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1175957 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java 1175957 

Diff: https://reviews.apache.org/r/2079/diff


Testing
-------

I ran TestStatsPublisherEnhanced using both Derby and MySQL, and verified that 
all the tests succeeded.

I also ran a few queries and verified that the table and index were created and 
that the rows, including timestamp, appeared in the table.


Thanks,

Kevin


                
> Add timestamp column with index to the partition stats table.
> -------------------------------------------------------------
>
>                 Key: HIVE-2471
>                 URL: https://issues.apache.org/jira/browse/HIVE-2471
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>
> Occasionally, after entries are added to the partition stats table, the 
> program is halted (by an exception, keyboard interrupt, etc.) before it can 
> delete those entries.  These build up to the point where the table gets very 
> large, which hurts the performance of the frequently called update statement.  
> In order to fix this, I am adding a column to the table which is 
> auto-populated with the current timestamp.  I am also adding an index on this 
> column.  This will allow us to create scripts that go through periodically 
> and clean out old entries from the table.  The index will help to keep the 
> runtime of these scripts short, and hence reduce the amount of time for which 
> they need to lock the table/indexes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
