-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1737/
-----------------------------------------------------------

Review request for hive and Ning Zhang.


Summary
-------

I considered two different strategies for handling duplicates after removing 
the primary key from the table.

1) Go back to performing a select, and then updating if a row for the file 
exists, or inserting a new record otherwise.

2) Always insert records and then during aggregation get the max value for each 
statistic with a group by on the file name, and then aggregate those statistics.
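
As an illustration of option 2 (not the code in this diff), the idea can be 
sketched in memory: keep every inserted record, take the max per file, then 
combine the per-file values.  The class, method, and field names below are 
hypothetical, and summing is assumed as the final aggregation step; the actual 
change presumably does the grouping in the database.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MaxThenSumSketch {
    /** One published record: a file identifier and one statistic value (hypothetical shape). */
    public static final class StatRow {
        final String fileId;
        final long value;

        public StatRow(String fileId, long value) {
            this.fileId = fileId;
            this.value = value;
        }
    }

    /** Takes the max value per file (the "group by" step), then sums those maxima. */
    public static long aggregate(List<StatRow> rows) {
        Map<String, Long> maxPerFile = new HashMap<>();
        for (StatRow row : rows) {
            // Duplicate records for the same file collapse to their largest value.
            maxPerFile.merge(row.fileId, row.value, Long::max);
        }
        long total = 0;
        for (long fileMax : maxPerFile.values()) {
            total += fileMax;
        }
        return total;
    }
}
```

With records (fileA, 10), (fileA, 25), and (fileB, 10), the duplicate for fileA 
collapses to 25 and the aggregate is 35.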

This diff contains the code for option 2.  I determined this to be the better 
option by adding a couple of stress tests to TestStatsPublisherEnhanced, and 
then comparing the run times for the two implementations using Derby and MySQL.  
The two tests checked the performance when inserting a couple hundred rows for 
each of two files, and when inserting several hundred rows, each for a 
different file.  In each case, when I ran the tests on my machine there wasn't 
much difference for Derby, but for MySQL both tests ran about 100 ms faster 
with option 2.  I ran both tests several times to confirm what I was seeing.

Note that previously, if statistics were added for a file, and then statistics 
were added again for that same file, but missing some number of values, those 
missing values were erased from the row.  With this new implementation the old 
values for those missing statistics will be used.  This case will probably 
never happen in the field.
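
A minimal sketch of why the old value survives, assuming a missing statistic is 
stored as an absent (null) value and that the max skips it, as SQL MAX skips 
NULL.  The names are hypothetical and the representation of a missing value in 
the real table may differ.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Objects;

public class MissingStatSketch {
    /**
     * Max over the values published for one file and one statistic.
     * A null entry models a record that omitted the statistic; it is skipped,
     * so an earlier non-null value is still the result.
     */
    public static Long maxIgnoringMissing(List<Long> publishedValues) {
        return publishedValues.stream()
                .filter(Objects::nonNull)
                .max(Comparator.naturalOrder())
                .orElse(null); // no record carried the statistic at all
    }
}
```

Under these assumptions, publishing 100 and then a record without the statistic 
still yields 100 for that file.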


This addresses bug HIVE-2430.
    https://issues.apache.org/jira/browse/HIVE-2430


Diffs
-----

  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1165899 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1165899 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java 1165899 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java 1165899 

Diff: https://reviews.apache.org/r/1737/diff


Testing
-------

I added two new stress tests to TestStatsPublisherEnhanced.  I also modified 
one of the tests to reflect the changed behavior described in the Summary.

I ran the unit test queries as well.


Thanks,

Kevin
