-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1737/
-----------------------------------------------------------
Review request for hive and Ning Zhang.
Summary
-------
I considered two different strategies for handling duplicates after removing
the primary key from the table.
1) Go back to performing a select, and then updating if a row for the file
exists, or inserting a new record otherwise.
2) Always insert records and then during aggregation get the max value for each
statistic with a group by on the file name, and then aggregate those statistics.
This diff contains the code for option 2. I determined this to be the better
option by adding a couple stress tests to TestStatsPublisherEnhanced, and then
comparing the run times for the two implementations using derby and MySQL. The
two tests checked the performance when inserting a couple hundred rows for each
of two files, and inserting several hundred rows, each for a different file.
In each case, when i ran the tests on my machine there wasn't much difference
for derby, but for MySQL I was seeing both tests run about 100 ms faster for
MySQL. I ran both tests several times, to confirm what I was seeing.
Note that previously, if statistics were added for a file, and then statistics
were added again for that same file, but missing some number of values, those
missing values were erased from the row. With this new implementation the old
values for those missing statistics will be used. This case will probably
never happen in the field.
This addresses bug HIVE-2430.
https://issues.apache.org/jira/browse/HIVE-2430
Diffs
-----
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
1165899
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java
1165899
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java
1165899
trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java
1165899
Diff: https://reviews.apache.org/r/1737/diff
Testing
-------
I added two new stress tests to TestStatsPublisherEnhanced. I also modified
one of the tests to reflect the modified behavior described in the Description.
I ran the unit test queries as well.
Thanks,
Kevin