-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1737/
-----------------------------------------------------------
Review request for hive and Ning Zhang.


Summary
-------

I considered two different strategies for handling duplicates after removing the primary key from the table.

1) Go back to performing a select, and then updating if a row for the file exists, or inserting a new record otherwise.
2) Always insert records, and then during aggregation get the max value of each statistic with a group by on the file name, and then aggregate those per-file values.

This diff contains the code for option 2. I determined this to be the better option by adding a couple of stress tests to TestStatsPublisherEnhanced and then comparing the run times of the two implementations using Derby and MySQL. One test inserts a couple hundred rows for each of two files; the other inserts several hundred rows, each for a different file. When I ran the tests on my machine there wasn't much difference between the two implementations on Derby, but on MySQL both tests ran about 100 ms faster with option 2. I ran both tests several times to confirm what I was seeing.

Note that previously, if statistics were added for a file and then added again for that same file with some values missing, those missing values were erased from the row. With this new implementation, the old values for those missing statistics will be used. This case will probably never happen in the field.


This addresses bug HIVE-2430.
    https://issues.apache.org/jira/browse/HIVE-2430


Diffs
-----

  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1165899
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1165899
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java 1165899
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java 1165899

Diff: https://reviews.apache.org/r/1737/diff


Testing
-------

I added two new stress tests to TestStatsPublisherEnhanced.
I also modified one of the existing tests to reflect the changed behavior described in the Summary, and I ran the unit test queries as well.


Thanks,

Kevin
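
P.S. For reviewers who want a picture of the option 2 aggregation without opening the diff, here is a minimal in-memory sketch. It is not the code in JDBCStatsUtils: the class name, the StatRow shape, and the "rowCount" statistic are assumptions made up for illustration, and the SQL in the comment uses hypothetical table and column names. It only shows the idea of taking the max per file and then aggregating across files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names here are hypothetical, not taken from the patch.
public class StatsAggregationSketch {

    /** One published record: a file identifier plus a single statistic value. */
    static final class StatRow {
        final String fileId;
        final long rowCount;

        StatRow(String fileId, long rowCount) {
            this.fileId = fileId;
            this.rowCount = rowCount;
        }
    }

    /**
     * Roughly what a query of this shape would compute (hypothetical schema):
     *   SELECT SUM(m) FROM (SELECT MAX(row_count) AS m FROM stats GROUP BY file_id) t
     */
    static long aggregateRowCount(List<StatRow> rows) {
        // Step 1: duplicates for the same file collapse to their max value.
        Map<String, Long> maxPerFile = new HashMap<>();
        for (StatRow row : rows) {
            maxPerFile.merge(row.fileId, row.rowCount, Math::max);
        }
        // Step 2: aggregate the per-file values.
        long total = 0;
        for (long value : maxPerFile.values()) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        List<StatRow> rows = new ArrayList<>();
        rows.add(new StatRow("file_a", 100)); // first publish for file_a
        rows.add(new StatRow("file_a", 250)); // duplicate insert for file_a
        rows.add(new StatRow("file_b", 100));
        System.out.println(aggregateRowCount(rows));
    }
}
```

Because every publish is a plain insert, the publisher never needs a select-then-update round trip; the cost of resolving duplicates moves to aggregation time.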