-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/
-----------------------------------------------------------

Review request for hive.


Summary
-------

Currently, the JDBCStatsPublisher executes two queries per inserted row of 
statistics, first query to check if the ID was inserted by another task, and 
second query to insert a new or update the existing row.
The latter occurs very rarely, since duplicates most likely originate from 
speculative failed tasks.

Currently the schema of the stat table is the following:

PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have 
any integrity constraints declared.

We amend it to:

PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).

HIVE-2144 improves on performance by greedily performing the insertion 
statement.
Then instead of executing two queries per row inserted, we can execute one 
INSERT query.
In the case primary key constraint violation, we perform a single UPDATE query.
The UPDATE query needs to check the condition, if the currently inserted stats 
are "newer" then the ones already in the table.


This addresses bug HIVE-2144.
    https://issues.apache.org/jira/browse/HIVE-2144


Diffs
-----

  
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 
1125140 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 
PRE-CREATION 

Diff: https://reviews.apache.org/r/765/diff


Testing
-------

TestStatsPublisher JUnit test:
- basic behaviour
- multiple updates
- cleanup of the statistics table after aggregation

Standalone testing on the cluster.
- insert/analyze queries over non-partitioned/partitioned tables

NOTE. For the correct behaviour, the primary_key index needs to be created, or 
the PARTITION_STAT_TABLE table dropped - which triggers creation of the table 
with the constraint declared.


Thanks,

Tomasz

Reply via email to