[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039417#comment-13039417 ]
Tomasz Nykiel commented on HIVE-2144:
-------------------------------------

IMPORTANT NOTE!
Before deployment, if PARTITION_STAT_TBL already exists, a primary key constraint must be added manually on its ID column. Otherwise, statistics might be duplicated for some entries, and the aggregated statistics will be silently incorrect.
If the table does not exist, it will be created in the proper format.

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2144.1.patch, HIVE-2144.2.patch, HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID
> was inserted by another task (most likely a speculative or previously
> failed task). Depending on whether the ID is there, an INSERT or UPDATE query
> is issued. So there are basically two queries per row inserted into the
> intermediate stats table. This workload could be halved if we insert
> the row anyway (it is very rare that IDs are duplicated) and use a different
> SQL query in the aggregation phase to dedup the ID (e.g., using group-by and
> max()). The benefit is that even though the aggregation query is more
> expensive, it is only run once per query.
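
The idea in the issue description (insert unconditionally, then dedup by ID at aggregation time with group-by and max()) can be sketched as follows. This is only an illustration of that idea, not the attached patch: it uses Python's sqlite3 module instead of Hive's JDBC code, and the table name stats_tmp, its columns, and the function names are all hypothetical.

```python
import sqlite3


def create_stats_table(conn):
    # Hypothetical intermediate stats table. It deliberately has no
    # primary key, so duplicate IDs from retried tasks are accepted.
    conn.execute("CREATE TABLE stats_tmp (id TEXT, row_count INTEGER)")


def publish(conn, task_id, row_count):
    # One INSERT per row, with no preceding SELECT to check for the ID.
    conn.execute(
        "INSERT INTO stats_tmp (id, row_count) VALUES (?, ?)",
        (task_id, row_count),
    )


def aggregate(conn):
    # Collapse duplicate IDs with GROUP BY and MAX(), then sum the
    # per-ID values. This runs once, in the aggregation phase.
    cur = conn.execute(
        "SELECT COALESCE(SUM(c), 0) FROM ("
        "  SELECT MAX(row_count) AS c FROM stats_tmp GROUP BY id"
        ") AS deduped"
    )
    return cur.fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_stats_table(conn)
    publish(conn, "task_0", 100)
    publish(conn, "task_0", 100)  # same ID again, e.g. a speculative task
    publish(conn, "task_1", 50)
    print(aggregate(conn))  # 150; a plain SUM over all rows would give 250
```

The trade-off is the one the description names: publishing costs one statement per row instead of two, while the extra group-by work is paid once in the aggregation query.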