[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038932#comment-13038932 ]

Hudson commented on HIVE-2144:
------------------------------

Integrated in Hive-trunk-h0.21 #748 (See [https://builds.apache.org/hudson/job/Hive-trunk-h0.21/748/])
HIVE-2144. reduce workload generated by JDBCStatsPublisher (Tomasz Nykiel via Ning Zhang)

nzhang : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1127229
Files :
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java


> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2144.1.patch, HIVE-2144.2.patch, HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID
> was inserted by another task (most likely a speculative or previously
> failed task). Depending on whether the ID is there, an INSERT or UPDATE query
> is issued. So there are basically two queries per row inserted into the
> intermediate stats table. This workload could be halved if we insert the row
> anyway (it is very rare that IDs are duplicated) and use a different SQL
> query in the aggregation phase to dedup the ID (e.g., using group-by and
> max()). The benefit is that even though the aggregation query is more
> expensive, it is only run once per query.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
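
For illustration only, the idea in the issue description (insert without checking, then dedup IDs at aggregation time with group-by and max()) can be sketched as below. This is not the committed patch: the table name stats_tmp, the column names, the ID format, and the use of Python with SQLite are all assumptions made to keep the example self-contained. It also assumes that duplicate rows for one ID report comparable counts, so that taking max() per ID is an acceptable way to collapse them.

```python
import sqlite3


def publish(conn, task_id, row_count):
    # One blind INSERT per row; no preceding SELECT to check for the ID.
    # Table and column names are hypothetical.
    conn.execute(
        "INSERT INTO stats_tmp (id, row_count) VALUES (?, ?)",
        (task_id, row_count),
    )


def aggregate(conn, id_prefix):
    # Dedup at read time: collapse duplicate IDs with GROUP BY + MAX,
    # then sum the per-ID values. Runs once per query, not once per row.
    # Note: '%' or '_' inside id_prefix would act as LIKE wildcards.
    cur = conn.execute(
        "SELECT COALESCE(SUM(m), 0) FROM ("
        "  SELECT MAX(row_count) AS m FROM stats_tmp"
        "  WHERE id LIKE ? GROUP BY id"
        ")",
        (id_prefix + "%",),
    )
    return cur.fetchone()[0]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stats_tmp (id TEXT, row_count INTEGER)")
    publish(conn, "job1/task0", 100)
    publish(conn, "job1/task1", 250)
    # Duplicate ID, e.g. from a speculative or retried task attempt.
    publish(conn, "job1/task0", 100)
    return aggregate(conn, "job1/")
```

In this sketch each published row costs a single statement, and the duplicate row for job1/task0 is counted once by the aggregation query.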