[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035612#comment-13035612 ]
Ning Zhang commented on HIVE-2144:
----------------------------------

Great! I like the idea.

One comment about the primary key constraint: I'm not sure UNIQUE is the standard way to specify a primary key constraint. Some people use Oracle, MS SQL Server, or Postgres as the metastore, so we should use a standard way. I think 'id varchar(255) PRIMARY KEY' is more widely supported. Can you double check with MySQL and Derby?

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID
> was inserted by another task (most likely a speculative or previously
> failed task). Depending on whether the ID is there, an INSERT or UPDATE query
> is issued. So there are basically 2x queries per row inserted into the
> intermediate stats table. This workload could be reduced to 1/2 if we insert
> it anyway (it is very rare that IDs are duplicated) and use a different SQL
> query in the aggregation phase to dedup the ID (e.g., using group-by and
> max()). The benefit is that even though the aggregation query is more
> expensive, it is only run once per query.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
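
For illustration, the insert-anyway approach from the issue description could be sketched as below. This is only a sketch under assumptions, not the actual JDBCStatsPublisher code: the table name stats_tmp, the column names, the ID format, and the helper functions are all hypothetical, and SQLite stands in for the real stats database. It also assumes the id column carries no uniqueness constraint, so that a duplicate insert from a speculative task succeeds and is collapsed later by GROUP BY and MAX().

```python
import sqlite3


def publish(conn, row_id, row_count):
    # One statement per row: insert unconditionally, with no preceding
    # SELECT to decide between INSERT and UPDATE.
    conn.execute(
        "INSERT INTO stats_tmp (id, row_count) VALUES (?, ?)",
        (row_id, row_count),
    )


def aggregate(conn, id_prefix):
    # Dedup in the aggregation phase: collapse duplicate IDs with
    # GROUP BY / MAX(), then sum the per-ID values.
    cur = conn.execute(
        "SELECT COALESCE(SUM(per_id.c), 0) FROM ("
        "  SELECT MAX(row_count) AS c FROM stats_tmp"
        "  WHERE id LIKE ? GROUP BY id"
        ") AS per_id",
        (id_prefix + "%",),
    )
    return cur.fetchone()[0]


# Hypothetical schema: no PRIMARY KEY or UNIQUE on id, so duplicates are allowed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats_tmp (id VARCHAR(255), row_count BIGINT)")

publish(conn, "job1/task0", 100)
publish(conn, "job1/task0", 100)  # duplicate from a speculative task
publish(conn, "job1/task1", 50)

total = aggregate(conn, "job1/")
```

The duplicate row costs one extra insert but no extra lookup, and the more expensive grouped query runs only once, at aggregation time.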