[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Tomasz Nykiel (JIRA) Wed, 18 May 2011 13:13:30 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035634#comment-13035634
 ]


Tomasz Nykiel commented on HIVE-2144:
-------------------------------------

Yes, I agree. There are some subtle differences between UNIQUE and PK in Derby 
and MySQL (e.g., in MySQL a unique index allows null values, whereas in Derby 
it does not), so in general a PK constraint will be more suitable.

CREATE TABLE PARTITION_STAT_TBL ( IDE VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT )
works for both Derby and MySQL.
After a quick check it seems that it's supported by Oracle/MSSQL as well.
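
One way such a PK constraint could be used is sketched below: try the INSERT 
first and fall back to an UPDATE only when the key already exists, so the 
common case costs a single statement. This is an illustration, not the actual 
JDBCStatsPublisher code. SQLite (via Python's sqlite3 module) stands in for 
Derby/MySQL, and the publish() helper and the sample ID format are 
hypothetical.

```python
import sqlite3

# Assumption: SQLite is used here only as a convenient stand-in for Derby/MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PARTITION_STAT_TBL "
    "( IDE VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT )"
)


def publish(conn, ide, row_count):
    """Hypothetical helper: INSERT first, UPDATE only if the key already exists."""
    try:
        conn.execute(
            "INSERT INTO PARTITION_STAT_TBL (IDE, ROW_COUNT) VALUES (?, ?)",
            (ide, row_count),
        )
        return "inserted"
    except sqlite3.IntegrityError:
        # The PRIMARY KEY rejected a duplicate IDE, e.g. from a speculative
        # or previously failed task, so overwrite the existing row instead.
        conn.execute(
            "UPDATE PARTITION_STAT_TBL SET ROW_COUNT = ? WHERE IDE = ?",
            (row_count, ide),
        )
        return "updated"
```

With this shape, no SELECT is needed before writing; the second statement is 
only issued in the rare duplicate case.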



> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>
> In JDBCStatsPublisher, we first run a SELECT query to see whether the specific 
> ID was already inserted by another task (most likely a speculative or 
> previously failed task). Depending on whether the ID is there, an INSERT or 
> UPDATE query is issued. So there are basically two queries per row inserted 
> into the intermediate stats table. This workload could be halved if we insert 
> the row anyway (it is very rare that IDs are duplicated) and use a different 
> SQL query in the aggregation phase to dedup the IDs (e.g., using group-by and 
> max()). The benefit is that even though the aggregation query is more 
> expensive, it is only run once per query.
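
The insert-anyway approach described in the issue can be sketched as follows: 
rows are inserted without any existence check, and the aggregation query 
collapses duplicate IDs with GROUP BY and MAX() before summing. This is only 
an illustration under assumptions: SQLite (via Python's sqlite3 module) stands 
in for the real stats database, the table here has no key constraint, and the 
aggregate() helper and sample IDs are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for Derby/MySQL; no key constraint on IDE,
# so duplicate IDs can be inserted without a prior SELECT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PARTITION_STAT_TBL ( IDE VARCHAR(255), ROW_COUNT BIGINT )")

# The third row simulates a speculative task re-publishing the same ID.
rows = [
    ("part=1/task_0", 100),
    ("part=1/task_1", 50),
    ("part=1/task_0", 100),
]
conn.executemany(
    "INSERT INTO PARTITION_STAT_TBL (IDE, ROW_COUNT) VALUES (?, ?)", rows
)

# Dedup per ID with GROUP BY / MAX(), then sum the deduplicated counts.
DEDUP_SUM = """
SELECT SUM(MAX_ROW_COUNT) FROM (
    SELECT IDE, MAX(ROW_COUNT) AS MAX_ROW_COUNT
    FROM PARTITION_STAT_TBL
    GROUP BY IDE
) AS DEDUPED
"""


def aggregate(conn):
    """Hypothetical helper: total row count with duplicate IDs collapsed."""
    return conn.execute(DEDUP_SUM).fetchone()[0]
```

A plain SUM(ROW_COUNT) over this table would count the duplicated ID twice; 
the grouped query counts each ID once, at the cost of a heavier query that 
runs only in the aggregation phase.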

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
