[ 
https://issues.apache.org/jira/browse/HIVE-5304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793112#comment-13793112
 ] 

Sergey Shelukhin commented on HIVE-5304:
----------------------------------------

[~ashutoshc] I updated the JIRA description with detailed investigation 
results... just fyi

> Hive results can depend on metastore's underlying datastore
> -----------------------------------------------------------
>
>                 Key: HIVE-5304
>                 URL: https://issues.apache.org/jira/browse/HIVE-5304
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>            Reporter: Sergey Shelukhin
>
> [removed old description]
> Hive JDOQL filter pushdown and direct SQL may end up pushing StringCol op 
> 'SomeString' to the underlying SQL datastore. However, the datastore may 
> evaluate such comparisons differently depending on the encoding and 
> collation used for the columns of the database.
> So, query results can change depending on the underlying store for the 
> metastore and its version.
> The drop_partitions_filter.q test illustrates this problem. In byte-order 
> collation (the proper way) USA is sorted before Uganda, but some collations 
> may do it the other way, causing the test to fail.
> I am assuming that byte-order sort is the correct way to order things.
> Our MySQL script specifies _bin collation, which is byte-order; Postgres 9.1 
> and after, as far as I see, defaults to "C" collation, which is also 
> byte-order.
> Derby seems to use byte-order by default, though I didn't spend a lot of 
> time on Derby.
> However, Postgres before 9.1 seems to default to "en_US.UTF8" and there's no 
> way to change column collation in our script if the database is already 
> created.
> MySQL by default doesn't use _bin collation (on my machine), so if the 
> database is auto-created, the order of things is going to change.
> I didn't investigate MSSQL or Oracle.
> For now it seems that:
> 1) Auto-create shouldn't be used.
> 2) If an old version of Postgres (<9.1) is used, the collation should be set 
> properly by whoever issues "create database" (that is not our script).
> 3) We might want to add 'collate "C"' to varchar columns in the postgres 
> script to ensure the correct collation; however, this will break the script 
> for postgres <9.1.
> 4) MSSQL and Oracle might warrant investigation.
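
To make the effect described above concrete, here is a minimal sketch of how 
the same pushed-down string filter can return different rows depending on 
column collation. It is not Hive or metastore code: it uses SQLite's built-in 
BINARY and NOCASE collations as stand-ins for a byte-order collation and a 
case-insensitive one, and the table name "parts", the column "country", and 
the helper partitions_below are hypothetical names chosen for illustration. 
NOCASE is not the same as "en_US.UTF8"; it is only assumed here to be a 
convenient collation that orders Uganda before USA.

```python
import sqlite3


def partitions_below(collation, bound="Uganda"):
    """Return values that compare strictly below `bound` under `collation`.

    `collation` must be a built-in SQLite collation name (BINARY or NOCASE).
    It is interpolated into the DDL because collation names cannot be bound
    as query parameters.
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in for a metastore table holding partition values.
    conn.execute(f"CREATE TABLE parts (country TEXT COLLATE {collation})")
    conn.executemany(
        "INSERT INTO parts VALUES (?)",
        [("USA",), ("Uganda",), ("India",)],
    )
    # The comparison and the ORDER BY both use the column's collation.
    rows = conn.execute(
        "SELECT country FROM parts WHERE country < ? ORDER BY country",
        (bound,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


binary_result = partitions_below("BINARY")  # byte-order comparison
nocase_result = partitions_below("NOCASE")  # case-insensitive comparison
print(binary_result)
print(nocase_result)
```

Under BINARY, 'S' (0x53) compares below 'g' (0x67), so USA falls below Uganda 
and matches the filter; under NOCASE the comparison is effectively 's' 
against 'g', so USA no longer matches. The filter text is identical in both 
cases; only the collation of the column differs.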



--
This message was sent by Atlassian JIRA
(v6.1#6144)
