[ 
https://issues.apache.org/jira/browse/HIVE-5304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-5304:
-----------------------------------

    Description: 
[removed old description]
Hive JDOQL filter pushdown and direct SQL may end up pushing StringCol op 
'SomeString' to underlying SQL datastore. However, the datastore may handle 
these differently based on the encoding and collation used for the columns of 
the database.
So, query results can change depending on the underlying store for the 
metastore, and OS used

I am assuming that byte-order sort if the correct way to order things.
Our MySQL script specifies _bin collation, and Postgres, as far as I see, 
defaults to "C"; both of those are byte-order collations.
However, MySQL by default doesn't use _bin collation, so if database is 
auto-created, the order of things is going to change. 
Derby also uses the non-byte-order by 

drop_partitions_filter.q illustrates this problem. In byte order collation 
(proper way) USA is sorted before Uganda

  was:
Hive uses JDOQL filters to optimize partition retrieval; recently direct SQL 
was added to optimize it further. Both of these methods may end up pushing 
StringCol op 'SomeString' to underlying SQL datastore. Many paths also pushes 
order by-s, although these are not as problematic.
The problem is that different datastores handle string compares differently. 
While testing on Postgres, I see that results in different things, from 
innocent like order changes in "show partitions", to more serious like 
{code}
 alter table ptestfilter drop partition (c>='US', d<='2')
{code}
in drop_partitions_filter.q - in Derby, with which the .q.out file was 
generated, it drops "c=Uganda/d=2"; this also passes on MySQL (I ran tests with 
autocreated db); on Postgres with a db from the script it doesn't.
Looks like we need to enforce collation in partition names and 
part_key_values-es; both in the create scripts, as well as during autocreate 
(via package.jdo?)

EDIT:
also affected are - show indexes. So all names need to be taken care of

Then; describe_comment_nonascii.q fails against MySQL on autocreated db due to 
problems with commends.


> Hive results can depend on metastore's underlying datastore, if autocreate is 
> used
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-5304
>                 URL: https://issues.apache.org/jira/browse/HIVE-5304
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>            Reporter: Sergey Shelukhin
>
> [removed old description]
> Hive JDOQL filter pushdown and direct SQL may end up pushing StringCol op 
> 'SomeString' to underlying SQL datastore. However, the datastore may handle 
> these differently based on the encoding and collation used for the columns of 
> the database.
> So, query results can change depending on the underlying store for the 
> metastore, and OS used
> I am assuming that byte-order sort if the correct way to order things.
> Our MySQL script specifies _bin collation, and Postgres, as far as I see, 
> defaults to "C"; both of those are byte-order collations.
> However, MySQL by default doesn't use _bin collation, so if database is 
> auto-created, the order of things is going to change. 
> Derby also uses the non-byte-order by 
> drop_partitions_filter.q illustrates this problem. In byte order collation 
> (proper way) USA is sorted before Uganda



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to