[jira] [Commented] (HIVE-11023) Disable directSQL if datanucleus.identifierFactory = datanucleus2

Sushanth Sowmyan (JIRA) Thu, 18 Jun 2015 14:29:28 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-11023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592540#comment-14592540
 ]


Sushanth Sowmyan commented on HIVE-11023:
-----------------------------------------

[~xuefuz] : Yes, this will happen to all releases with directSql - which means 
anything past 0.12, I think. That said, the number of installations that 
override this parameter should hopefully be minimal.

If people are using datanucleus2 as their identifierFactory, then:
a) They should disable directSql (this can be done by conf parameter and does 
not need this code fix - the code fix simply automates that for current and 
future releases).
b) They should retain that identifierFactory - a mixed metastore with both is 
bad.
c) Once we have a way of migrating them (HIVE-11039 has been filed for that), 
we should migrate them out of it.
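
As a rough illustration of point (a), here is a minimal sketch of the check the 
code fix automates. It is not the actual patch: the function name is 
hypothetical, the configuration is modelled as a plain dict, and I am assuming 
that the directSql switch is the hive.metastore.try.direct.sql property and 
that both properties default as shown.

```python
# Hypothetical sketch, not the HIVE-11023 patch itself.
IDENTIFIER_FACTORY_KEY = "datanucleus.identifierFactory"
# Assumed name of the conf parameter that toggles directSql.
DIRECT_SQL_KEY = "hive.metastore.try.direct.sql"


def effective_direct_sql(conf):
    """Return True only if directSql is requested and safe to use.

    directSql is forced off when the identifier factory is datanucleus2,
    because the hand-generated SQL assumes the datanucleus1 naming scheme.
    """
    # Assumed defaults: datanucleus1 naming, directSql enabled.
    factory = conf.get(IDENTIFIER_FACTORY_KEY, "datanucleus1").strip().lower()
    requested = conf.get(DIRECT_SQL_KEY, "true").strip().lower() == "true"
    return requested and factory != "datanucleus2"


default_setup = effective_direct_sql({})
dn2_setup = effective_direct_sql({IDENTIFIER_FACTORY_KEY: "datanucleus2"})
```

Until such a check is in place, an installation on datanucleus2 would get the 
same effect by setting the directSql parameter to false by hand.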

This parameter should never have been a part of hive-site.xml, I think, since 
it's dangerous if a user changes it. A "datanucleus1" installation changing the 
parameter to "datanucleus2" or vice-versa can result in metadata corruption for 
us.


> Disable directSQL if datanucleus.identifierFactory = datanucleus2
> -----------------------------------------------------------------
>
>                 Key: HIVE-11023
>                 URL: https://issues.apache.org/jira/browse/HIVE-11023
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>    Affects Versions: 1.3.0, 1.2.1, 2.0.0
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>            Priority: Critical
>             Fix For: 1.2.1
>
>         Attachments: HIVE-11023.patch
>
>
> We hit an interesting bug in a case where datanucleus.identifierFactory = 
> datanucleus2 .
> The problem is that directSql hand-generates SQL strings assuming the 
> "datanucleus1" naming scheme. If a user has their metastore JDO managed by 
> datanucleus.identifierFactory = datanucleus2 , the SQL strings we generate 
> are incorrect.
> One simple example of what this results in is the following: whenever DN 
> persists a field which is held as a List<T>, it winds up storing each T as a 
> separate row in the appropriate mapping table, and has a column called 
> INTEGER_IDX, which holds the position in the list. Then, upon reading, it 
> automatically reads all relevant rows with an ORDER BY INTEGER_IDX, which 
> results in the list retaining its order. In the DN2 naming scheme, the column 
> is called IDX, instead of INTEGER_IDX. If the user has run the appropriate 
> metatool upgrade scripts, it is highly likely that they have both columns, 
> INTEGER_IDX and IDX.
> Whenever they use JDO, such as with all writes, it will then use the IDX 
> field, and when they do any sort of optimized reads, such as through 
> directSQL, it will ORDER BY INTEGER_IDX.
> An immediate danger is seen when we consider that the schema of a table is 
> stored as a List<FieldSchema> , and while IDX has 0,1,2,3,... , INTEGER_IDX 
> will contain 0,0,0,0,... and thus, any attempt to describe the table or fetch 
> the schema for the table can return the columns in the table's native hashing 
> order, rather than sorted by the index.
> This can then result in schema ordering being different from the actual 
> table. For example, if a user has a table (a:int, b:string, c:string), a 
> describe on it may return (c:string, a:int, b:string), and thus queries which 
> insert after selecting from another table can hit ClassCastExceptions 
> when trying to insert data in the wrong order - this is how we discovered this 
> bug. This problem, however, can be far worse if there are no type problems: 
> it is possible, for example, that if a, b and c were all strings, the insert 
> query would succeed but mix up the order, which then results in user table 
> data being mixed up. This has the potential to be very bad.
> We should write a tool to help convert metastores that use "datanucleus2" to 
> "datanucleus1" (more difficult, needs more one-time testing) or change 
> directSql to support both (easier to code, but increases the test-coverage 
> matrix significantly, and we should really then be testing against both 
> schemes). But in the short term, we should disable directSql if we see that 
> the identifierFactory is "datanucleus2".
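
The ordering problem described above can be reproduced with a small toy model. 
This is only a sketch: the rows, the dict layout and the read_columns helper 
are made up for illustration and do not reflect the real metastore tables 
beyond the two column names IDX and INTEGER_IDX. It emulates a mapping table 
whose rows were written via JDO under the datanucleus2 scheme (IDX populated, 
INTEGER_IDX left at 0) and then read back with each ORDER BY.

```python
# Toy rows of a list-mapping table, listed in an arbitrary storage order.
# IDX holds the real list position; INTEGER_IDX was never populated.
rows = [
    {"name": "c", "type": "string", "IDX": 2, "INTEGER_IDX": 0},
    {"name": "a", "type": "int", "IDX": 0, "INTEGER_IDX": 0},
    {"name": "b", "type": "string", "IDX": 1, "INTEGER_IDX": 0},
]


def read_columns(table_rows, order_by):
    """Emulate SELECT ... ORDER BY <order_by> and return the column names."""
    ordered = sorted(table_rows, key=lambda row: row[order_by])
    return [row["name"] for row in ordered]


# JDO path under datanucleus2: ordering by IDX restores the declared schema.
jdo_order = read_columns(rows, "IDX")  # ['a', 'b', 'c']

# directSql path: every INTEGER_IDX is 0, so the sort key cannot distinguish
# the rows. Python's stable sort keeps the storage order here; a real
# database gives no ordering guarantee at all for tied keys.
direct_sql_order = read_columns(rows, "INTEGER_IDX")  # ['c', 'a', 'b']
```

With these rows the directSql-style read comes back as (c, a, b), which 
matches the mixed-up describe output mentioned in the description.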



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
