[ https://issues.apache.org/jira/browse/HIVE-11023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592540#comment-14592540 ]
Sushanth Sowmyan commented on HIVE-11023: ----------------------------------------- [~xuefuz] : Yes, this will happen to all releases with directSql - which means anything past 0.12, I think. That said, the number of installations that override this parameter should hopefully be minimal. If people are using datanucleus2 as their identifierFactory version, then: a) They should disable directSql (can be done by conf parameter, does not need this code fix - the code fix simply automates that for current and future releases) b) They should retain that identifierFactory - a mixed metastore with both is bad. c) Once we have a way of migrating them, as with HIVE-11039 filed, we should migrate them out of it. This parameter should never have been a part of hive-site.xml, I think, since it's dangerous if a user changes it. A "datanucleus1" installation changing the parameter to "datanucleus2" or vice-versa can result in metadata corruption for us. > Disable directSQL if datanucleus.identifierFactory = datanucleus2 > ----------------------------------------------------------------- > > Key: HIVE-11023 > URL: https://issues.apache.org/jira/browse/HIVE-11023 > Project: Hive > Issue Type: Bug > Components: Metastore > Affects Versions: 1.3.0, 1.2.1, 2.0.0 > Reporter: Sushanth Sowmyan > Assignee: Sushanth Sowmyan > Priority: Critical > Fix For: 1.2.1 > > Attachments: HIVE-11023.patch > > > We hit an interesting bug in a case where datanucleus.identifierFactory = > datanucleus2 . > The problem is that directSql handgenerates SQL strings assuming > "datanucleus1" naming scheme. If a user has their metastore JDO managed by > datanucleus.identifierFactory = datanucleus2 , the SQL strings we generate > are incorrect. > One simple example of what this results in is the following: whenever DN > persists a field which is held as a List<T>, it winds up storing each T as a > separate line in the appropriate mapping table, and has a column called > INTEGER_IDX, which holds the position in the list. Then, upon reading, it > automatically reads all relevant lines with an ORDER BY INTEGER_IDX, which > results in the list retaining its order. In DN2 naming scheme, the column is > called IDX, instead of INTEGER_IDX. If the user has run appropriate metatool > upgrade scripts, it is highly likely that they have both columns, INTEGER_IDX > and IDX. > Whenever they use JDO, such as with all writes, it will then use the IDX > field, and when they do any sort of optimized reads, such as through > directSQL, it will ORDER BY INTEGER_IDX. > An immediate danger is seen when we consider that the schema of a table is > stored as a List<FieldSchema> , and while IDX has 0,1,2,3,... , INTEGER_IDX > will contain 0,0,0,0,... and thus, any attempt to describe the table or fetch > schema for the table can come up mixed up in the table's native hashing > order, rather than sorted by the index. > This can then result in schema ordering being different from the actual > table. For eg:, if a user has a (a:int,b:string,c:string), a describe on this > may return (c:string, a:int, b: string), and thus, queries which are > inserting after selecting from another table can have ClassCastExceptions > when trying to insert data in the wong order - this is how we discovered this > bug. This problem, however, can be far worse, if there are no type problems - > it is possible, for eg., that if a,b&c were all strings, that that insert > query would succeed but mix up the order, which then results in user table > data being mixed up. This has the potential to be very bad. > We should write a tool to help convert metastores that use "datanucleus2" to > "datanucleus1"(more difficult, needs more one-time testing) or change > directSql to support both(easier to code, but increases test-coverage matrix > significantly and we should really then be testing against both schemes). But > in the short term, we should disable directSql if we see that the > identifierfactory is "datanucleus2" -- This message was sent by Atlassian JIRA (v6.3.4#6332)