[ 
https://issues.apache.org/jira/browse/HIVE-5817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13823270#comment-13823270
 ] 

Sergey Shelukhin commented on HIVE-5817:
----------------------------------------

I will be away for a week shortly, so in case someone decides to take over: I 
examined two options:
1) Making the assumption true (all column names unique).
2) Adding lineage information to the column map.
I was previously unfamiliar with both parts of the codebase, so I may be 
missing something important; feel free to comment or take over.

For (1), I have a partial patch. Unfortunately, many places in Hive explicitly 
or implicitly make assumptions about the column names generated by 
SemanticAnalyzer or by other operators. I've fixed a few places, but those that 
remain were pretty tough to figure out, so the patch needs more work. Also, the 
patch will be quite "epic" in its impact relative to the problem, so it's the 
less-preferred approach.
For (2), adding lineage is easy. Output column names are indeed unique within 
one operator, so the operator ID can be used as a qualifier (it can be 
retrieved by tag while the operator runs). Alternatively, maps could be scoped 
to each operator, so that an operator would be initialized with "input maps" 
coming from the "output maps" of its parents, and would in turn generate its 
own "output map". The problem is on the retrieval side: anyone retrieving a 
column would have to know the lineage of what they are retrieving (which parent 
it comes from).
I've started studying the operator code to see what assumptions can be made in 
all the places where this map is accessed, especially, e.g., the recursive 
getVectorExpression call through getCustomUDFExpression, where expressions for 
UDF parameters are retrieved. It's not quite clear how to get the lineage for 
these parameters, or whether they are guaranteed to come from the same parent 
in any given processOp call. 
But this seems to be a promising approach.
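
To make the trade-off in (2) concrete, here is a minimal sketch of a map scoped 
by operator ID. Everything in it is an assumption for illustration: the class 
name OperatorScopedColumnMap, its methods, and the operator IDs are 
hypothetical and are not Hive's actual API. It shows that the write side is 
trivial, while every lookup now needs an operator ID, i.e. the caller must 
already know the lineage of the column.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Hive code: column indexes keyed by
// (operator ID, internal column name) instead of by name alone.
class OperatorScopedColumnMap {
    private final Map<String, Map<String, Integer>> byOperator = new HashMap<>();

    // Write side: names are unique within one operator, so this never collides.
    void add(String operatorId, String internalName, int vrgIndex) {
        Map<String, Integer> columns = byOperator.get(operatorId);
        if (columns == null) {
            columns = new HashMap<>();
            byOperator.put(operatorId, columns);
        }
        columns.put(internalName, vrgIndex);
    }

    // Read side: the caller has to supply the producing operator (the lineage).
    int lookup(String operatorId, String internalName) {
        Map<String, Integer> columns = byOperator.get(operatorId);
        Integer index = (columns == null) ? null : columns.get(internalName);
        if (index == null) {
            throw new IllegalArgumentException(
                "No column " + internalName + " for operator " + operatorId);
        }
        return index;
    }
}
```

With this shape, two parents can both publish "_col0" without clashing, but a 
caller such as the UDF-parameter path mentioned above would need some way to 
obtain the right operator ID before it can call lookup.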


> column name to index mapping in VectorizationContext is broken
> --------------------------------------------------------------
>
>                 Key: HIVE-5817
>                 URL: https://issues.apache.org/jira/browse/HIVE-5817
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>
> Columns coming from different operators may have the same internal names 
> ("_colNN"). There exists a query of the form {{select b.cb, a.ca from a JOIN 
> b ON ... JOIN x ON ...;}} (distilled from a more complex query), which runs 
> OK without vectorization. With vectorization, it runs OK for most ca, but for 
> some ca it fails (or can probably return incorrect results). That is 
> because, when building the column-to-VRG-index map in VectorizationContext, 
> the internal column name for ca that the first map join operator adds to the 
> mapping may be the same as the internal name for cb that the second one tries 
> to add. The second VMJ doesn't add it (see the code in the ctor), and when 
> it's time for it to output, it retrieves the wrong index from the map by 
> name, and then the wrong vector from the VRG.
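
The failure mode described in the issue can be reproduced in miniature with a 
plain name-to-index map. This is only a sketch under stated assumptions: 
SharedColumnMap, addIfAbsent, and the index values 2 and 5 are hypothetical 
and stand in for the real VectorizationContext code, which is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a single name-to-index map shared by all operators.
class SharedColumnMap {
    private final Map<String, Integer> columnIndex = new HashMap<>();

    // Mirrors the behavior described in the issue: if the name is already
    // present, the new mapping is silently skipped.
    void addIfAbsent(String internalName, int vrgIndex) {
        if (!columnIndex.containsKey(internalName)) {
            columnIndex.put(internalName, vrgIndex);
        }
    }

    int lookup(String internalName) {
        Integer index = columnIndex.get(internalName);
        if (index == null) {
            throw new IllegalArgumentException("Unknown column " + internalName);
        }
        return index;
    }
}

class CollisionDemo {
    public static void main(String[] args) {
        SharedColumnMap map = new SharedColumnMap();
        map.addIfAbsent("_col0", 2); // first map join: internal name for ca
        map.addIfAbsent("_col0", 5); // second map join: same name for cb, skipped
        // The second operator now reads the first operator's index for cb.
        System.out.println("index used for cb: " + map.lookup("_col0"));
    }
}
```

The second registration is dropped, so a later lookup by name returns the 
index that belongs to the first operator's column.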



