[ https://issues.apache.org/jira/browse/HIVE-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Hanson updated HIVE-4946:
------------------------------
Description:
In order to prevent a bug, I had to use BytesColumnVector.setVal instead of
BytesColumnVector.setRef when creating the output of all string functions.
These include TRIM/LTRIM/RTRIM/SUBSTR, which can be made faster if they can use
the setRef method instead. That allows them to avoid the cost of copying the
data, by just setting a reference to a string (by setting the byte[] pointer,
start, and length).
As a future performance enhancement, it would be desirable to use setRef for
these functions instead of setVal. To do that, it must be possible to mark a
BytesColumnVector whose buffers are referenced by another column, so that it is
not reclaimed and re-used while those references are live. This would require a
design and implementation change to the output column manager used in
VectorizationContext.
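The hazard described above can be sketched in a few lines of Java. This is only
an illustration, not Hive code: MiniBytesVector is a hypothetical, simplified
stand-in for BytesColumnVector (it keeps the byte[] pointer, start, and length
per row, but its setVal copies into a fresh array rather than a shared buffer),
and trimThenReuse is a made-up helper that imitates a scratch column being
overwritten after the output row was set.

```java
import java.nio.charset.StandardCharsets;

public class SetRefDemo {
    /** Hypothetical, simplified stand-in for Hive's BytesColumnVector. */
    public static class MiniBytesVector {
        final byte[][] vector;
        final int[] start;
        final int[] length;

        public MiniBytesVector(int size) {
            vector = new byte[size][];
            start = new int[size];
            length = new int[size];
        }

        /** Point row i at an existing buffer; no bytes are copied. */
        public void setRef(int i, byte[] src, int off, int len) {
            vector[i] = src;
            start[i] = off;
            length[i] = len;
        }

        /** Copy the bytes so that row i owns its own data. */
        public void setVal(int i, byte[] src, int off, int len) {
            byte[] copy = new byte[len];
            System.arraycopy(src, off, copy, 0, len);
            vector[i] = copy;
            start[i] = 0;
            length[i] = len;
        }

        public String get(int i) {
            return new String(vector[i], start[i], length[i], StandardCharsets.UTF_8);
        }
    }

    /** Sets a trimmed result, then overwrites the scratch buffer it came from. */
    public static String trimThenReuse(boolean byReference) {
        // Scratch bytes standing in for the output of concat(l_shipmode, ' ').
        byte[] scratch = "AIR ".getBytes(StandardCharsets.UTF_8);
        MiniBytesVector out = new MiniBytesVector(1);
        if (byReference) {
            out.setRef(0, scratch, 0, 3); // rtrim result: same bytes, shorter length
        } else {
            out.setVal(0, scratch, 0, 3); // rtrim result: private copy
        }
        // Assumption: the scratch column is reclaimed and written by another expression.
        byte[] next = " AIR".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(next, 0, scratch, 0, next.length);
        return out.get(0);
    }

    public static void main(String[] args) {
        System.out.println("setRef: [" + trimThenReuse(true) + "]");  // prints "setRef: [ AI]"
        System.out.println("setVal: [" + trimThenReuse(false) + "]"); // prints "setVal: [AIR]"
    }
}
```

With setRef the output row silently changes when the scratch bytes are
overwritten; with setVal it does not, at the cost of one copy per row.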
I'm marking this as "minor" priority because it will most likely yield only a
small performance improvement. It can be deferred for a while.
The following is an example of a query that will exhibit a bug if setRef is
used instead of setVal in the implementation of trim and concat functions:
select l_shipmode,
rtrim(concat(l_shipmode,' ')) -- incorrect result for this output column
,trim(concat(' ',l_shipmode)) -- requires this line for bug to show up
from lineitem_orc
where l_orderkey = 1;
was:
In order to prevent a bug, I had to use BytesColumnVector.setVal instead of
BytesColumnVector.setRef when creating the output of all string functions.
These include TRIM/LTRIM/RTRIM/SUBSTR, which can be made faster if they can use
the setRef method instead. That allows them to avoid the cost of copying the
data, by just setting a reference to a string (by setting the byte[] pointer,
start, and length).
As a future performance enhancement, it would be desirable to be able to use
setRef for these functions instead of setVal. But to do that, it is necessary
to be able to mark a BytesColumnVector that is being referenced into so it is
not reclaimed and re-used. So this would require a design and implementation
change to the output column manager used in VectorizationContext.
I'm marking this as "minor" priority because it will result in a small
performance enhancement, most likely. It can be deferred for a while.
The following is an example of a query that will exhibit a bug if setRef is
used instead of setVal in the implementation of trim and concat functions:
select l_shipmode,
rtrim(concat(l_shipmode,' ')) -- missing last char
,trim(concat(' ',l_shipmode)) -- requires this line for bug to show up
from lineitem_orc
where l_orderkey = 1;
> Allow prevention of string column re-use for string functions that can set
> results by reference
> -----------------------------------------------------------------------------------------------
>
> Key: HIVE-4946
> URL: https://issues.apache.org/jira/browse/HIVE-4946
> Project: Hive
> Issue Type: Sub-task
> Affects Versions: vectorization-branch
> Reporter: Eric Hanson
> Priority: Minor