[jira] [Updated] (HIVE-11095) SerDeUtils another bug ,when Text is reused

xiaowei wang (JIRA) Thu, 25 Jun 2015 17:28:16 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


xiaowei wang updated HIVE-11095:
--------------------------------
    Description: 
{noformat}
The method transformTextFromUTF8 has a bug: it calls Text.getBytes().
Text.getBytes() returns the raw backing bytes, but only the data up to
Text.getLength() is valid. A better way is to use copyBytes() when the returned
array must be exactly the length of the data.
However, copyBytes() was only added after Hadoop 1.
{noformat}
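To make the getBytes() problem concrete, here is a minimal, self-contained sketch. ReusableText is a hypothetical stand-in written for this illustration, not Hadoop's Text class; it only assumes the behavior described above, namely that the backing array is reused and may be longer than the valid data. The helper names decodeAll and decodeValid are also assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for a reusable byte container; NOT Hadoop's Text class.
final class ReusableText {
    private byte[] bytes = new byte[0];
    private int length = 0;

    // Overwrites the contents, growing the backing array only when needed,
    // so bytes from an earlier, longer value can remain past `length`.
    void set(String value) {
        byte[] src = value.getBytes(StandardCharsets.UTF_8);
        if (src.length > bytes.length) {
            bytes = new byte[src.length];
        }
        System.arraycopy(src, 0, bytes, 0, src.length);
        length = src.length;
    }

    byte[] getBytes() { return bytes; }    // raw backing array, may hold stale bytes
    int getLength() { return length; }     // number of valid bytes
    byte[] copyBytes() { return Arrays.copyOf(bytes, length); } // exact-length copy
}

final class TextReuseDemo {
    // Buggy pattern: decodes the whole backing array, stale tail included.
    static String decodeAll(ReusableText t) {
        return new String(t.getBytes(), StandardCharsets.UTF_8);
    }

    // Safe pattern: decodes only the valid prefix.
    static String decodeValid(ReusableText t) {
        return new String(t.getBytes(), 0, t.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableText row = new ReusableText();
        row.set("session=3151,thread=254"); // first, longer row
        row.set("session=901");             // second, shorter row reuses the buffer
        System.out.println(decodeAll(row));   // carries the tail of the first row
        System.out.println(decodeValid(row)); // only the second row
    }
}
```

Bounding the decode by getLength() works without copyBytes(), which may matter where copyBytes() is not available.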
How I found this bug:
When I queried data from an LZO table, I found that in the results the length 
of the current row was always larger than that of the previous row, and 
sometimes the current row contained the contents of the previous row. For 
example, I executed this query:
{code:sql}
select * from web_searchhub where logdate=2015061003
{code}
The result of the query is shown below. Notice that the second row contains 
content from the first row.
{noformat}
INFO [03:00:05.589] HttpFrontServer::FrontSH 
msgRecv:Remote=/10.13.193.68:42098,session=3151,thread=254 2015061003
INFO [03:00:05.594] <18941e66-9962-44ad-81bc-3519f47ba274> 
session=901,thread=223ession=3151,thread=254 2015061003
{noformat}
The content of the original LZO file is shown below; it has just 2 rows.
{noformat}
INFO [03:00:05.635] <b88e0473-7530-494c-82d8-e2d2ebd2666c_forweb> 
session=3148,thread=285
INFO [03:00:05.635] HttpFrontServer::FrontSH 
msgRecv:Remote=/10.13.193.68:42095,session=3148,thread=285
{noformat}
I think this error is caused by Text reuse, and I have found a solution.
Additionally, the table creation SQL is:
{code:sql}
CREATE EXTERNAL TABLE `web_searchhub`(
`line` string)
PARTITIONED BY (
`logdate` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '
U0000'
WITH SERDEPROPERTIES (
'serialization.encoding'='GBK')
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION
'viewfs://nsX/user/hive/warehouse/raw.db/web/web_searchhub';
{code}
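Because the table above sets 'serialization.encoding', row bytes are converted between charsets, and that conversion is where a length-bounded decode matters. The following is only a sketch under assumptions: BoundedTranscode and fromUtf8 are hypothetical names, not Hive's SerDeUtils code, and the raw array with validLength stands in for what a reused Text would expose. ISO-8859-1 is used so the example runs on any JVM; a charset such as GBK could be passed via Charset.forName("GBK") where the JVM supports it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BoundedTranscode {
    // Hypothetical helper: re-encodes only the first `validLength` UTF-8 bytes
    // of `raw` into `target`, ignoring any stale bytes after that point.
    static byte[] fromUtf8(byte[] raw, int validLength, Charset target) {
        String decoded = new String(raw, 0, validLength, StandardCharsets.UTF_8);
        return decoded.getBytes(target);
    }

    public static void main(String[] args) {
        // The array is longer than the valid data, as with a reused buffer.
        byte[] raw = "h\u00e9llo world".getBytes(StandardCharsets.UTF_8);
        int validLength = 3; // 'h' (1 byte) + 'é' (2 bytes in UTF-8)
        byte[] out = fromUtf8(raw, validLength, StandardCharsets.ISO_8859_1);
        System.out.println(out.length);
    }
}
```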

  was:
The method transformTextFromUTF8 has a bug: it calls Text.getBytes().
When I queried data from an LZO table, I found that in the results the length 
of the current row was always larger than that of the previous row, and 
sometimes the current row contained the contents of the previous row. For 
example, I executed this query:
{code:sql}
select * from web_searchhub where logdate=2015061003
{code}
The result of the query is shown below. Notice that the second row contains 
content from the first row.
{noformat}
INFO [03:00:05.589] HttpFrontServer::FrontSH 
msgRecv:Remote=/10.13.193.68:42098,session=3151,thread=254 2015061003
INFO [03:00:05.594] <18941e66-9962-44ad-81bc-3519f47ba274> 
session=901,thread=223ession=3151,thread=254 2015061003
{noformat}
The content of the original LZO file is shown below; it has just 2 rows.
{noformat}
INFO [03:00:05.635] <b88e0473-7530-494c-82d8-e2d2ebd2666c_forweb> 
session=3148,thread=285
INFO [03:00:05.635] HttpFrontServer::FrontSH 
msgRecv:Remote=/10.13.193.68:42095,session=3148,thread=285
{noformat}
I think this error is caused by Text reuse, and I have found a solution.
Additionally, the table creation SQL is:
{code:sql}
CREATE EXTERNAL TABLE `web_searchhub`(
`line` string)
PARTITIONED BY (
`logdate` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '
U0000'
WITH SERDEPROPERTIES (
'serialization.encoding'='GBK')
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION
'viewfs://nsX/user/hive/warehouse/raw.db/web/web_searchhub';
{code}


> SerDeUtils  another bug ,when Text is reused
> --------------------------------------------
>
>                 Key: HIVE-11095
>                 URL: https://issues.apache.org/jira/browse/HIVE-11095
>             Project: Hive
>          Issue Type: Bug
>          Components: API, CLI
>    Affects Versions: 0.14.0, 1.0.0, 1.2.0
>         Environment: Hadoop 2.3.0-cdh5.0.0
> Hive 0.14
>            Reporter: xiaowei wang
>            Assignee: xiaowei wang
>             Fix For: 1.2.0
>
>         Attachments: HIVE-11095.1.patch.txt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
