[ 
https://issues.apache.org/jira/browse/HIVE-16351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Doubrovkine updated HIVE-16351:
--------------------------------------
    Description: 
Hive is returning broken data that contains CR/LF.

{code}
CREATE DATABASE positron;

CREATE EXTERNAL TABLE positron.articles (
  `_id` struct<oid:string>,
  `channel_id` struct<oid:string>,
  `exclude_google_news` boolean,
  `fair_ids` array<map<string,string>>,
  `hero_section` map<string,string>,
  `partner_ids` array<map<string,string>>,
  `description` string,
  `partner_channel_id` struct<oid:string>,
  `published` boolean,
  `published_at` map<string,string>,
  `slugs` array<string>,
  `sections` array<map<string,string>>,
  `thumbnail_image` string,
  `title` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH serdeproperties ('mapping.oid' = '$oid')
LOCATION '/user/data/positron/articles';

SELECT
  a.slugs[SIZE(a.slugs) - 1],
  a.title,
  a.thumbnail_image
FROM
  positron.articles as a
WHERE a.published = true
AND a.hero_section["type"] = "video"
AND (a.channel_id IS NOT NULL OR a.partner_channel_id IS NOT NULL)
AND a.thumbnail_image IS NOT NULL;
{code}

{code}
astrid-caroline-cole-sneak-peek-realities-by-marc-gumpinger   Sneak peek 
"Realities" by Marc Gumpinger   
https://artsy-media-uploads.s3.amazonaws.com/bUb1l_4g6cMhcDxEaPYDxw%2Facc_signature.png
artsy-editorial-how-art-fairs-expanded-the-contemporary-art-market   The Art 
Market, Explained: The Rise of the Art Fair   
https://artsy-media-uploads.s3.amazonaws.com/j8GIeamyufubMBgJFNHbFA%2Fartfairsex.jpg
nolongercreek  Alexandra Kehayoglou x Artsy: NULL
No Longer Creek   
https://d32dm0rphc51dk.cloudfront.net/5oRwy7ysKHohtahIYUTE9Q/larger.jpg NULL
kukje-gallery-teaser-trailer-kim-yong-ik  Teaser Trailer | Kim Yong-Ik  
https://artsy-media-uploads.s3.amazonaws.com/mmoZcz0imuUzCavkObKgVQ%2Fkyi+thumbnail.PNG
{code}

{code}
$ hive --version
Hive 1.1.0-cdh5.6.0
Subversion 
file:///data/jenkins/workspace/generic-package-ubuntu64-14-04/CDH5.6.0-Packaging-Hive-2016-01-28_21-19-00/hive-1.1.0+cdh5.6.0+377-1.cdh5.6.0.p0.110~trusty
 -r Unknown
Compiled by jenkins on Thu Jan 28 21:35:50 PST 2016
>From source with checksum b4a8fadbcf1ca36d11d91805d3ec2743
{code}

What's very interesting is that I am not able to reproduce this locally with 
the same data with any version of hive. Any help appreciated.

  was:
>From https://github.com/rcongiu/Hive-JSON-Serde/issues/65

This happens with both JSON and MongoDB connector Serde, so I don't believe 
this is a Serde bug.

Using 
http://www.congiu.net/hive-json-serde/1.3.6/cdh4/json-serde-1.3.6-jar-with-dependencies.jar
 placed into /usr/local/Cellar/apache-hive-1.2.1/lib

A dummy test.json with a CR/LF

{code}
$ cat /tmp/test.json
{"text":"foo\nbar","number":123}

$ hadoop fs -mkdir /user/data

$ hadoop fs -put -f /tmp/test.json /user/data/test.json

$ hive
hive> CREATE DATABASE test;

hive> CREATE EXTERNAL TABLE test ( text string )
    > ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    > LOCATION '/user/data';

hive> SELECT * FROM test;

foo
bar     123
NULL    NULL
{code}

You can see how that's totally wrong, there's only one row of data.


> Hive confused by CR/LFs
> -----------------------
>
>                 Key: HIVE-16351
>                 URL: https://issues.apache.org/jira/browse/HIVE-16351
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Serializers/Deserializers
>    Affects Versions: 1.2.1
>         Environment: Hadoop 2.7.3
>            Reporter: Daniel Doubrovkine
>
> Hive is returning broken data that contains CR/LF.
> {code}
> CREATE DATABASE positron;
> CREATE EXTERNAL TABLE positron.articles (
>   `_id` struct<oid:string>,
>   `channel_id` struct<oid:string>,
>   `exclude_google_news` boolean,
>   `fair_ids` array<map<string,string>>,
>   `hero_section` map<string,string>,
>   `partner_ids` array<map<string,string>>,
>   `description` string,
>   `partner_channel_id` struct<oid:string>,
>   `published` boolean,
>   `published_at` map<string,string>,
>   `slugs` array<string>,
>   `sections` array<map<string,string>>,
>   `thumbnail_image` string,
>   `title` string
> )
> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> WITH serdeproperties ('mapping.oid' = '$oid')
> LOCATION '/user/data/positron/articles';
> SELECT
>   a.slugs[SIZE(a.slugs) - 1],
>   a.title,
>   a.thumbnail_image
> FROM
>   positron.articles as a
> WHERE a.published = true
> AND a.hero_section["type"] = "video"
> AND (a.channel_id IS NOT NULL OR a.partner_channel_id IS NOT NULL)
> AND a.thumbnail_image IS NOT NULL;
> {code}
> {code}
> astrid-caroline-cole-sneak-peek-realities-by-marc-gumpinger   Sneak peek 
> "Realities" by Marc Gumpinger   
> https://artsy-media-uploads.s3.amazonaws.com/bUb1l_4g6cMhcDxEaPYDxw%2Facc_signature.png
> artsy-editorial-how-art-fairs-expanded-the-contemporary-art-market   The Art 
> Market, Explained: The Rise of the Art Fair   
> https://artsy-media-uploads.s3.amazonaws.com/j8GIeamyufubMBgJFNHbFA%2Fartfairsex.jpg
> nolongercreek  Alexandra Kehayoglou x Artsy: NULL
> No Longer Creek   
> https://d32dm0rphc51dk.cloudfront.net/5oRwy7ysKHohtahIYUTE9Q/larger.jpg NULL
> kukje-gallery-teaser-trailer-kim-yong-ik  Teaser Trailer | Kim Yong-Ik  
> https://artsy-media-uploads.s3.amazonaws.com/mmoZcz0imuUzCavkObKgVQ%2Fkyi+thumbnail.PNG
> {code}
> {code}
> $ hive --version
> Hive 1.1.0-cdh5.6.0
> Subversion 
> file:///data/jenkins/workspace/generic-package-ubuntu64-14-04/CDH5.6.0-Packaging-Hive-2016-01-28_21-19-00/hive-1.1.0+cdh5.6.0+377-1.cdh5.6.0.p0.110~trusty
>  -r Unknown
> Compiled by jenkins on Thu Jan 28 21:35:50 PST 2016
> From source with checksum b4a8fadbcf1ca36d11d91805d3ec2743
> {code}
> What's very interesting is that I am not able to reproduce this locally with 
> the same data with any version of hive. Any help appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to