[ 
https://issues.apache.org/jira/browse/HIVE-16351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Doubrovkine updated HIVE-16351:
--------------------------------------
    Comment: was deleted

(was: Those NULLs actually come from an empty record. Is that right? Should the 
last empty line be ignored, since these are text files, and if so, which 
component's responsibility is it? 

{code}
~$ cat /tmp/test.json 
{"text":"foo\nbar","number":123}
{"text":"bar\nfoo","number":345}
~$ perl -pe 'chomp if eof' /tmp/test.json > /tmp/test2.json
~$ cat /tmp/test2.json 
{"text":"foo\nbar","number":123}
{"text":"bar\nfoo","number":345}~$ 
$ hadoop fs -put -f /tmp/test2.json /user/data/test.json
$ hive
hive> SELECT * FROM test;
OK
foo
bar     123
bar
foo     345
{code})
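
One way to check whether the input itself contains stray record terminators is to split the file the way a line-based reader would and try to parse each piece. The following is a minimal Python sketch, not part of Hive or the SerDe. It assumes the reader treats CR, LF, and CRLF all as record terminators, and `find_broken_records` is a hypothetical helper name.

```python
import json
import re


def find_broken_records(data: bytes):
    """Return (record_number, reason, text) for every suspicious record.

    Assumption: the line-based reader ends a record at CR, LF, or CRLF.
    """
    records = re.split(rb"\r\n|\r|\n", data)
    problems = []
    for number, raw in enumerate(records, start=1):
        if raw == b"":
            # The piece after a final terminator is not a record; skip it.
            if number == len(records):
                continue
            problems.append((number, "empty record", ""))
            continue
        try:
            json.loads(raw.decode("utf-8"))
        except ValueError as exc:
            # JSONDecodeError and UnicodeDecodeError both derive from ValueError.
            text = raw.decode("utf-8", "replace")
            problems.append((number, f"not valid JSON: {exc}", text))
    return problems


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        for number, reason, text in find_broken_records(handle.read()):
            print(f"record {number}: {reason}: {text!r}")
```

Under that assumption, any record it reports would reach the SerDe as an empty or partial JSON document. Escaped newlines inside a JSON string (the two characters `\n`) are not terminators and are not reported.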

> Hive confused by CR/LFs
> -----------------------
>
>                 Key: HIVE-16351
>                 URL: https://issues.apache.org/jira/browse/HIVE-16351
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Serializers/Deserializers
>    Affects Versions: 1.2.1
>         Environment: Hadoop 2.7.3
>            Reporter: Daniel Doubrovkine
>
> Hive returns broken rows when the data contains CR/LF.
> {code}
> CREATE DATABASE positron;
> CREATE EXTERNAL TABLE positron.articles (
>   `_id` struct<oid:string>,
>   `channel_id` struct<oid:string>,
>   `exclude_google_news` boolean,
>   `fair_ids` array<map<string,string>>,
>   `hero_section` map<string,string>,
>   `partner_ids` array<map<string,string>>,
>   `description` string,
>   `partner_channel_id` struct<oid:string>,
>   `published` boolean,
>   `published_at` map<string,string>,
>   `slugs` array<string>,
>   `sections` array<map<string,string>>,
>   `thumbnail_image` string,
>   `title` string
> )
> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> WITH serdeproperties ('mapping.oid' = '$oid')
> LOCATION '/user/data/positron/articles';
> SELECT
>   a.slugs[SIZE(a.slugs) - 1],
>   a.title,
>   a.thumbnail_image
> FROM
>   positron.articles as a
> WHERE a.published = true
> AND a.hero_section["type"] = "video"
> AND (a.channel_id IS NOT NULL OR a.partner_channel_id IS NOT NULL)
> AND a.thumbnail_image IS NOT NULL;
> {code}
> Note the NULLs below.
> {code}
> astrid-caroline-cole-sneak-peek-realities-by-marc-gumpinger   Sneak peek 
> "Realities" by Marc Gumpinger   
> https://artsy-media-uploads.s3.amazonaws.com/bUb1l_4g6cMhcDxEaPYDxw%2Facc_signature.png
> artsy-editorial-how-art-fairs-expanded-the-contemporary-art-market   The Art 
> Market, Explained: The Rise of the Art Fair   
> https://artsy-media-uploads.s3.amazonaws.com/j8GIeamyufubMBgJFNHbFA%2Fartfairsex.jpg
> nolongercreek  Alexandra Kehayoglou x Artsy: NULL
> No Longer Creek   
> https://d32dm0rphc51dk.cloudfront.net/5oRwy7ysKHohtahIYUTE9Q/larger.jpg NULL
> kukje-gallery-teaser-trailer-kim-yong-ik  Teaser Trailer | Kim Yong-Ik  
> https://artsy-media-uploads.s3.amazonaws.com/mmoZcz0imuUzCavkObKgVQ%2Fkyi+thumbnail.PNG
> {code}
> {code}
> $ hive --version
> Hive 1.1.0-cdh5.6.0
> Subversion file:///data/jenkins/workspace/generic-package-ubuntu64-14-04/CDH5.6.0-Packaging-Hive-2016-01-28_21-19-00/hive-1.1.0+cdh5.6.0+377-1.cdh5.6.0.p0.110~trusty -r Unknown
> Compiled by jenkins on Thu Jan 28 21:35:50 PST 2016
> From source with checksum b4a8fadbcf1ca36d11d91805d3ec2743
> {code}
> Interestingly, I cannot reproduce this locally with the same data on any 
> version of Hive. Any help is appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
