[ https://issues.apache.org/jira/browse/HIVE-16351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Doubrovkine resolved HIVE-16351.
---------------------------------------
      Resolution: Duplicate
    Release Note: Figured it out. The problem is well described in https://www.phdata.io/hive-corruption-due-to-newlines-and-carriage-returns. If you set hive.query.result.fileformat to SequenceFile, the query works as expected, because the query output is written in a binary format rather than a text format. Dup of https://issues.apache.org/jira/browse/HIVE-1608

> Hive confused by CR/LFs
> -----------------------
>
>                 Key: HIVE-16351
>                 URL: https://issues.apache.org/jira/browse/HIVE-16351
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Serializers/Deserializers
>    Affects Versions: 1.2.1
>         Environment: Hadoop 2.7.3
>            Reporter: Daniel Doubrovkine
>         Attachments: Screen Shot 2017-04-03 at 4.33.23 PM.png
>
>
> Hive is returning broken data that contains CR/LF.
> {code}
> CREATE DATABASE positron;
> CREATE EXTERNAL TABLE positron.articles (
>   `_id` struct<oid:string>,
>   `channel_id` struct<oid:string>,
>   `exclude_google_news` boolean,
>   `fair_ids` array<map<string,string>>,
>   `hero_section` map<string,string>,
>   `partner_ids` array<map<string,string>>,
>   `description` string,
>   `partner_channel_id` struct<oid:string>,
>   `published` boolean,
>   `published_at` map<string,string>,
>   `slugs` array<string>,
>   `sections` array<map<string,string>>,
>   `thumbnail_image` string,
>   `title` string
> )
> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> WITH serdeproperties ('mapping.oid' = '$oid')
> LOCATION '/user/data/positron/articles';
> SELECT
>   a.slugs[SIZE(a.slugs) - 1],
>   a.title,
>   a.thumbnail_image
> FROM
>   positron.articles as a
> WHERE a.published = true
>   AND a.hero_section["type"] = "video"
>   AND (a.channel_id IS NOT NULL OR a.partner_channel_id IS NOT NULL)
>   AND a.thumbnail_image IS NOT NULL;
> {code}
> Note the NULLs below. The `title` field has a CR/LF in it.
> {code}
> astrid-caroline-cole-sneak-peek-realities-by-marc-gumpinger Sneak peek
> "Realities" by Marc Gumpinger
> https://artsy-media-uploads.s3.amazonaws.com/bUb1l_4g6cMhcDxEaPYDxw%2Facc_signature.png
> artsy-editorial-how-art-fairs-expanded-the-contemporary-art-market The Art
> Market, Explained: The Rise of the Art Fair
> https://artsy-media-uploads.s3.amazonaws.com/j8GIeamyufubMBgJFNHbFA%2Fartfairsex.jpg
> nolongercreek Alexandra Kehayoglou x Artsy: NULL
> No Longer Creek
> https://d32dm0rphc51dk.cloudfront.net/5oRwy7ysKHohtahIYUTE9Q/larger.jpg NULL
> kukje-gallery-teaser-trailer-kim-yong-ik Teaser Trailer | Kim Yong-Ik
> https://artsy-media-uploads.s3.amazonaws.com/mmoZcz0imuUzCavkObKgVQ%2Fkyi+thumbnail.PNG
> {code}
> {code}
> $ hive --version
> Hive 1.1.0-cdh5.6.0
> Subversion
> file:///data/jenkins/workspace/generic-package-ubuntu64-14-04/CDH5.6.0-Packaging-Hive-2016-01-28_21-19-00/hive-1.1.0+cdh5.6.0+377-1.cdh5.6.0.p0.110~trusty
> -r Unknown
> Compiled by jenkins on Thu Jan 28 21:35:50 PST 2016
> From source with checksum b4a8fadbcf1ca36d11d91805d3ec2743
> {code}
> What's very interesting is that I am not able to reproduce this locally with
> the same data with any version of hive. It only happens in our Cloudera
> cluster. Any help appreciated.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
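
The workaround named in the release note can be sketched as a session-level setting issued before the query. This is a minimal sketch, not a tested fix: it assumes the property is set per session with SET (it could presumably also go into the site configuration), and it reuses the table and columns from the report with a shortened WHERE clause. The second query is an alternative that does not come from the issue itself: it keeps the text result format and replaces CR/LF in the affected column using Hive's built-in regexp_replace.

```sql
-- Workaround from the release note: write query results as SequenceFile
-- (binary) rather than text, so a CR/LF embedded in a column value is not
-- read back as a row separator.
SET hive.query.result.fileformat=SequenceFile;

SELECT
  a.slugs[SIZE(a.slugs) - 1],
  a.title,
  a.thumbnail_image
FROM positron.articles a
WHERE a.published = true;

-- Alternative (an assumption, not taken from the issue): leave the result
-- format alone and replace runs of CR/LF in the title with a single space.
SELECT
  a.slugs[SIZE(a.slugs) - 1],
  regexp_replace(a.title, '[\\r\\n]+', ' ') AS title,
  a.thumbnail_image
FROM positron.articles a
WHERE a.published = true;
```

The first form leaves the stored data and the returned values unchanged; the second alters the returned title, so it only fits cases where losing the original line breaks is acceptable.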