RE: Loading data containing newlines

Gerber, Bryan W Tue, 12 Jan 2016 09:59:26 -0800

From that wiki:
"This SerDe works for most CSV data, but does not handle embedded newlines."

The Hive SerDe interface is all downstream of the TextInputFormat, which has 
already split records by newlines.  In theory you can give it a different line 
delimiter, but Hive 1.2.1 does not support it: "FAILED: SemanticException 3:20 
LINES TERMINATED BY only supports newline '\n' right now."
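
To illustrate why the SerDe never gets a chance to see the whole field, here is 
a minimal Python sketch (not Hive code; the sample data and variable names are 
made up for illustration). It compares splitting on newlines before parsing, 
which is roughly what a line-oriented input format does, with a quote-aware CSV 
parser reading the same text:

```python
import csv
import io

# Hypothetical sample: the quoted field in the second record contains a newline.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Line-oriented approach: records are cut at every newline before any CSV
# parsing happens, so the quoted field is torn into two "records".
line_records = data.splitlines()

# Quote-aware approach: the parser tracks quoting while it reads, so the
# embedded newline stays inside the field.
csv_records = list(csv.reader(io.StringIO(data, newline="")))

print(len(line_records))  # 4 pieces, one of them a broken half-record
print(len(csv_records))   # 3 records: header plus two data rows
```

Once the split has happened, a per-line parser downstream has no way to know 
that two of those pieces belong together.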

From: Alexander Pivovarov [mailto:apivova...@gmail.com]
Sent: Tuesday, January 12, 2016 9:52 AM
To: user@hive.apache.org
Subject: Re: Loading data containing newlines

Try the CSV SerDe. It should correctly parse a quoted field value that contains 
a newline:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde

Hadoop should automatically read bz2 files.

On Tue, Jan 12, 2016 at 9:40 AM, Gerber, Bryan W <bryan.ger...@pnnl.gov> wrote:
We are attempting to load CSV text files (compressed to bz2) that contain 
newlines in fields, using EXTERNAL tables and INSERT/SELECT into ORC format 
tables.  Data volume is ~1TB/day, so we are really trying to avoid unpacking 
the files to condition the data.

A few days of research has us ready to implement custom input/output formats 
to handle the ingest.  Are there any other suggestions that would take less 
effort and have low impact on load times?
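
As a rough idea of what the core of such a custom reader has to do, the 
following Python sketch joins physical lines into logical records until the 
double quotes balance. It is only an illustration under stated assumptions: the 
function name logical_records is hypothetical, fields are assumed to be quoted 
with '"' and to escape quotes by doubling them, and the sketch ignores how the 
input is divided into splits, which a real Hadoop record reader would have to 
deal with.

```python
from typing import Iterable, Iterator


def logical_records(lines: Iterable[str], quote: str = '"') -> Iterator[str]:
    """Join physical lines until every quoted field has been closed."""
    pending = []
    in_quotes = False
    for line in lines:
        pending.append(line)
        # An odd number of quote characters flips the in-quotes state.
        # Doubled (escaped) quotes come in pairs and leave it unchanged.
        if line.count(quote) % 2 == 1:
            in_quotes = not in_quotes
        if not in_quotes:
            yield "\n".join(pending)
            pending = []
    if pending:
        # Unterminated quoted field at end of input: emit what was read.
        yield "\n".join(pending)


# Hypothetical input: three physical lines that form two logical records.
physical = ['1,"first line', 'second line",x', '2,plain']
records = list(logical_records(physical))
```

Here records holds two entries, the first of which still contains the embedded 
newline, so a CSV parser applied afterwards sees the complete quoted field.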

Thanks,
Bryan G.
