Hive splits/adds rows when outputting dataset with new lines

Maciek Mon, 06 Oct 2014 04:53:00 -0700

Hello,

I've encountered a situation where a value containing newlines corrupts
(multiplies) the rows of the returned dataset.
This seems to be similar to HIVE-3012
<https://issues.apache.org/jira/browse/HIVE-3012> (fixed in 0.11), but I'm
on Hive 0.13 and still see it.
Here are the steps to illustrate/reproduce:


1. First, let's create a table with one row and one column by selecting from
any existing table (substitute ANYTABLE accordingly):

CREATE TABLE singlerow AS SELECT 'worldofhostels' wordsmerged FROM ANYTABLE
LIMIT 1;

and verify:

SELECT * FROM singlerow;

OK-----------
worldofhostels

Time taken: 0.028 seconds, Fetched: 1 row(s)

All good so far.

2. Now let's introduce newlines into the value:

SELECT regexp_replace(wordsmerged,'of',"\nof\n") wordsseparate FROM
singlerow;

OK----------

world
of
hostels

Time taken: 6.404 seconds, Fetched: 3 row(s)

I'm suddenly getting 3 rows now.

3. This is not limited to CLI output: a CTAS materializes the same
corrupted result set:

CREATE TABLE corrupted AS
SELECT regexp_replace(wordsmerged,'of',"\nof\n") wordsseparate, wordsmerged
FROM singlerow;

hive> select * from corrupted;

OK

world NULL
of NULL
hostels worldofhostels

Time taken: 0.029 seconds, Fetched: 3 row(s)
Apparently the same thing happens: the single row in the new table is split
into multiple rows, and the columns following the one in question (like
wordsmerged) become NULL in all but the last of them.
Am I doing something wrong here?
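
My guess at the mechanism, as a rough sketch rather than anything taken from
the Hive source: if the table is stored as plain delimited text, with "\n"
ending each row and Ctrl-A ("\x01") separating columns and no escaping of
embedded newlines, then a reader would see exactly the three rows above. The
function names below (write_rows, read_rows) are made up for illustration
and are not Hive APIs.

```python
# Hypothetical simulation, not Hive code: mimics a text table that uses
# '\n' as the row terminator and '\x01' (Ctrl-A) as the field delimiter.
FIELD_DELIM = "\x01"
ROW_DELIM = "\n"


def write_rows(rows):
    """Serialize rows as delimited text, with no escaping of embedded newlines."""
    return "".join(FIELD_DELIM.join(row) + ROW_DELIM for row in rows)


def read_rows(data, num_columns):
    """Split on the row terminator, then on the field delimiter.

    Columns missing from a line are padded with None (shown as NULL in Hive).
    """
    rows = []
    # The final element after split() is the empty string that follows the
    # last row terminator, so it is dropped.
    for line in data.split(ROW_DELIM)[:-1]:
        fields = line.split(FIELD_DELIM)
        fields += [None] * (num_columns - len(fields))
        rows.append(tuple(fields))
    return rows


wordsmerged = "worldofhostels"
# Same effect as regexp_replace(wordsmerged, 'of', "\nof\n") in the query.
wordsseparate = wordsmerged.replace("of", "\nof\n")

stored = write_rows([(wordsseparate, wordsmerged)])
result = read_rows(stored, num_columns=2)
for row in result:
    print(row)
```

One row goes in and three come out, with the second column missing on every
line except the last, which matches what SELECT * FROM corrupted returns.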

Regards,
Maciek
