I'm new to Hive and I'm having trouble loading a simple data set with a regex.

I have a data file called test.txt that contains the following:

TESTONE-1
TESTTWO-2
TESTTHREE-3
TESTFOUR-4
TESTFIVE-5

I have this Hive script:

hive> CREATE TABLE test
> (
>  field_1 STRING
> )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES
> (
>  "input.regex" = "([^ ]*)",
>  "output.regex" = "%1$s"
> )
> STORED AS TEXTFILE;
Found class for org.apache.hadoop.hive.contrib.serde2.RegexSerDe
OK
Time taken: 0.064 seconds

hive> LOAD DATA LOCAL INPATH '/home/hadoop/test' OVERWRITE INTO TABLE test;
Copying data from file:/home/hadoop/test
Loading data to table test
OK
Time taken: 0.213 seconds

hive> SELECT * FROM test LIMIT 10;
OK
TESTONE-1
TESTTWO-2
TESTTHREE-3
TESTFOUR-4
TESTFIVE-5
Time taken: 0.153 seconds

This produces the expected output.

When I alter the Hive script to include two fields, every value comes back as NULL:

hive> CREATE TABLE test
> (
>  field_1 STRING,
>  field_2 STRING
> )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES
> (
>  "input.regex" = "([a-z,A-Z]*)(-\d*)",
>  "output.regex" = "%1$s %2$s"
> )
> STORED AS TEXTFILE;
Found class for org.apache.hadoop.hive.contrib.serde2.RegexSerDe
OK
Time taken: 0.025 seconds

hive> LOAD DATA LOCAL INPATH '/home/hadoop/test' OVERWRITE INTO TABLE test;
Copying data from file:/home/hadoop/test
Loading data to table test
OK
Time taken: 0.187 seconds

hive> SELECT * FROM test LIMIT 10;
OK
NULL    NULL
NULL    NULL
NULL    NULL
NULL    NULL
NULL    NULL
Time taken: 0.162 seconds

I've checked the regular expression against http://regexpal.com/ and it seems to check out. I think there may be an issue with the SerDe, but I don't know how to go about troubleshooting it.
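
One way to narrow this down outside Hive is to run the pattern through java.util.regex directly. The sketch below is only an illustration and rests on two assumptions that are not confirmed anywhere in this post: that RegexSerDe requires the whole line to match the pattern, and that the backslash in "\d" might be lost before the pattern reaches the regex engine. The class name RegexCheck and the helper groups() are hypothetical names made up for this example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCheck {
    // Returns the capture groups if the pattern matches the ENTIRE line,
    // or null if it does not (assumed to mirror how a row would become all NULLs).
    static String[] groups(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "TESTONE-1";

        // Pattern as intended: the regex engine sees \d (written \\d in Java source).
        System.out.println(Arrays.toString(groups("([a-z,A-Z]*)(-\\d*)", line)));
        // prints [TESTONE, -1]

        // Hypothetical case: the backslash was dropped, so the engine sees a literal d.
        System.out.println(Arrays.toString(groups("([a-z,A-Z]*)(-d*)", line)));
        // prints null
    }
}
```

If the second case is what Hive ends up passing to the SerDe, no line would match, which would be consistent with the all-NULL rows above. Writing the digit class as "\\d" or "[0-9]" in the table definition would be a cheap way to test that guess.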

I'm running this on Amazon's Elastic MapReduce.

Any help is appreciated.

-Sal
