Hi Ankit, I know your problem because I had to deal with a thorn ('þ') separated file too. So far, Hive cannot handle multibyte separators, so I turned to the custom SerDe option myself. If you manage to capture the 'þ' in the regex, you could try what I tried:
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
"(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)")
'þ' is recognized by 'þ' in my case, but this regex was too greedy. In the
end I had to write a regex for every field between the separators, and that
got so complicated that I wrote an MR job to replace the 'þ' with '~', which
Hive accepts as a field separator (ROW FORMAT DELIMITED FIELDS TERMINATED BY
'~').
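
For what it is worth, one common way to tame a greedy pattern like the one
above is to replace each (.*) with a negated character class, so a group can
never run across a separator. This is only a sketch I have not run against
Hive: it uses Python's re module to show the idea, NUM_FIELDS and the sample
line are made-up values, and it assumes the 'þ' has already been decoded
correctly, which is the hard part in this thread.

```python
import re

SEP = "þ"        # thorn separator from the thread
NUM_FIELDS = 4   # hypothetical; use the number of columns in your table

# Build "([^þ]*)þ([^þ]*)þ..." so that no group can contain a separator.
field = "([^" + SEP + "]*)"
pattern = re.compile(SEP.join([field] * NUM_FIELDS))

# Hypothetical sample line: fields f0..f3 joined by the thorn.
line = SEP.join("f%d" % i for i in range(NUM_FIELDS))
match = pattern.fullmatch(line)
fields = match.groups()

# A line with the wrong number of fields does not match at all.
short_match = pattern.fullmatch("a" + SEP + "b")
```

Java regexes support the same negated character class syntax, so the
generated pattern string could in principle be pasted into "input.regex",
but whether Hive then sees the 'þ' as intended still depends on the file
encoding.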
So I turned to another solution, and I am happy I did. Keep us posted if you
find another way.
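
In case it helps, the replacement step itself is simple. The sketch below is
not my MR job; it is a plain single-process Python stand-in that shows the
same substitution, with a hypothetical helper name (replace_separator) and
made-up sample data.

```python
import io

def replace_separator(src, dst, old="þ", new="~"):
    """Copy text from src to dst line by line, swapping the separator."""
    for line in src:
        dst.write(line.replace(old, new))

# Hypothetical in-memory sample standing in for a log file.
source = io.StringIO("aþbþc\n1þ2þ3\n")
target = io.StringIO()
replace_separator(source, target)
converted = target.getvalue()
```

With real files you would open both sides with an explicit encoding that
matches the logs (an assumption you have to check: 'þ' is the single byte
0xFE in ISO-8859-1 but two bytes in UTF-8), and you should make sure '~'
never occurs inside the data before using it as the new separator.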
Jasper
2011/5/8 ankit bhatnagar <[email protected]>
> Hi
>
> I am facing a weird issue with file parsing. My log files use a thorn
> 'þ' as the separator.
> I tried writing a test case for the deserializer and am confused: it
> works fine when I pass the line to the deserializer directly, but when I
> run it in Hive the line is not split into columns and the table in Hive
> still contains the thorn as is.
>
> Any help would be appreciated.
>
> Thanks
> Ankit
>
--
Kind Regards \ Met Vriendelijke Groet,
Jasper Knulst
BI Consultant
VLC Den Haag
Gildeweg 5B
2632 BD Nootdorp
M: +31 (0)6 19 66 75 11
T: +31 (0)15 764 07 50
------------------------------------------------------------
Skype: jasper_knulst_vlc
