[ 
https://issues.apache.org/jira/browse/HIVE-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829976#comment-13829976
 ] 

Lefty Leverenz edited comment on HIVE-5871 at 9/10/14 4:47 AM:
---------------------------------------------------------------

This implementation mainly relies on LazySimpleSerDe for serialization and 
deserialization. I added some methods to LazyStruct to parse a row delimited by 
a multiple-character string. Another difference from LazySimpleSerDe is that 
MultiDelimitSerDe doesn't use Base64 to encode binary fields in serialization, 
because the encoded string may interfere with the delimiter. I also modified 
LazyBinary so that when it deserializes a binary field and is unable to 
Base64-decode the field, it just keeps the data unchanged. A simple use case is 
as follows:

create table test (id string,hivearray array<binary>,hivemap map<string,int>) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH 
SERDEPROPERTIES 
("field.delimited"="[,]","collection.delimited"=":","mapkey.delimited"="@");

where field.delimited is the multiple-character field delimiter, 
collection.delimited is the delimiter for collection items, and 
mapkey.delimited is the delimiter between keys and values in maps. We currently 
don't support multiple characters for the latter two delimiters.

<Edited 10/Sep/14 on behalf of Rui Li>  This comment's example differs from the 
final version of the patch.  See the description above for an accurate example, 
and note that the SERDEPROPERTIES are *.delim rather than *.delimited.
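
To make the delimiter roles concrete, here is a rough sketch in Python of how one 
text row for the example table would be split. It is only an illustration and not 
the LazyStruct code from the patch: the function name parse_row and the sample row 
are made up, binary array items are kept as plain strings, and it assumes map 
entries are separated by the collection delimiter, as in LazySimpleSerDe.

```python
# Illustrative only: mimics the row layout described above.
# This is NOT the SerDe's implementation; parse_row is a hypothetical helper.
def parse_row(line, field_delim="[,]", collection_delim=":", mapkey_delim="@"):
    """Split one text row into (id, array items, map) for the example table."""
    # str.split treats the delimiter as a literal string, so "[,]" is not a regex.
    id_field, array_field, map_field = line.split(field_delim)

    # Collection items are separated by the single-character collection delimiter.
    array_items = array_field.split(collection_delim) if array_field else []

    # Assumption: map entries also use the collection delimiter, and each
    # key/value pair is separated by the map-key delimiter.
    hive_map = {}
    for entry in (map_field.split(collection_delim) if map_field else []):
        key, value = entry.split(mapkey_delim, 1)
        hive_map[key] = int(value)  # the example column is map<string,int>

    return id_field, array_items, hive_map


if __name__ == "__main__":
    print(parse_row("1[,]a:b:c[,]k1@10:k2@20"))
```

Because the field delimiter is matched as a whole string, a lone "," or "[" inside 
a field does not end the field in this sketch; only the full "[,]" sequence does.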


> Use multiple-characters as field delimiter
> ------------------------------------------
>
>                 Key: HIVE-5871
>                 URL: https://issues.apache.org/jira/browse/HIVE-5871
>             Project: Hive
>          Issue Type: Improvement
>          Components: Contrib
>    Affects Versions: 0.12.0
>            Reporter: Rui Li
>            Assignee: Rui Li
>              Labels: TODOC14
>             Fix For: 0.14.0
>
>         Attachments: HIVE-5871.2.patch, HIVE-5871.3.patch, HIVE-5871.4.patch, 
> HIVE-5871.5.patch, HIVE-5871.6.patch, HIVE-5871.patch
>
>
> By default, Hive only allows a single character as the field delimiter. 
> Although RegexSerDe can handle a multiple-character delimiter, it can 
> be daunting to use, especially for new users.
> The patch adds a new SerDe named MultiDelimitSerDe. With MultiDelimitSerDe, 
> users can specify a multiple-character field delimiter when creating tables, 
> in much the same way as a typical table creation. For example:
> {code}
> create table test (id string,hivearray array<binary>,hivemap map<string,int>) 
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' 
> WITH SERDEPROPERTIES 
> ("field.delim"="[,]","collection.delim"=":","mapkey.delim"="@");
> {code}
> where {{field.delim}} is the field delimiter, and {{collection.delim}} and 
> {{mapkey.delim}} are the delimiters for collection items and key-value pairs, 
> respectively. Among these delimiters, {{field.delim}} is mandatory and can be 
> multiple characters long, while {{collection.delim}} and {{mapkey.delim}} are 
> optional and support only a single character.
> To use MultiDelimitSerDe, you have to add the hive-contrib jar to the class 
> path, e.g. with the {{add jar}} command.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
