I believe JSON itself has encoding rules. What I suggest you do is build
your own input format or SerDe and escape those fields, possibly by
converting them to hex.
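
As a rough illustration of the hex-escaping idea, here is a minimal sketch
written as a Python pre-processing step rather than as a custom input format
or SerDe (that substitution is an assumption made to keep the example short).
The function names hex_escape_fields and hex_unescape are hypothetical, and
the sketch assumes one JSON object per line whose string values contain no
\u escapes above U+00FF.

```python
import json


def hex_escape_fields(raw_line, fields):
    """Return an ASCII-only JSON line with the chosen fields hex-encoded.

    raw_line is the record as bytes; fields lists the keys to escape.
    Both names are hypothetical and used only for illustration.
    """
    # latin-1 maps every byte 0x00-0xFF to exactly one code point, so the
    # decode cannot fail and the original bytes can be recovered later.
    record = json.loads(raw_line.decode("latin-1"))
    for name in fields:
        value = record.get(name)
        if isinstance(value, str):
            # Recover the original bytes and store them as hex digits.
            record[name] = value.encode("latin-1").hex()
    # json.dumps escapes non-ASCII by default, so the result is plain ASCII.
    return json.dumps(record)


def hex_unescape(value):
    """Turn a hex-encoded field back into the original bytes."""
    return bytes.fromhex(value)


if __name__ == "__main__":
    # 0xEE on its own is not valid UTF-8.
    line = b'{"uid":"xxxxxxyyyy","city":"new delh\xee"}'
    print(hex_escape_fields(line, ["city"]))
```

Once the escaped fields are plain ASCII, a JSON SerDe should no longer trip
over them, and Hive's built-in unhex() function can decode the column at
query time if the raw bytes are needed.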

On Wednesday, November 23, 2016, Dana Ram Meghwal <dana...@saavn.com> wrote:

> Hey,
> Any leads?
>
> On Tue, Nov 22, 2016 at 5:35 PM, Dana Ram Meghwal <dana...@saavn.com>
> wrote:
>
>> Hey All,
>>
>> I am using Hive 2.0 with an external metastore on EMR 5.0.0 and Tez as
>> the execution engine.
>> Our data is stored in JSON format, so for serialization and
>> deserialization we are planning to use the lazy SerDe
>> (class name 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe').
>>
>> My table definition is
>>
>> CREATE EXTERNAL TABLE IF NOT EXISTS
>> daily_active_users_summary_json_partition_dt_paths_v1
>> (uid string, city string, user string, songcount string,
>> songid_list array<string>)
>> PARTITIONED BY (dt string)
>> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>> WITH SERDEPROPERTIES ('paths'='uid,city,user,songcount,songid_list')
>> LOCATION 's3://<bucketname removed>/users/daily_active_users_summary_json_partition_dt';
>>
>>
>> and the data looks like this:
>>
>> {"uid":"xxxxxxyyyy","listening_user_flag":"non_listening","platform":"android","model":"micromax a110q","aquisition_channel":"organic","state":"delhi","app_version":"3.2:","country":"IN","city":"new delhi","new_listening_user_flag":"non_listening","manufacturer":"Micromax","login_mode":"loggedout","new_user_flag":"returning","digital_channel":"Not Source"}
>>
>>
>> Note: I have pasted one record from the table here.
>>
>>
>> Now, when I run the query
>>
>> select * from daily_active_users_summary_json_partition_dt_paths_v1
>> limit 5;
>>
>>
>> the first field of the table takes the complete record and the rest of
>> the fields show as NULL.
>>
>> When I use a different SerDe, 'org.apache.hive.hcatalog.data.JsonSerDe',
>> the above query works fine and is able to serialize the data
>> perfectly fine. We want to use the lazy SerDe because our data contains
>> non-UTF-8 characters and the latter SerDe does not support non-UTF-8
>> character serialization/deserialization.
>>
>>
>> Can you please help me solve this? We mostly want to use the lazy SerDe
>> only, as we have already experimented with other SerDes and none of them
>> is working for us. Is there any configuration which enables
>> serialization/deserialization while using the lazy SerDe?
>>
>> Or is there any other SerDe which can process non-UTF-8 characters fine
>> in Hive 2 and Tez?
>>
>> Thank you
>>
>>
>> Best Regards,
>> Dana Ram Meghwal
>> Software Engineer
>> dana...@saavn.com
>>
>>
>
>
> --
> Dana Ram Meghwal
> Software Engineer
> dana...@saavn.com
>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.
