I believe JSON itself has encoding rules. What I suggest you do is build your own input format or SerDe and escape those fields, possibly by converting them to hex.
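To make the hex idea concrete, here is a rough sketch of what the escaping step could look like if it were done while the JSON files are written, before Hive reads them. This is only an illustration, not a Hive SerDe: the function names `encode_record` and `decode_field` and the choice of which fields to hex-encode are made up for the example, and it assumes you can get at the raw bytes of each field.

```python
import json
from typing import Dict, Set


def encode_record(fields: Dict[str, bytes], hex_fields: Set[str]) -> str:
    """Serialize one record as a JSON line, hex-encoding the chosen fields.

    `fields` maps column names to raw bytes. `hex_fields` (hypothetical)
    names the columns that may contain non-UTF-8 bytes.
    """
    out = {}
    for name, raw in fields.items():
        if name in hex_fields:
            # bytes.hex() produces ASCII hex digits only, so any byte
            # sequence becomes a string that is safe inside JSON.
            out[name] = raw.hex()
        else:
            # Remaining fields are assumed to be valid UTF-8 text.
            out[name] = raw.decode("utf-8")
    return json.dumps(out)


def decode_field(hex_value: str) -> bytes:
    """Recover the original bytes of a hex-encoded field."""
    return bytes.fromhex(hex_value)


# Example: the city value carries a byte (0xff) that is not valid UTF-8.
line = encode_record({"uid": b"xxxxxxyyyy", "city": b"new delhi\xff"}, {"city"})
original_city = decode_field(json.loads(line)["city"])
```

With the data written this way, every line is plain ASCII JSON, so a JSON SerDe should no longer trip over the encoding. On the query side I think Hive's built-in unhex() could turn the column back into binary, but I have not tried that on Hive 2 with Tez.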

On Wednesday, November 23, 2016, Dana Ram Meghwal <dana...@saavn.com> wrote:

> Hey,
> Any leads?
>
> On Tue, Nov 22, 2016 at 5:35 PM, Dana Ram Meghwal <dana...@saavn.com> wrote:
>
>> Hey All,
>>
>> I am using Hive 2.0 with an external meta-store on EMR-5.0.0 and TEZ as
>> the execution engine.
>> Our data are stored in JSON format, so for serialization and
>> deserialization we are planning to use the lazy serde
>> (classname is 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe').
>>
>> My table definition is
>>
>> CREATE EXTERNAL TABLE IF NOT EXISTS
>> daily_active_users_summary_json_partition_dt_paths_v1
>> (uid string, city string, user string, songcount string, songid_list
>> array<string> ) PARTITIONED BY ( dt string)
>>
>> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>>
>> WITH SERDEPROPERTIES ('paths'='uid,city,user,songcount,songid_list')
>>
>> LOCATION 's3://<bucketname removed>/users/daily_active_users_summary_json_partition_dt';
>>
>>
>> and data look like this---
>>
>> {"uid":"xxxxxxyyyy","listening_user_flag":"non_listening","platform":"android","model":"micromax a110q","aquisition_channel":"organic","state":"delhi","app_version":"3.2:","country":"IN","city":"new delhi","new_listening_user_flag":"non_listening","manufacturer":"Micromax","login_mode":"loggedout","new_user_flag":"returning","digital_channel":"Not Source"}
>>
>>
>> Note: I have pasted here one record from the table.
>>
>>
>> Now, when I run the query
>>
>> select * from daily_active_users_summary_json_partition_dt_paths_v1
>> limit 5;
>>
>>
>> the first field of the table takes the complete record and the rest of
>> the fields show as NULL.
>>
>> When I use a different serde, 'org.apache.hive.hcatalog.data.JsonSerDe',
>>
>> then I can see the above query works fine and is able to serialize the
>> data perfectly fine. We want to use the lazy serde because our data
>> contains non-utf-8 characters and the latter serde does not support
>> non-utf-8 character serialization/deserialization.
>>
>>
>> Can you please help me solve this? We mostly want to use the lazy serde
>> only, as we have already experimented with other serdes and none of them
>> is working for us. Is there any configuration which enables
>> serialization/deserialization while using the lazy serde?
>>
>> Or is there any other serde which can process non-utf-8 characters fine
>> in hive-2 and tez?
>>
>> Thank you
>>
>>
>> Best Regards,
>> Dana Ram Meghwal
>> Software Engineer
>> dana...@saavn.com
>>
>
> --
> Dana Ram Meghwal
> Software Engineer
> dana...@saavn.com

--
Sorry this was sent from mobile. Will do less grammar and spell check than usual.