Hi, 1.1.4 is an oldish version of the JSON SerDe. Have you tried the most recent build from the master branch?
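For what it's worth, a minimal sketch of what re-testing would look like (the jar version is a placeholder for whatever building master produces; the existing DDL below keeps working because the table references the SerDe by class name, org.openx.data.jsonserde.JsonSerDe, which the newer jar also provides):

ADD JAR /home/hadoop/lib/hive/json-serde-<version>-jar-with-dependencies.jar;
-- then re-run the CREATE TABLE / LOAD DATA / CTAS sequence quoted below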
On Mon, Jun 23, 2014 at 10:23 AM, Christian Link <christian.l...@mdmp.com> wrote:
> Hi,
>
> thanks...but I need to sort things out with ONE SerDe/strategy.
> I've started with André's idea: using Roberto Congiu's SerDe and André's
> template to create a table with the right schema and loading the data
> afterwards.
>
> But it's not completely working...
>
> I did the following (sorry for spamming...):
>
> 1. Create the table and load the data:
>
> -- create database (if not exists)
> CREATE DATABASE IF NOT EXISTS mdmp_api_dump;
>
> -- connect to database
> USE mdmp_api_dump;
>
> -- add SerDe for JSON processing
> ADD JAR /home/hadoop/lib/hive/json-serde-1.1.4-jar-with-dependencies.jar;
>
> -- drop old raw data
> DROP TABLE IF EXISTS mdmp_raw_data;
>
> -- create raw data table
> CREATE TABLE mdmp_raw_data (
>   action string,
>   batch array<
>     struct<
>       timestamp:string,
>       traits:map<string,string>,
>       requestId:string,
>       sessionId:string,
>       event:string,
>       userId:string,
>       action:string,
>       context:map<string,string>,
>       properties:map<string,string>
>     >
>   >,
>   context struct<
>     build:map<string,string>,
>     device:struct<
>       brand:string,
>       manufacturer:string,
>       model:string,
>       release:string,
>       sdk:int
>     >,
>     display:struct<
>       density:double,
>       height:int,
>       width:int
>     >,
>     integrations:map<string,boolean>,
>     library:string,
>     libraryVersion:string,
>     locale:map<string,string>,
>     location:map<string,string>,
>     telephony:map<string,string>,
>     wifi:map<string,boolean>
>   >,
>   received_at string,
>   requestTimestamp string,
>   writeKey string
> )
> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> STORED AS TEXTFILE;
>
> -- load data
> LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE INTO TABLE `mdmp_raw_data`;
>
> 2. Run a query against the "raw data" and create the "formatted table":
>
> ADD JAR /home/hadoop/lib/hive/json-serde-1.1.4-jar-with-dependencies.jar;
>
> USE mdmp_api_dump;
>
> DROP TABLE IF EXISTS mdmp_api_data;
>
> CREATE TABLE mdmp_api_data AS
> SELECT DISTINCT
>   a.action,
>   a.received_at,
>   a.requestTimestamp,
>   a.writeKey,
>   a.context.device.brand as brand,
>   a.context.device.manufacturer as manufacturer,
>   a.context.device.model as model,
>   a.context.device.release as release,
>   a.context.device.sdk as sdk,
>   -- a.context.display.density as density,
>   a.context.display.height as height,
>   a.context.display.width as width,
>   a.context.telephony['radio'] as tel_radio,
>   a.context.telephony['carrier'] as tel_carrier,
>   a.context.wifi['connected'] as wifi_connected,
>   a.context.wifi['available'] as wifi_available,
>   a.context.locale['carrier'] as loce_carrier,
>   a.context.locale['language'] as loce_language,
>   a.context.locale['country'] as loce_country,
>   a.context.integrations['Tapstream'] as int_tapstream,
>   a.context.integrations['Amplitude'] as int_amplitude,
>   a.context.integrations['Localytics'] as int_localytics,
>   a.context.integrations['Flurry'] as int_flurry,
>   a.context.integrations['Countly'] as int_countly,
>   a.context.integrations['Quantcast'] as int_quantcast,
>   a.context.integrations['Crittercism'] as int_crittercism,
>   a.context.integrations['Google Analytics'] as int_googleanalytics,
>   a.context.integrations['Mixpanel'] as int_mixpanel,
>   b.batch.action AS b_action,
>   b.batch.context,
>   b.batch.event,
>   b.batch.properties,
>   b.batch.requestId,
>   b.batch.sessionId,
>   b.batch.timestamp,
>   b.batch.traits,
>   b.batch.userId
> FROM mdmp_raw_data a
> LATERAL VIEW explode(a.batch) b AS batch;
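A quick sanity check at this point (a sketch only; mdmp_api_data, b_action and event are the names the CTAS above produces) would be to count rows per event and confirm the explode fanned out one row per batch element:

SELECT b_action, event, COUNT(*) AS n
FROM mdmp_api_data
GROUP BY b_action, event;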
> So far so good, I thought (besides a silly double/int bug in the outdated SerDe).
>
> But it turned out that some fields are NULL - within all records.
>
> Affected fields are:
>   b.batch.event,
>   b.batch.requestId,
>   b.batch.sessionId,
>   b.batch.userId
>
> I can see values in the JSON file, but neither in the "raw table" nor in
> the final table...that's really strange.
>
> An example record:
>
> {"requestTimestamp":"2014-06-19T14:25:26+02:00","context":{"libraryVersion":"0.6.13","telephony":{"radio":"gsm","carrier":"o2 - de"},"wifi":{"connected":true,"available":true},"location":{},"locale":{"carrier":"o2 - de","language":"Deutsch","country":"Deutschland"},"library":"analytics-android","device":{"brand":"htc","model":"HTC One S","sdk":16,"release":"4.1.1","manufacturer":"HTC"},"display":{"density":1.5,"width":540,"height":960},"build":{"name":"1.0","code":1},"integrations":{"Tapstream":false,"Amplitude":false,"Localytics":false,"Flurry":false,"Countly":false,"Bugsnag":false,"Quantcast":false,"Crittercism":false,"Google Analytics":false,"Mixpanel":false}},"batch":[{"timestamp":"2014-06-19T14:25:17+02:00","requestId":"32377337-3f99-4ac5-bfc6-d3654584655b","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Ruff ruff!"}},{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"fbfd45c9-cf0f-4cb3-955c-85c65220a5bd","sessionId":"75cd18db8a364c2","event":"TEST Doge Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,08"}},{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"3a643b12-64e5-4a7c-b44b-e3e09dbc5b66","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Wow..."}},{"action":"identify","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"timestamp":"2014-06-19T14:25:19+02:00","traits":{"email":"do...@mdmp.com","name":"Carmelo Doge"},"requestId":"ef2910f4-cd4f-4175-89d0-7d91b35c229f","sessionId":"75cd18db8a364c2","userId":"doge74167705ruffruff"},{"timestamp":"2014-06-19T14:25:19+02:00","requestId":"1676bb06-abee-4135-a206-d57c4a1bc24d","sessionId":"75cd18db8a364c2","event":"TEST Doge App Usage","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{}},{"timestamp":"2014-06-19T14:25:20+02:00","requestId":"66532c8a-c5da-4852-b8b6-04df8f3052d5","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Many data."}},{"timestamp":"2014-06-19T14:25:21+02:00","requestId":"a1a79d8c-fe58-4567-8dec-a8d1d2ae2713","sessionId":"75cd18db8a364c2","event":"TEST Doge Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,87"}},{"timestamp":"2014-06-19T14:25:21+02:00","requestId":"259209ac-b135-4d5f-bdac-535eccc0400e","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Wow..."}},{"timestamp":"2014-06-19T14:25:23+02:00","requestId":"59b6d57c-c7a5-4b2a-af6d-fa10ae0de60c","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Such App!"}},{"timestamp":"2014-06-19T14:25:24+02:00","requestId":"8b05226f-bdf5-4af8-bb91-84da1b874c6e","sessionId":"75cd18db8a364c2","event":"TEST Doge Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,50"}},{"timestamp":"2014-06-19T14:25:24+02:00","requestId":"0f366675-5641-4238-b2a9-176735de6edd","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Ruff ruff!"}},{"timestamp":"2014-06-19T14:25:26+02:00","requestId":"9e832534-5114-4ec1-bc20-1dcf1c354d0c","sessionId":"75cd18db8a364c2","event":"Session end","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"start":"14:25:09","end":"14:25:26"}}],"writeKey":"a8RCFSAVjmT5qyxLKMzt12kcXWOIusvw","action":"import","received_at":"2014-06-19T12:25:29.790+00:00"}
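One way to narrow this down (a sketch against the mdmp_raw_data table defined above): read the affected struct fields directly, bypassing the CTAS. Three of the four affected fields are camelCase keys in the file, and key-case handling has varied across versions of this SerDe (Hive lowercases struct-field names, so "requestId" in the JSON has to be matched against requestid in the schema), so it is worth ruling that out:

SELECT batch[0].event, batch[0].requestId, batch[0].sessionId, batch[0].userId
FROM mdmp_raw_data
LIMIT 5;

-- NULL here too would mean the SerDe never populates these struct fields
-- (pointing at parsing or key-case handling); values here would put the
-- blame on the CTAS/explode step instead.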
Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Wow..."}},{"timestamp":"2014-06-19T14:25:23+02:00","requestId":"59b6d57c-c7a5-4b2a-af6d-fa10ae0de60c","sessionId":"75cd18db8a364c2","event":"TEST > Doge > Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Such > App!"}},{"timestamp":"2014-06-19T14:25:24+02:00","requestId":"8b05226f-bdf5-4af8-bb91-84da1b874c6e","sessionId":"75cd18db8a364c2","event":"TEST > Doge > Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,50"}},{"timestamp":"2014-06-19T14:25:24+02:00","requestId":"0f366675-5641-4238-b2a9-176735de6edd","sessionId":"75cd18db8a364c2","event":"TEST > Doge > Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Ruff > ruff!"}},{"timestamp":"2014-06-19T14:25:26+02:00","requestId":"9e832534-5114-4ec1-bc20-1dcf1c354d0c","sessionId":"75cd18db8a364c2","event":"Session > end","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"start":"14:25:09","end":"14:25:26"}}],"writeKey":"a8RCFSAVjmT5qyxLKMzt12kcXWOIusvw","action":"import","received_at":"2014-06-19T12:25:29.790+00:00"} > > > Funny thing is, that I'm sure that I've seen these values earlier > today...I've reloaded the data/tables several times to see if this is still > working...well. :) > > I'm gonna stop for today...another try tomorrow. > > Thanks so far and many greetings from Berlin, > Chris > > > > > > > > > > > On Mon, Jun 23, 2014 at 6:57 PM, Sachin Goyal <sgo...@walmartlabs.com> > wrote: > >> >> You can also use hive-json-schema to automate Hive schema generation from >> JSON: >> https://github.com/quux00/hive-json-schema >> >> >> From: Nitin Pawar <nitinpawar...@gmail.com<mailto:nitinpawar...@gmail.com >> >> >> Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>" < >> user@hive.apache.org<mailto:user@hive.apache.org>> >> Date: Monday, June 23, 2014 at 2:25 AM >> To: "user@hive.apache.org<mailto:user@hive.apache.org>" < >> user@hive.apache.org<mailto:user@hive.apache.org>> >> Subject: Re: how to load json with nested array into hive? >> >> I think you can just take a look at jsonserde >> >> It does take care of nested json documents. (though you will need to know >> entire json structure upfront) >> >> Here is example of using it >> http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/ >> >> >> >> >> On Mon, Jun 23, 2014 at 2:28 PM, Christian Link <christian.l...@mdmp.com >> <mailto:christian.l...@mdmp.com>> wrote: >> Hi Jerome, >> >> thanks...I've already found "Brickhouse" and the Hive UDFs, but it didn't >> help. >> >> Today I'll try again to process the json file after going through all my >> mails...maybe I'll find a solution. 
>> On Fri, Jun 20, 2014 at 8:26 AM, Christian Link <christian.l...@mdmp.com> wrote:
>> hi,
>>
>> I'm very, very new to Hadoop, Hive, etc. and I have to import data into
>> Hive tables.
>>
>> Environment: Amazon EMR, S3, etc.
>>
>> The input file is on S3 and I copied it into my HDFS.
>>
>> 1. Flat table with one column, and loaded data into it:
>>
>> CREATE TABLE mdmp_raw_data (json_record STRING);
>> LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE INTO
>> TABLE `mdmp_raw_data`;
>>
>> That worked, I can access some data, like this:
>>
>> SELECT d.carrier, d.language, d.country
>> FROM mdmp_raw_data a
>> LATERAL VIEW json_tuple(a.json_record, 'requestTimestamp', 'context') b AS requestTimestamp, context
>> LATERAL VIEW json_tuple(b.context, 'locale') c AS locale
>> LATERAL VIEW json_tuple(c.locale, 'carrier', 'language', 'country') d AS carrier, language, country
>> LIMIT 1;
>>
>> Result: o2 - de  Deutsch  Deutschland
>>
>> I can also select the array at once:
>>
>> SELECT b.requestTimestamp, b.batch
>> FROM mdmp_raw_data a
>> LATERAL VIEW json_tuple(a.json_record, 'requestTimestamp', 'batch') b AS requestTimestamp, batch
>> LIMIT 1;
>>
>> This will give me:
>>
>> [{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"2ca08247-5542-4cb4-be7e-4a8574fb77a8","sessionId":"f29ec175ca6b7d10","event":"TEST Doge Comments","userId":"doge96514016ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Much joy."}}, ...]
>>
>> This "batch" may contain n events with a structure like the one above.
>>
>> I want to put all events in a table where each "element" is stored in a
>> unique column: timestamp, requestId, sessionId, event, userId, action,
>> context, properties.
>>
>> 2. Explode the "batch". I read a lot about SerDes, etc. - but I don't get
>> it.
>>
>> - I tried to create a table with an array and load the data into it -
>>   several errors
>> - I used explode in a query, but it doesn't accept "batch" as an array
>> - I integrated several SerDes but get things like "unknown function jspilt"
>> - I'm lost in too many documents, howtos, etc. and could use some
>>   advice...
>>
>> Thank you in advance!
>>
>> Best, Chris
>>
>> --
>> Nitin Pawar

--
----------------------------------------------------------
Good judgement comes with experience.
Experience comes with bad judgement.
----------------------------------------------------------
Roberto Congiu - Data Engineer - OpenX
tel: +1 626 466 1141