ok - same results...so at least this seems to be stable ;) I'll try to find a clever way to get the values of "integrations".
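One idea I want to try, as a rough and untested sketch (it assumes explode() accepts the integrations map the same way it does the batch array, and it reuses the table from the thread below): explode the map into key/value rows instead of naming every key, which should also sidestep the lower-cased keys:

SELECT DISTINCT int_name, int_enabled
FROM mdmp_raw_data a
LATERAL VIEW explode(a.context.integrations) i AS int_name, int_enabled;

If separate columns per integration are still needed, the per-key lookups would presumably have to use the lower-cased keys the SerDe produced, e.g. a.context.integrations['google analytics'] instead of ['Google Analytics'].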
On Tue, Jun 24, 2014 at 12:37 PM, Christian Link <christian.l...@mdmp.com> wrote:
> Nope!
> Still not working...new error
>
> Caused by: java.lang.ClassCastException:
> org.apache.hadoop.io.DoubleWritable cannot be cast to
> org.apache.hadoop.hive.serde2.io.DoubleWritable
>
> Caused by query:
> SELECT DISTINCT
>   a.context.device.brand as brand,
>   a.context.device.manufacturer as manufacturer,
>   a.context.device.model as model,
>   a.context.device.release as release,
>   a.context.device.sdk as sdk,
>   *a.context.display.density as density*,
>   a.context.display.height as height,
>   a.context.display.width as width
> FROM mdmp_raw_data a
> LATERAL VIEW explode(a.batch) b AS batch
> WHERE a.requestTimestamp = '2014-06-19T14:08:15+02:00'
>   AND a.received_at = '2014-06-19T12:08:18.734+00:00';
>
> Table structure
>
> context    struct<
>   build:map<string,string>,
>   device:struct<brand:string,manufacturer:string,model:string,release:string,sdk:int>,
>   display:struct<*density:double*,height:int,width:int>,
>   integrations:map<string,boolean>,
>   library:string,libraryversion:string,
>   locale:map<string,string>,
>   location:map<string,string>,
>   telephony:map<string,string>,
>   wifi:map<string,boolean>
> >    from deserializer
>
> It's "density" which is causing the problem.
> When I leave density out of the query, it runs fine.
>
> Also, the values in "integrations" are NULL...which had been NOT NULL yesterday...strange.
>
> table mdmp_raw_data (data imported using the 1.1.9.2 SerDe):
>
> SELECT DISTINCT
>   a.context.integrations
> FROM mdmp_raw_data a
> LATERAL VIEW explode(a.batch) b AS batch
> WHERE a.requestTimestamp = '2014-06-19T14:08:15+02:00'
>   AND a.received_at = '2014-06-19T12:08:18.734+00:00';
>
> {"tapstream":false,"amplitude":false,"flurry":false,"quantcast":false,"countly":false,"mixpanel":false,"crittercism":false,"google analytics":false,"bugsnag":false,"localytics":false}
>
> json file:
> "integrations":{"Tapstream":false,"Amplitude":false,"Localytics":false,"Flurry":false,"Countly":false,"Bugsnag":false,"Quantcast":false,"Crittercism":false,"Google Analytics":false,"Mixpanel":false}}
>
> So the keys had been changed to LOWER CASE...hm, not sure if this makes sense...and WHY it happened.
>
> Again...I will import the data again and run the same queries...let's see what will change next.
>
> THANKS a lot for all the support.
>
> Best,
> Chris
>
> On Tue, Jun 24, 2014 at 12:11 PM, Christian Link <christian.l...@mdmp.com> wrote:
>
>> I've got 1.1.9.2 (thanks to Roberto) and the data is looking better...
>> I'll test the "double/int" thingie now.
>>
>> Best,
>> Chris
>>
>> On Mon, Jun 23, 2014 at 8:51 PM, Swagatika Tripathy <swagatikat...@gmail.com> wrote:
>>
>>> Hi,
>>> Use the 1.9.3 JSON SerDe jar with dependencies. It's the latest one, I suppose.
>>>
>>> Thanks
>>> Swagatika
>>> On Jun 23, 2014 11:57 PM, "Roberto Congiu" <roberto.con...@openx.com> wrote:
>>>
>>>> Hi,
>>>> 1.1.4 is an oldish version of the JSON SerDe, have you tried the most recent one from the master branch?
>>>>
>>>> On Mon, Jun 23, 2014 at 10:23 AM, Christian Link <christian.l...@mdmp.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> thanks...but I need to sort things out with ONE SerDe/strategy...
>>>>> I've started with André's idea by using Roberto Congiu's SerDe and André's template to create a table with the right schema and load the data afterwards.
>>>>>
>>>>> But it's not completely working...
>>>>>
>>>>> I did the following (sorry for spamming...):
>>>>>
>>>>> 1. create table and load data
>>>>>
>>>>> -- create database (if not exists)
>>>>> CREATE DATABASE IF NOT EXISTS mdmp_api_dump;
>>>>>
>>>>> -- connect to database
>>>>> USE mdmp_api_dump;
>>>>>
>>>>> -- add SerDe for json processing
>>>>> ADD JAR /home/hadoop/lib/hive/json-serde-1.1.4-jar-with-dependencies.jar;
>>>>>
>>>>> -- drop old raw data
>>>>> DROP TABLE IF EXISTS mdmp_raw_data;
>>>>>
>>>>> -- create raw data table
>>>>> CREATE TABLE mdmp_raw_data (
>>>>>   action string,
>>>>>   batch array<
>>>>>     struct<
>>>>>       timestamp:string,
>>>>>       traits:map<string,string>,
>>>>>       requestId:string,
>>>>>       sessionId:string,
>>>>>       event:string,
>>>>>       userId:string,
>>>>>       action:string,
>>>>>       context:map<string,string>,
>>>>>       properties:map<string,string>
>>>>>     >
>>>>>   >,
>>>>>   context struct<
>>>>>     build:map<string,string>,
>>>>>     device:struct<
>>>>>       brand:string,
>>>>>       manufacturer:string,
>>>>>       model:string,
>>>>>       release:string,
>>>>>       sdk:int
>>>>>     >,
>>>>>     display:struct<
>>>>>       density:double,
>>>>>       height:int,
>>>>>       width:int
>>>>>     >,
>>>>>     integrations:map<string,boolean>,
>>>>>     library:string,
>>>>>     libraryVersion:string,
>>>>>     locale:map<string,string>,
>>>>>     location:map<string,string>,
>>>>>     telephony:map<string,string>,
>>>>>     wifi:map<string,boolean>
>>>>>   >,
>>>>>   received_at string,
>>>>>   requestTimestamp string,
>>>>>   writeKey string
>>>>> )
>>>>> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
>>>>> STORED AS TEXTFILE;
>>>>>
>>>>> -- load data
>>>>> LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE INTO TABLE `mdmp_raw_data`;
>>>>>
>>>>> 2. run query against the "raw data" and create "formatted table":
>>>>>
>>>>> ADD JAR /home/hadoop/lib/hive/json-serde-1.1.4-jar-with-dependencies.jar;
>>>>>
>>>>> USE mdmp_api_dump;
>>>>>
>>>>> DROP TABLE IF EXISTS mdmp_api_data;
>>>>>
>>>>> CREATE TABLE mdmp_api_data AS
>>>>> SELECT DISTINCT
>>>>>   a.action,
>>>>>   a.received_at,
>>>>>   a.requestTimestamp,
>>>>>   a.writeKey,
>>>>>   a.context.device.brand as brand,
>>>>>   a.context.device.manufacturer as manufacturer,
>>>>>   a.context.device.model as model,
>>>>>   a.context.device.release as release,
>>>>>   a.context.device.sdk as sdk,
>>>>>   -- a.context.display.density as density,
>>>>>   a.context.display.height as height,
>>>>>   a.context.display.width as width,
>>>>>   a.context.telephony['radio'] as tel_radio,
>>>>>   a.context.telephony['carrier'] as tel_carrier,
>>>>>   a.context.wifi['connected'] as wifi_connected,
>>>>>   a.context.wifi['available'] as wifi_available,
>>>>>   a.context.locale['carrier'] as loce_carrier,
>>>>>   a.context.locale['language'] as loce_language,
>>>>>   a.context.locale['country'] as loce_country,
>>>>>   a.context.integrations['Tapstream'] as int_tapstream,
>>>>>   a.context.integrations['Amplitude'] as int_amplitude,
>>>>>   a.context.integrations['Localytics'] as int_localytics,
>>>>>   a.context.integrations['Flurry'] as int_flurry,
>>>>>   a.context.integrations['Countly'] as int_countly,
>>>>>   a.context.integrations['Quantcast'] as int_quantcast,
>>>>>   a.context.integrations['Crittercism'] as int_crittercism,
>>>>>   a.context.integrations['Google Analytics'] as int_googleanalytics,
>>>>>   a.context.integrations['Mixpanel'] as int_mixpanel,
>>>>>   b.batch.action AS b_action,
>>>>>   b.batch.context,
>>>>>   b.batch.event,
>>>>>   b.batch.properties,
>>>>>   b.batch.requestId,
>>>>>   b.batch.sessionId,
>>>>>   b.batch.timestamp,
>>>>>   b.batch.traits,
>>>>>   b.batch.userId
>>>>> FROM mdmp_raw_data a
>>>>> LATERAL VIEW explode(a.batch) b AS batch;
>>>>>
>>>>> So far so good... (besides a silly double/int bug in the outdated SerDe), I thought.
>>>>>
>>>>> But it turned out that some fields are NULL - within all records.
>>>>>
>>>>> Affected fields are:
>>>>> b.batch.event,
>>>>> b.batch.requestId,
>>>>> b.batch.sessionId,
>>>>> b.batch.userId
>>>>>
>>>>> I can see values in the json file, but neither in the "raw table" nor in the final table...that's really strange.
>>>>>
>>>>> An example record:
>>>>> {"requestTimestamp":"2014-06-19T14:25:26+02:00","context":{"libraryVersion":"0.6.13","telephony":{"radio":"gsm","carrier":"o2 - de"},"wifi":{"connected":true,"available":true},"location":{},"locale":{"carrier":"o2 - de","language":"Deutsch","country":"Deutschland"},"library":"analytics-android","device":{"brand":"htc","model":"HTC One S","sdk":16,"release":"4.1.1","manufacturer":"HTC"},"display":{"density":1.5,"width":540,"height":960},"build":{"name":"1.0","code":1},"integrations":{"Tapstream":false,"Amplitude":false,"Localytics":false,"Flurry":false,"Countly":false,"Bugsnag":false,"Quantcast":false,"Crittercism":false,"Google Analytics":false,"Mixpanel":false}},"batch":[{"timestamp":"2014-06-19T14:25:17+02:00","requestId":"32377337-3f99-4ac5-bfc6-d3654584655b","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Ruff ruff!"}},{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"fbfd45c9-cf0f-4cb3-955c-85c65220a5bd","sessionId":"75cd18db8a364c2","event":"TEST Doge Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,08"}},{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"3a643b12-64e5-4a7c-b44b-e3e09dbc5b66","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Wow..."}},{"action":"identify","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"timestamp":"2014-06-19T14:25:19+02:00","traits":{"email":"do...@mdmp.com","name":"Carmelo Doge"},"requestId":"ef2910f4-cd4f-4175-89d0-7d91b35c229f","sessionId":"75cd18db8a364c2","userId":"doge74167705ruffruff"},{"timestamp":"2014-06-19T14:25:19+02:00","requestId":"1676bb06-abee-4135-a206-d57c4a1bc24d","sessionId":"75cd18db8a364c2","event":"TEST Doge App Usage","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{}},{"timestamp":"2014-06-19T14:25:20+02:00","requestId":"66532c8a-c5da-4852-b8b6-04df8f3052d5","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Many data."}},{"timestamp":"2014-06-19T14:25:21+02:00","requestId":"a1a79d8c-fe58-4567-8dec-a8d1d2ae2713","sessionId":"75cd18db8a364c2","event":"TEST Doge Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,87"}},{"timestamp":"2014-06-19T14:25:21+02:00","requestId":"259209ac-b135-4d5f-bdac-535eccc0400e","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Wow..."}},{"timestamp":"2014-06-19T14:25:23+02:00","requestId":"59b6d57c-c7a5-4b2a-af6d-fa10ae0de60c","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Such App!"}},{"timestamp":"2014-06-19T14:25:24+02:00","requestId":"8b05226f-bdf5-4af8-bb91-84da1b874c6e","sessionId":"75cd18db8a364c2","event":"TEST Doge Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,50"}},{"timestamp":"2014-06-19T14:25:24+02:00","requestId":"0f366675-5641-4238-b2a9-176735de6edd","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Ruff ruff!"}},{"timestamp":"2014-06-19T14:25:26+02:00","requestId":"9e832534-5114-4ec1-bc20-1dcf1c354d0c","sessionId":"75cd18db8a364c2","event":"Session end","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"start":"14:25:09","end":"14:25:26"}}],"writeKey":"a8RCFSAVjmT5qyxLKMzt12kcXWOIusvw","action":"import","received_at":"2014-06-19T12:25:29.790+00:00"}
>>>>>
>>>>> Funny thing is that I'm sure I've seen these values earlier today...I've reloaded the data/tables several times to see if this is still working...well. :)
>>>>>
>>>>> I'm gonna stop for today...another try tomorrow.
>>>>>
>>>>> Thanks so far and many greetings from Berlin,
>>>>> Chris
>>>>>
>>>>> On Mon, Jun 23, 2014 at 6:57 PM, Sachin Goyal <sgo...@walmartlabs.com> wrote:
>>>>>
>>>>>> You can also use hive-json-schema to automate Hive schema generation from JSON:
>>>>>> https://github.com/quux00/hive-json-schema
>>>>>>
>>>>>> From: Nitin Pawar <nitinpawar...@gmail.com>
>>>>>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>>>>>> Date: Monday, June 23, 2014 at 2:25 AM
>>>>>> To: "user@hive.apache.org" <user@hive.apache.org>
>>>>>> Subject: Re: how to load json with nested array into hive?
>>>>>>
>>>>>> I think you can just take a look at jsonserde
>>>>>>
>>>>>> It does take care of nested json documents (though you will need to know the entire json structure upfront).
>>>>>>
>>>>>> Here is an example of using it:
>>>>>> http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
>>>>>>
>>>>>> On Mon, Jun 23, 2014 at 2:28 PM, Christian Link <christian.l...@mdmp.com> wrote:
>>>>>> Hi Jerome,
>>>>>>
>>>>>> thanks...I've already found "Brickhouse" and the Hive UDFs, but it didn't help.
>>>>>>
>>>>>> Today I'll try again to process the json file after going through all my mails...maybe I'll find a solution.
>>>>>>
>>>>>> Best,
>>>>>> Chris
>>>>>>
>>>>>> On Fri, Jun 20, 2014 at 7:16 PM, Jerome Banks <jba...@tagged.com> wrote:
>>>>>> Christian,
>>>>>>   Sorry to spam this newsgroup, and this is not a commercial endorsement, but check out the Hive UDFs in the Brickhouse project
>>>>>> ( http://github.com/klout/brickhouse )
>>>>>> ( http://brickhouseconfessions.wordpress.com/2014/02/07/hive-and-json-made-simple/ )
>>>>>>
>>>>>> You can convert arbitrarily complex Hive structures to and from JSON with its to_json and from_json UDFs. See the blog posting for an explanation.
>>>>>>
>>>>>> -- jerome
>>>>>>
>>>>>> On Fri, Jun 20, 2014 at 8:26 AM, Christian Link <christian.l...@mdmp.com> wrote:
>>>>>> hi,
>>>>>>
>>>>>> I'm very, very new to Hadoop, Hive, etc. and I have to import data into hive tables.
>>>>>>
>>>>>> Environment: Amazon EMR, S3, etc.
>>>>>>
>>>>>> The input file is on S3 and I copied it into my HDFS.
>>>>>>
>>>>>> 1. flat table with one column and loaded data into it:
>>>>>>
>>>>>> CREATE TABLE mdmp_raw_data (json_record STRING);
>>>>>> LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE INTO TABLE `mdmp_raw_data`;
>>>>>>
>>>>>> That worked, I can access some data, like this:
>>>>>>
>>>>>> SELECT d.carrier, d.language, d.country
>>>>>> FROM mdmp_raw_data a LATERAL VIEW json_tuple(a.data, 'requestTimestamp', 'context') b AS requestTimestamp, context
>>>>>> LATERAL VIEW json_tuple(b.context, 'locale') c AS locale
>>>>>> LATERAL VIEW json_tuple(c.locale, 'carrier', 'language', 'country') d AS carrier, language, country
>>>>>> LIMIT 1;
>>>>>>
>>>>>> Result: o2 - de   Deutsch   Deutschland
>>>>>>
>>>>>> I can also select the array at once:
>>>>>>
>>>>>> SELECT b.requestTimestamp, b.batch
>>>>>> FROM mdmp_raw_data a
>>>>>> LATERAL VIEW json_tuple(a.data, 'requestTimestamp', 'batch') b AS requestTimestamp, batch
>>>>>> LIMIT 1;
>>>>>>
>>>>>> This will give me:
>>>>>>
>>>>>> [{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"2ca08247-5542-4cb4-be7e-4a8574fb77a8","sessionId":"f29ec175ca6b7d10","event":"TEST Doge Comments","userId":"doge96514016ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Much joy."}}, ...]
>>>>>>
>>>>>> This "batch" may contain n events with a structure like the above.
>>>>>>
>>>>>> I want to put all events in a table where each "element" will be stored in a unique column: timestamp, requestId, sessionId, event, userId, action, context, properties
>>>>>>
>>>>>> 2. explode the "batch"
>>>>>> I read a lot about SerDe, etc. - but I don't get it.
>>>>>>
>>>>>> - I tried to create a table with an array and load the data into it - several errors
>>>>>> - used explode in the query, but it doesn't accept "batch" as an array
>>>>>> - integrated several SerDes but get things like "unknown function jspilt"
>>>>>> - I'm lost in too many documents, howtos, etc. and could use some advice...
>>>>>>
>>>>>> Thank you in advance!
>>>>>>
>>>>>> Best, Chris
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>> --
>>>> ----------------------------------------------------------
>>>> Good judgement comes with experience.
>>>> Experience comes with bad judgement.
>>>> ----------------------------------------------------------
>>>> Roberto Congiu - Data Engineer - OpenX
>>>> tel: +1 626 466 1141
>>>
>>
>
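For the density ClassCastException further up in the thread, one possible workaround (an untested sketch, and only an assumption that the old SerDe's double handling is the culprit) would be to declare density as string in the raw table, i.e. display:struct<density:string,height:int,width:int>, and cast it back at query time:

SELECT DISTINCT
  CAST(a.context.display.density AS DOUBLE) as density,
  a.context.display.height as height,
  a.context.display.width as width
FROM mdmp_raw_data a
LATERAL VIEW explode(a.batch) b AS batch;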