It might be worth noting that this query is already a work-around: get_json_object() doesn't return an array when the key we specify points to an array; it returns a string instead. The BI user wants to do the following:
    select get_json_object(roster_json, '$.memberList.playerId') players

He wants the result in the "players" column to be an actual array. The output is instead a plain old string that looks like "[player1,player2,...,player8]". That's why he's building an array in the inner query that he subsequently explodes in the outer one. We think this is probably a bug, but we're really not sure. Does anyone have any feedback? Does it look like a bug, or an error on the part of our BI team?

On Thu, Jan 19, 2012 at 1:27 PM, Time Less <[email protected]> wrote:

> We are running this query:
>
> select name, sum_id
> from (
>   select name, players,
>     array(player1, player2, player3, player4,
>           player5, player6, player7, player8) arr
>   from (
>     select name,
>       get_json_object(roster_json, '$.memberList.playerId') players,
>       get_json_object(roster_json, '$.memberList.playerId[0]') player1,
>       get_json_object(roster_json, '$.memberList.playerId[1]') player2,
>       get_json_object(roster_json, '$.memberList.playerId[2]') player3,
>       get_json_object(roster_json, '$.memberList.playerId[3]') player4,
>       get_json_object(roster_json, '$.memberList.playerId[4]') player5,
>       get_json_object(roster_json, '$.memberList.playerId[5]') player6,
>       get_json_object(roster_json, '$.memberList.playerId[6]') player7,
>       get_json_object(roster_json, '$.memberList.playerId[7]') player8
>     from team
>     -- limit 1000000 -- some large number to make sure it runs
>   ) t2
>   where player8 is not null -- 8 members
> ) t1
> lateral view explode(arr) member as sum_id
> ;
>
> Notice the commented-out LIMIT in the innermost subquery. When we attempt
> to run the query, we get this error:
>
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_201112131753_22484, Tracking URL =
> http://laxhadoop1-001:50030/jobdetails.jsp?jobid=job_201112131753_22484
> Kill Command = /usr/lib/hadoop/bin/hadoop job
> -Dmapred.job.tracker=laxhadoop1-001:54311 -kill job_201112131753_22484
> 2012-01-17 11:06:21,579 Stage-1 map = 0%, reduce = 0%
> 2012-01-17 11:06:45,745 Stage-1 map = 100%, reduce = 100%
> Ended Job = job_201112131753_22484 with errors
> java.lang.RuntimeException: Error while reading from task log url
>     at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
>     at org.apache.hadoop.hive.ql.exec.ExecDriver.showJobFailDebugInfo(ExecDriver.java:889)
>     at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:680)
>     at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
>     at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
>     at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
>     at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
>     at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
>     at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:513)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Caused by:
> java.io.IOException: Server returned HTTP response code: 400 for URL:
> http://laxhadoop1-004:50060/tasklog?taskid=attempt_201112131753_22484_m_000000_0&all=true
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
>     at java.net.URL.openStream(URL.java:1010)
>     at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
>     ... 16 more
> Ended Job = job_201112131753_22484 with exception
> 'java.lang.RuntimeException(Error while reading from task log url)'
> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
>
> If we uncomment the LIMIT clause, then the query *runs WITHOUT ERROR* and
> gives us the output we expect. However, the innermost subquery is about
> 150k rows of data, so the LIMIT has no effect other than working around
> this bug. I don't know the command to determine the version of Hive we're
> running, so hopefully seeing the installed RPM will determine that:
>
> $ rpm -qa | grep hive
> hadoop-hive-0.7.1+42.4-2
>
> Does anyone have a clue about what's going on here?
>
> --
> Tim Ellis
> Riot Games

--
Tim
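For reference, here is a minimal sketch of an alternative work-around that splits the string get_json_object() returns instead of indexing each element. It is untested against the failure above and assumes the playerId values contain no embedded commas, quotes, or brackets; the table and column names (team, roster_json) are the ones from the quoted query.

    -- Sketch only: treat the returned JSON array as plain text,
    -- strip the [, ] and " characters, then split on commas.
    select name, sum_id
    from (
      select name,
             split(
               regexp_replace(
                 get_json_object(roster_json, '$.memberList.playerId'),
                 '\\[|\\]|"', ''),
               ',') arr
      from team
    ) t
    lateral view explode(arr) member as sum_id;

This keeps the array construction in a single expression and avoids the fixed cap of eight players, but it still treats the JSON as a flat string, so it is only as robust as the input format.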
