OK. I followed the Hive source code of org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the UDF. It is quite simple, and it works as expected for the simple case, but when I run it inside a more complex query, the Hive MR job fails with some strange errors. What I mean is that it fails inside the Hive code base; from the stack trace, I cannot see that the failure has anything to do with my custom code. I would appreciate it if someone could tell me what went wrong.

For example, I created a UDF called darray (short for "dynamic array"), which supports a non-constant value as the index into the array. The following query works fine, as I expected:

hive> select c_poi.provider_str as provider_str, c_poi.name as name
      from (select darray(search_results, c.index_loc) as c_poi
            from search_table lateral view explode(search_clicks) clickTable as c) a
      limit 5;
POI        xxxx
ADDRESS    some address
POI        xxxx
POI        xxxx
ADDRESS    some address

In this case, I only want the rows with provider_str = 'POI' returned, and I want to filter out any rows with provider_str != 'POI'. That sounds simple, so I changed the query to the following:

hive> select c_poi.provider_str as provider_str, c_poi.name as name
      from (select darray(search_results, c.rank) as c_poi
            from search_table lateral view explode(search_clicks) clickTable as c) a
      where c_poi.provider_str = 'POI' limit 5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 178314025) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_201212031001_0100, Tracking URL = http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100
Kill Command = /home/yzhang/hadoop/bin/hadoop job -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_0100
2012-12-12 11:45:24,090 Stage-1 map = 0%, reduce = 0%
2012-12-12 11:45:43,173 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201212031001_0100 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

All I added was a WHERE clause, but to my surprise the MR job generated by Hive failed. I am testing this on my local standalone cluster, which is running the CDH3U3 release. When I checked the hadoop userlogs, here is what I got:

2012-12-12 11:40:22,421 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: SELECT struct<_col0:bigint,_col1:string,_col2:string,_col3:string,_col4:string,_col5:string,_col6:boolean,_col7:boolean,_col8:boolean,_col9:boolean,_col10:boolean,_col11:boolean,_col12:string,_col13:string,_col14:struct<lat:double,lon:double,query_text_raw:string,query_text_normalized:string,query_string:string,llcountry:string,ipcountry:string,request_cnt:int,address:struct<country:string,state:string,zip:string,city:string,street:string,house:string>,categories_id:array<int>,categories_name:array<string>,lang_raw:string,lang_rose:string,lang:string,viewport:struct<top_lat:double,left_lon:double,bottom_lat:double,right_lon:double>>,_col15:struct<versions:int,physical_host:string,nose_request_id:string,client_type:string,ip:int,time_taken:int,user_agent:string,http_host:string,http_referrer:string,http_status:smallint,http_size:int,accept_language:string,md5:string,datacenter:string,tlv_map_data_version:string,tlv_devide_software_version:string,csid:int,rid:string,xncrid:string,cbfn:string,sources:array<struct<tm:bigint,tm_date:string,tm_time:string,md5:string,time_taken:int>>>,_col16:array<struct<provider_str:string,name:string,lat:double,lon:double,dyn:boolean,authoritative:boolean,search_center:boolean>>,_col17:array<struct<rank:int,action:int,tm:bigint,event:string,is_csid:boolean,is_rid:boolean,is_pbapi:boolean,is_nac:boolean>>,_col18:string,_col19:struct<rank:int,action:int,tm:bigint,event:string,is_csid:boolean,is_rid:boolean,is_pbapi:boolean,is_nac:boolean>>
2012-12-12 11:40:22,440 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 17 more
Caused by: java.lang.RuntimeException: Map operator initialization failed
    at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:121)
    ... 22 more
Caused by: java.lang.RuntimeException: cannot find field results from [0:_col0, 1:_col1, 2:_col2, 3:_col3, 4:_col4, 5:_col5, 6:_col6, 7:_col7, 8:_col8, 9:_col9, 10:_col10, 11:_col11, 12:_col12, 13:_col13, 14:_col14, 15:_col15, 16:_col16, 17:_col17, 18:_col18, 19:_col19]
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:321)
    at org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:119)
    at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
    at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:77)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:878)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:904)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeOp(Operator.java:374)
    at org.apache.hadoop.hive.ql.exec.LateralViewJoinOperator.initializeOp(LateralViewJoinOperator.java:109)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeOp(Operator.java:374)
    at org.apache.hadoop.hive.ql.exec.UDTFOperator.initializeOp(UDTFOperator.java:85)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:62)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeOp(Operator.java:374)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98)
    ... 22 more
2012-12-12 11:40:22,443 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

Of course, this whole table is built on a very complex nested structure (struct + array + ...; you can see the struct at the top of the stack trace) with my own custom InputFormat/SerDe classes. Everything so far has worked fine: I can query this table any way I want without seeing errors. This is the first problem I have run into when querying it the way I want, and I am not sure of the reason for the error. If you need more information, I will be happy to provide it, but right now I am stuck on the above error, and I am not sure whether it is a Hive bug or I did something wrong.
Thanks,
Yong

From: java8...@hotmail.com
To: user@hive.apache.org
Subject: Array index support non-constant expression
Date: Tue, 11 Dec 2012 17:24:52 -0500
Hi,

In our project, using Hive on the CDH3U4 release (Hive 0.7.1), I have a Hive table like the following:

Table foo (
    search_results array<struct<id:bigint,..............>>,
    search_clicks array<struct<index_loc:int, ..............>>
)

As you can see, the 2nd column, which represents the list of search results that were clicked, contains the index location of the result that was clicked (starting from 0). Now I need to find the click count of certain IDs, but those IDs only exist in the search_results struct. When I tried to write a query for this, my first attempt was the following:

select search_results[c.index_loc] from foo lateral view explode(search_clicks) clickTable as c;

But it won't work in Hive; the following error message is returned:

FAILED: Error in semantic analysis: Line 1:7 Non-constant expressions for array indexes not supported rank

It looks like Hive (at least 0.7) does NOT currently support non-constant expressions as array indexes. I searched on Google and found the following Hive JIRA ticket: https://issues.apache.org/jira/browse/HIVE-1955 It looks like someone has made the same request, but it is not supported yet. However, there is a comment in that ticket saying it can be done in a UDF. My question is: can anyone share some ideas about how to achieve this in a UDF, as it may be the only option for me right now?

Thanks
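P.S. For anyone searching the archives for the question quoted above: the idea behind a darray-style UDF is simply to do the element lookup at evaluate() time, when the index value is available, instead of requiring a compile-time constant. In a real GenericUDF this is wrapped in Hive's ObjectInspector plumbing (initialize() checks that argument 0 is a LIST and argument 1 an INT and returns the list's element ObjectInspector; evaluate() fetches the element through the ListObjectInspector). Stripped of that plumbing, the core lookup logic is just the following sketch. This is plain Java with illustrative names, not the exact code from my UDF:

```java
import java.util.Arrays;
import java.util.List;

public class DArraySketch {

    // Core of what a darray(array, index) UDF's evaluate() does:
    // return the element at a runtime-computed index, or null when the
    // array/index is null or the index is out of bounds (mirroring
    // Hive's SQL-style null semantics rather than throwing).
    public static Object dArray(List<?> array, Integer index) {
        if (array == null || index == null) {
            return null;                       // null in, null out
        }
        if (index < 0 || index >= array.size()) {
            return null;                       // out-of-bounds index yields NULL
        }
        return array.get(index);
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("POI", "ADDRESS", "POI");
        System.out.println(dArray(results, 1));    // ADDRESS
        System.out.println(dArray(results, 5));    // null
        System.out.println(dArray(results, null)); // null
    }
}
```

The null-instead-of-exception choice matters in practice: with lateral view explode() feeding the index, a single bad click record would otherwise kill the whole map task.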