John P. Petrakis created HIVE-12412:
---------------------------------------

             Summary: Multi insert queries fail to run properly in hive 1.1.x 
or later.
                 Key: HIVE-12412
                 URL: https://issues.apache.org/jira/browse/HIVE-12412
             Project: Hive
          Issue Type: Bug
    Affects Versions: 1.1.0, 1.2.0
            Reporter: John P. Petrakis


We use multi insert queries to take data in one table and manipulate it by 
inserting it into a results table.  Queries are of this form:

from (select * from data_table lateral view explode(data_table.f2) f2 as 
explode_f2) as explode_data_table  
       insert overwrite table results_table partition (q_id='C.P1',rl='1') 
       select 
       array(cast(if(explode_data_table.f1 is null or explode_data_table.f1='', 
'UNKNOWN',explode_data_table.f1) as String),cast(explode_f2.s1 as String)) as 
dimensions, 
       ARRAY(CAST(sum(explode_f2.d1) as Double)) as metrics, 
       null as rownm 
       where (explode_data_table.date_id between 20151016 and 20151016)
       group by 
       if(explode_data_table.f1 is null or explode_data_table.f1='', 
'UNKNOWN',explode_data_table.f1),
       explode_f2.s1         
       INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P2',rl='0') 
       SELECT ARRAY(CAST('Total' as String),CAST('Total' as String)) AS 
dimensions, 
       ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics, 
       null AS rownm 
       WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016) 
       INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P5',rl='0') 
       SELECT 
       ARRAY(CAST('Total' as String)) AS dimensions, 
       ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics, 
       null AS rownm 
       WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)

This query is meant to total a given field of a struct that is potentially a 
list of structs.  For our test data set, which consists of a single row, the 
summation yields "Null",  with messages in the hive log of the nature:

Missing fields! Expected 2 fields but only got 1! Ignoring similar problems.
or "Extra fields detected..."

For significantly more data, this query will eventually cause a run time error 
while processing a column (caused by array index out of bounds exception in one 
of the lazy binary classes such as LazyBinaryString or LazyBinaryStruct).

Using the query above from the hive command line, the following data was used:
(note there are tabs in the data below)

string one      one:1.0:1.00:10.0,eon:1.0:1.00:100.0
string two      two:2.0:2.00:20.0,otw:2.0:2.00:20.0,wott:2.0:2.00:20.0
string thr      three:3.0:3.00:30.0
string fou      four:4.0:4.00:40.0

There are two fields, a string, (eg. 'string one') and a list of structs.  The 
following is used to create the table:

create table if not exists t1 (
                         f1 string, 
                          f2 
array<struct<s1:string,d1:double,d2:double,d3:double>>
                         )
                          partitioned by (clid string, date_id string) 
                          row format delimited fields 
                         terminated by '09' 
                         collection items terminated by ',' 
                         map keys terminated by ':'
                         lines terminated by '10' 
                         location '/user/hive/warehouse/t1';

And the following is used to load the data:

load data local inpath '/path/to/data/file/cplx_test.data2' OVERWRITE  into 
table t1  partition(client_id='987654321',date_id='20151016');

The resulting table should yield the following:
["string fou","four"]   [4.0]   null    C.P1    1       
["string one","eon"]    [1.0]   null    C.P1    1       
["string one","one"]    [1.0]   null    C.P1    1       
["string thr","three"]  [3.0]   null    C.P1    1       
["string two","otw"]    [2.0]   null    C.P1    1       
["string two","two"]    [2.0]   null    C.P1    1       
["string two","wott"]   [2.0]   null    C.P1    1       
["Total","Total"]       [15.0]  null    C.P2    0       
["Total"]       [15.0]  null    C.P5    0       

However what we get is:
Hive Runtime Error while processing row 
{"_col2":2.5306499719322744E-258,"_col3":""} (ultimately due to an array index 
out of bounds exception)

If we reduce the above data to a SINGLE row, the we don't get an exception but 
the total fields come out as NULL.

The ONLY way this query would work is 
1) if I added a group by (date_id) or even group by ('') as the last line in 
the query... or removed the last where clause for the final insert.  (The 
reason why we have a where clause on the initial select is due to push down 
predicates not working... and causing a complete scan of the entire table).

or 
2) removing the group by clauses entirely 

or using
3) hive.multigroupby.singlereducer = false

Once 3) is used no modification of the query is needed... it all runs.  This is 
very similar in nature to the following:
https://issues.apache.org/jira/browse/HIVE-2750
which is supposedly fixed and closed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to