Optimizing limit 0 / null scan cases (HIVE-7203)

Abhilash L L Thu, 18 Aug 2016 22:37:50 -0700

Hello,

    We are on hive 1.2.x.


    We are generating queries of the form (example)

    CREATE TABLE T .....  AS
    WITH
     T1 AS (SELECT * FROM S1.A),
     T2 AS (SELECT * FROM S2.B WHERE x > 1000),
     T3 AS (SELECT * FROM T1 JOIN T2 ON T1.x=T2.y AND T1.z>T2.z)
     SELECT * from T3

     We are generating much larger queries than above and we also use UDFs
which are added on the fly and registered as temporary functions. However,
since the above syntax does not allow specifying partitions, we will need
to create the table before hand and then run the same query again to
insert. Also we need to extract the types of the output of the final query.
Lot of tools also are doing the same.

     To extract the types for the create table, we just added LIMIT 0
assuming it would skip running MR/Tez jobs and return fast.

     When on MR its launching all the jobs and finally throws away the
records, which is taking 10+ minutes for the query/cluster combination.
When on Tez its much much faster but still high.

     We tried putting LIMIT 0 and/or WHERE 1=2 always false in the query to
make it skip faster, but its not skipping launching mr jobs.

     We came across the ticket
https://issues.apache.org/jira/browse/HIVE-7203, which says the
optimisation applies only to the outer most query even when specified in
the inner queries.

     Is there any workaround which works when execution engine is either mr
or tez.

     Help / guidance would be very helpful. Please let me know if I have to
share any other information

Regards,
Abhilash

Optimizing limit 0 / null scan cases (HIVE-7203)

Reply via email to