Hi Nikolaos +user@hive list
Hive does not run a Tez job here because of the fetch task optimization, which fetches the data directly and runs it through the operator pipeline for a specific set of queries. If you want to disable it completely, try "set hive.fetch.task.conversion=none". If you want it to trigger only for much smaller data sizes, lower the value of hive.fetch.task.conversion.threshold.

Thanks
Prasanth

On Jun 28, 2018, at 10:50 AM, Nikolaos Tsipas <nicktg...@gmail.com> wrote:

Hi,

I'm using Tez with Hive to query data on S3 and I notice the following two cases.

Case A

When the query covers a smaller amount of data, a Tez job (YARN application) is not created:

    select dt from my_db_schema.my_table where dt in ('2018-03-10','2018-03-09') and header ='xxx';

The output in the above case is:

    OK
    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    2018-03-10
    2018-03-10
    2018-03-09
    2018-03-09
    Time taken: 7.043 seconds, Fetched: 4 row(s)

Case B

When the query scans more data:

    select dt from my_db_schema.my_table where header ='xxx';

then the output is as follows, and I can see a Tez job logged in the Tez UI and in YARN:

    ----------------------------------------------------------------------------------------------
            VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container     SUCCEEDED     22         22        0        0       0       0
    ----------------------------------------------------------------------------------------------
    VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 38.12 s
    ----------------------------------------------------------------------------------------------
    OK
    2018-03-05
    2018-03-05
    2018-03-06
    2018-03-06
    2018-03-07
    2018-03-07
    2018-03-08
    2018-03-08
    2018-03-09
    2018-03-09
    2018-03-10
    2018-03-10
    2018-03-25
    2018-03-25
    2018-03-26
    2018-03-26
    2018-03-28
    2018-03-28
    2018-05-09
    2018-05-09
    2018-05-10
    2018-05-10
    Time taken: 47.197 seconds, Fetched: 22 row(s)

The problem in case A is that sometimes Hive decides not to trigger a Tez job and the query takes a long time to complete. In this case the worker nodes are not utilised at all; only the master node executes the query.

Is there a way to force Hive to always trigger a Tez job?
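
For reference, the two settings named in the reply at the top of this thread could be applied in a Hive session roughly as follows. This is only a sketch: the threshold value of 1048576 bytes (1 MB) is an illustrative assumption, not a recommendation, and the two options are alternatives rather than steps to run together.

```sql
-- Print the current value of the fetch task conversion setting.
set hive.fetch.task.conversion;

-- Option 1: disable the fetch task optimization for this session,
-- so that simple selects are also submitted as Tez jobs.
set hive.fetch.task.conversion=none;

-- Option 2 (alternative to option 1): keep the optimization but restrict it
-- to small inputs. The threshold is an input size in bytes; 1048576 is an
-- assumed example value.
-- set hive.fetch.task.conversion.threshold=1048576;

-- Re-run the Case A query with the setting in effect.
select dt from my_db_schema.my_table where dt in ('2018-03-10','2018-03-09') and header ='xxx';
```

A `set` issued this way applies only to the current session; to make it the default it would have to go into the Hive configuration instead.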