Thanks Siddharth, at first glance I couldn't find an option in Hive to disable split grouping, but I will check and possibly try the min/max setting for the split size.
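
A minimal sketch of why equal min and max sizes should pin the number of groups, assuming a simplified model in which the per-group size is first derived from a cluster-dependent target and then clamped to [min, max]. This is not Tez's actual grouping code: `group_count` and all sizes below are hypothetical, and the model ignores locality and the block-location ordering mentioned in the quoted reply.

```python
def group_count(total_bytes, desired_groups, min_size, max_size):
    """Simplified model: clamp the per-group size to [min_size, max_size]."""
    # desired_groups stands in for whatever the cluster state suggests.
    per_group = total_bytes // max(desired_groups, 1)
    if per_group > max_size:
        per_group = max_size
    elif per_group < min_size:
        per_group = min_size
    # Ceiling division: number of groups needed to cover the input.
    return -(-total_bytes // per_group)


total = 10 * 1024**3   # 10 GiB of input (hypothetical)
size = 128 * 1024**2   # 128 MiB used for both min and max (hypothetical)

# Same min and max: the cluster-dependent target no longer matters.
pinned = {group_count(total, d, size, size) for d in (1, 10, 100, 1000)}

# Different min and max: the target leaks into the result.
varying = {group_count(total, d, 16 * 1024**2, 1024**3) for d in (1, 10, 100, 1000)}

print(pinned, varying)
```

In this model the clamp collapses to a single value when min equals max, so the group count depends only on the data size.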
Thanks a lot

Fabio

On Thu, Feb 19, 2015 at 11:02 AM, Siddharth Seth <ss...@apache.org> wrote:

> Fabio,
> One of the simplest ways to achieve this is to disable split grouping
> completely. You may end up with a large number of tasks in this case
> though. This gets rid of the dynamic split generation based on cluster
> node. (You'll have to check with Hive on how to disable this).
> Other than this, setting min/max-size to the same value should produce the
> desired results; there can be some variances in the groups generated though
> - based on the order in which HDFS gives back its block locations.
>
>
> On Thu, Feb 19, 2015 at 1:47 AM, Fabio C. <anyte...@gmail.com> wrote:
>
>> Hi everyone,
>> I see that Hive on Tez dynamically chooses the number of tasks to launch
>> for each vertex in the generated DAG according to cluster load (other than
>> data size).
>> For research purposes I'd like to avoid this feature since I need every
>> query (running on the same datasets) to be executed with the same number of
>> tasks, regardless of the state of the cluster (if I run query X, n tasks
>> have to be allocated in any case).
>> At this point I can't make tests with heavy workloads, so I want to ask
>> you if you think setting tez.am.grouping.min-size and
>> tez.am.grouping.max-size to the same value can do the trick, or if you have
>> any better suggestion to achieve this behavior.
>> Other than this feature, is there anything else that could change the
>> number of splits across different runs of the same query?
>>
>> Thanks a lot
>>
>> Fabio