Hi Gopal,

Thank you for your reply. I checked out some of the code following your suggestions and would like to double-check my understanding with you.
On Wed, Jun 10, 2015 at 12:28 AM, Gopal Vijayaraghavan <gop...@apache.org> wrote:

> Hi,
>
> There’s no relationship between number of containers and tasks – well the
> number of tasks is the maximum number of containers you can use.
>
> You can run an entire vertex containing many task attempts in one
> container if there are no more available – because of container reuse.
>
> The memory/cpu settings are actually setup via a configuration parameter –
> hive.tez.container.size.

My understanding is that this setting covers only memory; I couldn't find a corresponding setting for CPU. What is the default CPU allocation, then? Is it one CPU per container?

> The Vertex is expanded into multiple tasks – The number of map-tasks are
> determined by the split-grouping
> (tez.grouping.min-size/tez.grouping.split-waves) and the reducers are
> estimated from the ReduceSink statistics (divided by
> hive.exec.bytes.per.reducer).

So # of reducers = size of the data to process / bytes per reducer, rounded up to the nearest power of 2?

> Even the reducer number is not final, since the plan-time value is only
> the max value for that - you can schedule 1009 reducers and end up only
> running 11, with Tez auto-reducer parallelism, which only merges adjacent
> reducers.

I'm not quite sure I understand this. How does the actual reducer number differ from the plan-time value (which I assume is the value calculated as data size divided by bytes per reducer)? My guess would be that the max value is computed assuming mappers do not filter out any records, and at runtime the number is recalculated based on how much data you actually end up with. Is that correct? Also, what is Tez auto-reducer parallelism, and where can I find the corresponding code?

> This is split between the Tez SplitGrouper, HiveSplitGenerator and
> SetReducerParallelism.

tez/SplitGrouper and HiveSplitGenerator for mappers, and SetReducerParallelism for reducers, correct?
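To make sure I'm reading this right, here is a small sketch of how I currently picture the two-stage estimate: a plan-time count from input statistics, then a runtime merge of adjacent reducers once the mappers' actual output size is known. This is my own simplification, not Hive/Tez source code; the function names and the 256 MB bytes-per-reducer default are assumptions purely for illustration.

```python
# My understanding of the two-stage reducer estimate -- a hypothetical
# sketch, not actual Hive/Tez code. Names and defaults are assumptions.

BYTES_PER_REDUCER = 256 * 1024 * 1024  # illustrative hive.exec.bytes.per.reducer

def plan_time_reducers(estimated_input_bytes, max_reducers=1009):
    """Plan-time estimate from table statistics: ceil(size / bytes-per-reducer),
    capped at some maximum. This is the value that can overshoot."""
    n = (estimated_input_bytes + BYTES_PER_REDUCER - 1) // BYTES_PER_REDUCER
    return min(max(n, 1), max_reducers)

def runtime_reducers(actual_mapper_output_bytes, planned):
    """Runtime auto-reducer parallelism: if mappers filtered out most records,
    merge adjacent reducers down to what the actual data volume needs."""
    needed = (actual_mapper_output_bytes + BYTES_PER_REDUCER - 1) // BYTES_PER_REDUCER
    return min(planned, max(needed, 1))

planned = plan_time_reducers(100 * 1024**3)      # 100 GB estimated at plan time
actual = runtime_reducers(2 * 1024**3, planned)  # mappers emitted only 2 GB
print(planned, actual)  # 400 8
```

If that matches what SetReducerParallelism and the Tez auto-parallelism logic do, then the plan-time value really is just an upper bound, as you said.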
> Cheers,
> Gopal

Thank you very much!

Yunqi

> From: Yunqi Zhang <yu...@umich.edu>
> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
> Date: Tuesday, June 9, 2015 at 5:07 PM
> To: "user@hive.apache.org" <user@hive.apache.org>
> Subject: Hive on Tez
>
> Hi guys,
>
> I’m playing with the code that integrates Hive on Tez, and have a couple of
> questions regarding the resource allocation.
>
> To my understanding (correct me if I am wrong), Hive creates a DAG
> composed of MapVertex and ReduceVertex, where each Vertex will later be
> translated to tasks running on potentially multiple containers by Tez. I was
> wondering how the resource requirement is determined (how many containers
> are needed for each Vertex, and what are the requirements for CPU and
> memory, etc.) in the current implementation, and where I can find the code
> corresponding to this.
>
> Thank you!
>
> Yunqi