Oh, I saw some useful messages about data statistics in TEZ-1167. Now, my main confusions are: (1) how does the ShuffleVertexManager of the reduce vertex decide when it has received enough sample data to estimate the parallelism of the whole vertex? (2) what is the relationship between an edge and an event? My current (possibly wrong) picture of (1) is sketched just below.
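Here is a minimal sketch of how I currently picture the "enough samples" check for (1). All the names below are my own; I am only guessing that the decision is tied to a slow-start-style minimum source-completion fraction (something like tez.shuffle-vertex-manager.min-src-fraction), so please correct me if this is wrong:

// Not Tez source code -- just my guess at the "enough samples" decision.
// I assume the manager waits until a configured fraction of the producer
// tasks have reported their output-size statistics before estimating.
class SampleSufficiencyCheck {
    private final double minSrcFraction;   // e.g. 0.25; I assume this plays the
                                            // role of the min-src-fraction config
    private final int totalSourceTasks;

    SampleSufficiencyCheck(double minSrcFraction, int totalSourceTasks) {
        this.minSrcFraction = minSrcFraction;
        this.totalSourceTasks = totalSourceTasks;
    }

    // True once enough producer tasks have reported statistics to attempt
    // a parallelism estimate.
    boolean enoughSamples(int completedSourceTasks) {
        return completedSourceTasks >= Math.ceil(minSrcFraction * totalSourceTasks);
    }
}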
I am eager to get your instruction. Any reply would be very much appreciated.

At 2016-02-27 11:13:48, "LLBian" <linanmengxia...@126.com> wrote:
>
> Hello, respected experts:
>
> Recently I have been studying Tez reduce auto-parallelism. I read the article
> "Apache Tez: Dynamic Graph Reconfiguration", TEZ-398, and HIVE-7158.
> HIVE-7158 says that "Tez can optionally sample data from a fraction of the
> tasks of a vertex and use that information to choose the number of
> downstream tasks for any given scatter gather edge".
> I know how to use this optimization, but I was confused by this passage:
>
> "Tez defines a VertexManager event that can be used to send an arbitrary user
> payload to the vertex manager of a given vertex. The partitioning tasks (say
> the Map tasks) use this event to send statistics such as the size of the
> output partitions produced to the ShuffleVertexManager for the reduce vertex.
> The manager receives these events and tries to model the final output
> statistics that would be produced by all the tasks."
>
> (1) How is the actual "sampling" implemented? I mean, how does the reduce
> ShuffleVertexManager know how much sample data is enough to estimate the
> parallelism of the whole vertex? Is that related to reduce slow-start?
> I studied the source code of Apache Tez 0.7.0, but it is still not very clear
> to me. Maybe I just have not understood it properly.
> (2) Do the partitioning tasks proactively send their output data statistics
> to the consumer ShuffleVertexManager? Is the event sent via RPC or HTTP?
>
> I am eager to get your instruction.
>
> Any reply would be very much appreciated.
>
>
> LuluBian
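P.S. To make my confusion in (1) more concrete, below is a rough sketch of how I imagine the manager extrapolates the sampled statistics and picks the reduce parallelism once it has enough events. Again, this is only my own guess and naming, not the actual ShuffleVertexManager implementation; desiredTaskInputSize is meant to mirror what I understand tez.shuffle-vertex-manager.desired-task-input-size to control. Please point out where this picture is wrong.

// My guess at the extrapolation step, not Tez source code.
class ParallelismEstimator {
    private final long desiredTaskInputSize;  // target bytes of input per reduce task

    ParallelismEstimator(long desiredTaskInputSize) {
        this.desiredTaskInputSize = desiredTaskInputSize;
    }

    // reportedOutputBytes: sum of partition sizes from the VertexManager events
    //                      received so far
    // completedSourceTasks / totalSourceTasks: how many producer tasks reported
    // currentParallelism: the reduce task count configured in the DAG
    int estimateParallelism(long reportedOutputBytes,
                            int completedSourceTasks,
                            int totalSourceTasks,
                            int currentParallelism) {
        // Scale the sampled output up to the whole source vertex.
        double expectedTotalBytes =
                (double) reportedOutputBytes * totalSourceTasks / completedSourceTasks;
        // Aim for one reduce task per desiredTaskInputSize bytes, and only ever
        // shrink the parallelism, never grow it.
        int desired = (int) Math.ceil(expectedTotalBytes / desiredTaskInputSize);
        return Math.max(1, Math.min(currentParallelism, desired));
    }
}

For example, if 10 of 100 map tasks have reported 1 GB of output in total, the estimate would be about 10 GB overall, and with a 256 MB desired task input size the reduce vertex would be shrunk to roughly 40 tasks. Is that roughly how it works?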