Oh, I saw some useful messages about data statistics in TEZ-1167. Now, my main confusions are: (1) how does the ShuffleVertexManager of the reduce vertex decide when it has received enough sample data to estimate the parallelism of the whole vertex? (2) what is the relationship between an edge and an event? My current (possibly wrong) picture of (1) is sketched just below.
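Here is a minimal sketch of how I currently picture the "enough samples" check for (1). All the names below are my own; I am only guessing that the decision is tied to a slow-start-style minimum source-completion fraction (something like tez.shuffle-vertex-manager.min-src-fraction), so please correct me if this is wrong:

// Not Tez source code -- just my guess at the "enough samples" decision.
// I assume the manager waits until a configured fraction of the producer
// tasks have reported their output-size statistics before estimating.
class SampleSufficiencyCheck {
    private final double minSrcFraction;   // e.g. 0.25; I assume this plays the
                                            // role of the min-src-fraction config
    private final int totalSourceTasks;

    SampleSufficiencyCheck(double minSrcFraction, int totalSourceTasks) {
        this.minSrcFraction = minSrcFraction;
        this.totalSourceTasks = totalSourceTasks;
    }

    // True once enough producer tasks have reported statistics to attempt
    // a parallelism estimate.
    boolean enoughSamples(int completedSourceTasks) {
        return completedSourceTasks >= Math.ceil(minSrcFraction * totalSourceTasks);
    }
}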
I am eager to get your instruction. Any reply would be very much appreciated.

At 2016-02-27 11:13:48, "LLBian" <linanmengxia...@126.com> wrote:
>
> Hello, respected experts:
>
> Recently I have been studying Tez reduce auto-parallelism. I read the article
> "Apache Tez: Dynamic Graph Reconfiguration", TEZ-398, and HIVE-7158.
> HIVE-7158 says that "Tez can optionally sample data from a fraction of the
> tasks of a vertex and use that information to choose the number of
> downstream tasks for any given scatter gather edge".
> I know how to use this optimization, but I was confused by this passage:
>
> "Tez defines a VertexManager event that can be used to send an arbitrary user
> payload to the vertex manager of a given vertex. The partitioning tasks (say
> the Map tasks) use this event to send statistics such as the size of the
> output partitions produced to the ShuffleVertexManager for the reduce vertex.
> The manager receives these events and tries to model the final output
> statistics that would be produced by all the tasks."
>
> (1) How is the actual "sampling" implemented? I mean, how does the reduce
> ShuffleVertexManager know how much sample data is enough to estimate the
> parallelism of the whole vertex? Is that related to reduce slow-start?
> I studied the source code of Apache Tez 0.7.0, but it is still not very clear
> to me. Maybe I just have not understood it properly.
> (2) Do the partitioning tasks proactively send their output data statistics
> to the consumer ShuffleVertexManager? Is the event sent via RPC or HTTP?
>
> I am eager to get your instruction.
>
> Any reply would be very much appreciated.
>
>
> LuluBian
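P.S. To make my confusion in (1) more concrete, below is a rough sketch of how I imagine the manager extrapolates the sampled statistics and picks the reduce parallelism once it has enough events. Again, this is only my own guess and naming, not the actual ShuffleVertexManager implementation; desiredTaskInputSize is meant to mirror what I understand tez.shuffle-vertex-manager.desired-task-input-size to control. Please point out where this picture is wrong.

// My guess at the extrapolation step, not Tez source code.
class ParallelismEstimator {
    private final long desiredTaskInputSize;  // target bytes of input per reduce task

    ParallelismEstimator(long desiredTaskInputSize) {
        this.desiredTaskInputSize = desiredTaskInputSize;
    }

    // reportedOutputBytes: sum of partition sizes from the VertexManager events
    //                      received so far
    // completedSourceTasks / totalSourceTasks: how many producer tasks reported
    // currentParallelism: the reduce task count configured in the DAG
    int estimateParallelism(long reportedOutputBytes,
                            int completedSourceTasks,
                            int totalSourceTasks,
                            int currentParallelism) {
        // Scale the sampled output up to the whole source vertex.
        double expectedTotalBytes =
                (double) reportedOutputBytes * totalSourceTasks / completedSourceTasks;
        // Aim for one reduce task per desiredTaskInputSize bytes, and only ever
        // shrink the parallelism, never grow it.
        int desired = (int) Math.ceil(expectedTotalBytes / desiredTaskInputSize);
        return Math.max(1, Math.min(currentParallelism, desired));
    }
}

For example, if 10 of 100 map tasks have reported 1 GB of output in total, the estimate would be about 10 GB overall, and with a 256 MB desired task input size the reduce vertex would be shrunk to roughly 40 tasks. Is that roughly how it works?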