Hi Giacomo90,

I'm not aware of a detailed description of the execution plan.
The plan can be used to identify the execution strategies (shipping and
local) chosen by the optimizer and some properties of the data
(partitioning, order).
Common shipping strategies include FORWARD (local forwarding, no network
transfer) and HASH_PARTITION (shuffling by key).
Common local strategies are SORT (sorts the data set), HASH_FIRST_BUILD
(builds a hash table from the first input and probes it with the second
input), and SORT_MERGE (sort-merge join, which requires both inputs to be
sorted). There are a few more strategies.
Note that operators in the plan can be chained together when the program
is executed and will then appear as a single node.
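As a rough illustration, the plan JSON lists the operators under a "nodes"
array, and each node's "predecessors" entries carry a "ship_strategy" string.
The sketch below pulls those strategy strings out of a made-up plan snippet;
the snippet itself and the exact field names are assumptions (they may differ
between Flink versions), so treat this as a starting point only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlanStrategies {

    // Made-up snippet resembling getExecutionPlan() output; field names
    // like "nodes" and "ship_strategy" are assumptions, not a guarantee.
    static final String PLAN =
        "{\"nodes\": ["
        + "{\"id\": 1, \"type\": \"source\", \"contents\": \"ReadFile\"},"
        + "{\"id\": 2, \"type\": \"pact\", \"contents\": \"GroupReduce\","
        + " \"predecessors\": [{\"id\": 1, \"ship_strategy\": \"Hash Partition on [0]\"}]}"
        + "]}";

    // Collect every ship_strategy value found in the plan JSON.
    static List<String> shipStrategies(String planJson) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("\"ship_strategy\": \"([^\"]+)\"").matcher(planJson);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : shipStrategies(PLAN)) {
            System.out.println(s);
        }
    }
}
```

Grepping the strategies out like this is enough to check which shuffles the
optimizer picked, without a full JSON parser.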

Also note that the plan does not contain any details about the data sizes
(if you see numbers there, they are mostly inaccurate estimates).
The web dashboard shows some metrics on the processed data volumes.

By the way, you can visualize the JSON with this online tool [1].

Best, Fabian

[1] http://flink.apache.org/visualizer/

2017-04-22 16:29 GMT+02:00 giacom...@libero.it <giacom...@libero.it>:

> Plus, I'm currently using 1.1.2 and I cannot change the version due to
> dependency problems.
> Thanks in advance,
>
>      Giacomo90
>
> >----Original message----
> >From: "giacom...@libero.it" <giacom...@libero.it>
> >Date: 21/04/2017 17.42
> >To: <user@flink.apache.org>
> >Subj: Re: WELCOME to user@flink.apache.org
> >
> >Dear Users and Apache Flink devs,
> >
> >         For each one of my distributed computations, I'm generating and
> >reading the JSON files produced by getExecutionPlan() in order to
> >motivate my benchmarks. Is there some guide providing an explanation of
> >the exact meaning of the fields of the generated JSON file? I'm trying
> >to differentiate, from the timing results, which part of the computation
> >time was spent sending messages and which was spent during either I/O
> >or CPU operations.
> >         By the way, I also noticed that I do not get any information
> >concerning the actual data that is being used and transmitted throughout
> >the network (the actual data size and the messages' data size).
> >         Moreover, I'm currently using the following way to get the JSON
> >file:
> >
> >> createAndRegisterDataSinks();
> >> String plan = globalEnvironment.getExecutionPlan();
> >> createAndRegisterDataSinks();
> >> globalEnvironment.execute(getClass().getSimpleName()); // Running the actual class
> >
> >          Is there a better way to do it?
> >          Thanks in advance for your support,
> >
> >    Giacomo90
> >
>
>
>
