Sorry for the late reply.

> So you would have a high data skew while 1 subtask is receiving all the
data, but on average (say over 1-2 days) data skew would come down to 0
because all subtasks would have received their portion of the data.
> I'm inclined to think that the current proposal might still be fair, as
you do indeed have a skew by definition (but an intentional one). We can
have a few ways forward:
>
> 0) We can keep the behaviour as proposed. My thoughts are that data skew
is data skew, however intentional it may be. It is not necessarily bad,
like in your example.

It makes sense to me. Flink should show data skew correctly,
regardless of whether the skew is intentional or not.


> 1) Show data skew based on the beginning of time (not a live/current score).
I mentioned some downsides to this in the FLIP: If you break or fix your
data skew recently, the historical data might hide the recent fix/breakage,
and it is inconsistent with the other metrics shown on the vertices e.g.
Backpressure/Busy metrics show the live/current score.
>
> 2) We can choose not to put data skew score on the vertices on the job
graph. And instead just use the new proposed Data Skew tab which could show
live/current skew score and the total data skew score from the beginning of
job.

That makes sense to me: we can show the current skew score in the DAG in
the WebUI by default,
and provide both the total and the current score in the detailed tab.

I didn't see the detailed design in the FLIP; would you mind improving
the design doc? Thanks!

Also, I have 2 questions for now:

1. About the current skew score, I still don't understand how to get
the list_of_number_of_records_received_by_each_subtask entry for
each subtask.

Is the entry for subtask 1 computed as the total records received by
subtask 1 from the beginning until now, minus the total records
received by subtask 1 from the beginning until (now - 1min)?

Note: 1min is an example. 30s or 2min is fine for me.
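
To make my question concrete, here is a rough Java sketch of the delta
computation I have in mind: sample the cumulative per-subtask counters
once per window and subtract the previous sample. This is just my
guess, not the FLIP's actual implementation, and the class and method
names are made up for illustration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: derive "records received in the last window" per subtask from
// cumulative per-subtask counters by sampling once per window and
// subtracting the previous sample. Names are illustrative only.
public class WindowedRecordsPerSubtask {

    // Last sampled cumulative count, keyed by subtask index.
    private final Map<Integer, Long> previousTotals = new HashMap<>();

    // currentTotals: cumulative records received per subtask, sampled now.
    // Returns the records received by each subtask during the last window.
    public long[] recordsInLastWindow(List<Long> currentTotals) {
        long[] delta = new long[currentTotals.size()];
        for (int i = 0; i < currentTotals.size(); i++) {
            long now = currentTotals.get(i);
            long before = previousTotals.getOrDefault(i, 0L);
            delta[i] = Math.max(0L, now - before); // guard against counter resets
            previousTotals.put(i, now);            // remember this sample for the next window
        }
        return delta;
    }
}

Is this roughly how the current (windowed) counts would be obtained?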

2. The skew score is shown as a percentage

I'm not sure whether showing the score as a percentage is reasonable.
For the busy ratio or backpressure ratio, a percentage format is
intuitive.

IIUC, your proposed score is between 0% and 100%, where 0% is the best
and 100% is the worst.

For data skew, I wonder whether a ratio (a multiple) would be more
intuitive, i.e. data skew score = max / mean.

For example, if we have 5 subtasks and the received record counts are
[10, 10, 10, 100, 10], then
data skew score = max / mean = 100 / (140 / 5) = 100 / 28 ≈ 3.57.

This score ranges from 1 to infinity, where 1 is the best and larger
values indicate worse skew.
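
For reference, a tiny Java sketch of the max/mean scoring I mean (again,
the class and method names are made up; the per-subtask counts would
come from whatever list the FLIP already computes):

import java.util.Arrays;

public class MaxOverMeanSkewScore {

    // recordsPerSubtask: records received by each subtask in the window.
    // Returns max / mean: 1.0 means perfectly balanced, larger is worse.
    public static double skewScore(long[] recordsPerSubtask) {
        double mean = Arrays.stream(recordsPerSubtask).average().orElse(0.0);
        long max = Arrays.stream(recordsPerSubtask).max().orElse(0L);
        // With no records the score is undefined; report 1.0 (no skew) here.
        return mean == 0.0 ? 1.0 : max / mean;
    }

    public static void main(String[] args) {
        // Example from above: [10, 10, 10, 100, 10] -> 100 / 28 ≈ 3.57
        System.out.println(skewScore(new long[] {10, 10, 10, 100, 10}));
    }
}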

Looking forward to your opinions.

Best,
Rui

On Tue, Jan 23, 2024 at 6:41 PM Kartoglu, Emre <kar...@amazon.co.uk.invalid>
wrote:

> Hi Krzysztof,
>
> Thank you for the feedback! Please find my comments below.
>
> 1. Configurability
>
> Adding a feature flag / configuration to enable this is still on the table
> as far as I am concerned. However I believe adding a new metric shouldn't
> warrant a flag/configuration. One might argue that we should have it for
> showing the metrics on the Flink UI, and I'd appreciate input on this. My
> default position is to not have a configuration/flag unless there is a good
> reason (e.g. it turns out there is impact on Flink UI for so far unknown
> reason). This is because the proposed change should only be improving the
> experience without any unwanted side effect.
>
> 2. Metrics
>
> I agree the new metrics should be compatible with the rest of the Flink
> metric reporting mechanism. I will update the FLIP and propose names for
> the metrics.
>
> Kind regards,
> Emre
>
> On 23/01/2024, 10:31, "Krzysztof Dziołak" <kdzio...@live.com> wrote:
>
>
> Hi Emre,
>
>
> Thank you for driving this proposal. I've got two questions about the
> extensions to the proposal that are not captured in the FLIP.
>
>
>
>
> 1. Configurability - what kind of configuration would you propose to
> maintain for this feature? Would On/off switch and/or aggregated period
> length be configurable? Should we capture the toggles in the FLIP ?
> 2. Metrics - are we planning to emit the skew metric via metric reporters
> mechanism. Should we capture proposed metric schema in the FLIP ?
>
>
> Kind regards,
> Krzysztof
>
>
> ________________________________
> From: Kartoglu, Emre <kar...@amazon.co.uk.invalid>
> Sent: Monday, January 15, 2024 4:59 PM
> To: dev@flink.apache.org
> Subject: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard
>
>
> Hello,
>
>
> I’m opening this thread to discuss a FLIP[1] to make data skew more
> visible on Flink Dashboard.
>
>
> Data skew is currently not as visible as it should be. Users have to click
> each operator and check how much data each sub-task is processing and
> compare the sub-tasks against each other. This is especially cumbersome and
> error-prone for jobs with big job graphs and high parallelism. I’m
> proposing this FLIP to improve this.
>
>
> Kind regards,
> Emre
>
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
