Hi everyone,

Disclaimer: I read the contribution guide on improvement requests (i.e. I
should really just open a Jira ticket), but I thought it would make sense
to run this through the mailing list first. After collecting some input I
would then create the Jira ticket.

When using the Flink Web Dashboard (which I do almost every day to check
the status of a job), I recently felt that the information shown in the
top portion of the start page could be improved a lot. I created a first
mock-up by moving HTML elements around and want to share it now:

[image: image.png]

With the exception of the metrics (see below), none of this information
should be new; it is only reorganized to speed up investigation and
monitoring:

   - Complete overview of the cluster status and health, without clicking
   through a lot of pages:
      - Active and stand-by Job Managers. Their health is shown as a color
      (as a first suggestion: whether the last heartbeat is inside
      heartbeat.timeout).
      - Currently registered Task Managers:
         - The little bar on the side indicates task slot usage. I did not
         color it, since a fully utilised task manager is not necessarily
         something bad.
         - The color indicates the health of the task manager (as a first
         suggestion: whether the last heartbeat is inside
         heartbeat.timeout).
      - Overview of some cluster metrics.
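
To make the health and slot-usage suggestions above a bit more concrete,
here is a minimal sketch of how they could be computed. Everything in it is
an assumption for illustration: the TaskManagerInfo fields are hypothetical
and not Flink's actual REST schema, the green/red colors are just one
possible choice, and the 50 000 ms timeout is a placeholder for whatever
heartbeat.timeout is set to on the cluster.

```python
from dataclasses import dataclass

# Assumption: value of heartbeat.timeout in milliseconds. Replace it with
# the value configured on your cluster.
HEARTBEAT_TIMEOUT_MS = 50_000


@dataclass
class TaskManagerInfo:
    """Hypothetical snapshot of one task manager (not Flink's REST schema)."""
    last_heartbeat_ms: int  # epoch millis of the last heartbeat received
    slots_total: int
    slots_free: int


def health_color(tm, now_ms, timeout_ms=HEARTBEAT_TIMEOUT_MS):
    """Return "green" while the last heartbeat is inside the timeout, else "red"."""
    return "green" if now_ms - tm.last_heartbeat_ms <= timeout_ms else "red"


def slot_usage(tm):
    """Return the fraction of occupied task slots (for the uncolored usage bar)."""
    if tm.slots_total == 0:
        return 0.0
    return (tm.slots_total - tm.slots_free) / tm.slots_total
```

The same check would apply to Job Managers. Slot usage is kept separate
from the health color on purpose, since a fully used task manager is not
necessarily unhealthy.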

Some points to notice:

   - All data in the screenshot is mocked; no number relates to any other
   number. The colors, however, should already match the numbers they
   indicate.
   - All of this could also be done with whatever monitoring solution
   someone already has in their company, by reading the JMX metrics and
   plotting them there (e.g. in Grafana). But an out-of-the-box solution
   would save everyone from building it on their own, and they could
   trust the metrics shown here.
   - Some of the metrics can only be implemented once FLINK-7286
   <https://issues.apache.org/jira/browse/FLINK-7286> is done. So I
   would split the implementation into two parts (cluster overview and
   metrics) and do them separately.
   - This first mock-up is targeted at what we here at Zalando would like
   to see at first glance, so it fits our use case very well. We mostly use
   long-running session clusters.
   - I'm more of a backend guy with some frontend expertise (mostly in
   React; I have no experience with Angular 1, which the Flink Web
   Dashboard is currently built with) and not at all a designer.

What do you think? I would be glad to get some feedback on this,
especially on whether it makes sense for the broader community. I will
implement this one way or another: if not in the Flink master branch, then
as an open-source project that anyone can deploy next to their Flink
clusters. But I first wanted to run it through here to see if it sparks
any interest.

Please also let me know if you already see difficulties in implementing
this; maybe I have overlooked something.

Can't wait for your input.

Cheers

--


*Fabian Wollert*
*Zalando SE*

E-Mail: fab...@zalando.de
