Hi Rong Rong,

Thanks for the proposal. We have also run into pain points with the
history server. To address them, we propose a trace system for historical
information, which is very similar to the metric system.

A trace is semi-structured information about events in Flink. Useful traces
include:
* job traces, which contain the job graph of a submitted job.
* schedule traces, which typically contain information about the task
slots. They are generated when a job finishes, fails, or is canceled. As a
job may restart multiple times, a job typically has multiple schedule
traces.
* checkpoint traces, which are generated when a checkpoint completes or
fails.
* task manager traces, which are generated when a task manager terminates.
Users can access the link to the aggregated logs in task manager traces.
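
To make the idea of a semi-structured trace more concrete, here is a minimal sketch in Java. It is only an illustration under assumptions: the `Trace` class, the `Kind` enum, and the attribute keys are hypothetical names and do not exist in Flink.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a semi-structured trace; not an existing Flink type.
final class Trace {
    // One kind per trace category listed above.
    enum Kind { JOB, SCHEDULE, CHECKPOINT, TASK_MANAGER }

    private final Kind kind;
    private final long timestampMillis;
    // Free-form key/value payload; this is the "semi-structured" part.
    private final Map<String, String> attributes = new LinkedHashMap<>();

    Trace(Kind kind, long timestampMillis) {
        this.kind = kind;
        this.timestampMillis = timestampMillis;
    }

    // Adds one attribute and returns this trace so calls can be chained.
    Trace with(String key, String value) {
        attributes.put(key, value);
        return this;
    }

    Kind kind() { return kind; }

    long timestampMillis() { return timestampMillis; }

    Map<String, String> attributes() { return attributes; }
}
```

For example, a task manager trace could carry the log link as an ordinary attribute, e.g. `new Trace(Trace.Kind.TASK_MANAGER, ts).with("aggregatedLogUrl", url)`, where the key name is again only an assumption.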

Users can use TraceReport to collect traces in Flink and export them to
external storage (e.g., ElasticSearch). By retrieving traces when
exceptions happen, we can improve the user experience in alerting.
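
The export path could look roughly like the sketch below. It is hypothetical: the `TraceReporter` interface is an assumed name modeled loosely on Flink's `MetricReporter`, the `jobId` key is made up, and the in-memory class only stands in for a real exporter to external storage such as ElasticSearch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface; not part of Flink.
interface TraceReporter {
    // Called whenever a trace is generated.
    void report(Map<String, String> trace);
}

// Stand-in for an exporter to external storage; keeps traces in memory.
final class InMemoryTraceReporter implements TraceReporter {
    private final List<Map<String, String>> stored = new ArrayList<>();

    @Override
    public void report(Map<String, String> trace) {
        // Copy defensively so later changes by the caller do not affect storage.
        stored.add(new HashMap<>(trace));
    }

    // Retrieves all traces of one job, e.g. when an exception triggers an alert.
    List<Map<String, String>> findByJob(String jobId) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> trace : stored) {
            if (jobId.equals(trace.get("jobId"))) {
                result.add(trace);
            }
        }
        return result;
    }
}
```

An alerting component would then call something like `findByJob(...)` against the external storage to attach the relevant schedule and checkpoint traces to the alert.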

Regards,
Xiaogang

On Thu, Feb 13, 2020 at 9:41 AM, Rong Rong <walter...@gmail.com> wrote:

> Hi All,
>
> Recently we have been experimenting with using Flink’s history server as a
> centralized debugging service for completed streaming jobs.
>
> Specifically, we dynamically generate links to access log files on the YARN
> host; in the meantime, we use the Flink history server to show job graphs,
> exceptions and other info of the completed jobs[2].
>
> This causes some pain for our users, namely: It is inconvenient to go to
> YARN host to access logs; then go to Flink history server for the other
> information.
>
> Thus we would like to propose an improvement to the current Flink history
> server:
>
>    - To support dynamic links to residual log files from the host machine
>      within the retention period [3];
>    - To support dynamic links to aggregated log files provided by the
>      cluster, if supported: such as Hadoop HistoryServer[1], or Kubernetes
>      cluster level logging[4]?
>       - Similar integration with Hadoop HistoryServer was already proposed
>         before[5] with slightly different approach.
>
>
> Any feedback and suggestions are highly appreciated!
>
> --
>
> Rong
>
> [1] https://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html
> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/historyserver.html
> [3] https://hadoop.apache.org/docs/r2.9.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml#yarn.nodemanager.log.retain-seconds
> [4] https://kubernetes.io/docs/concepts/cluster-administration/logging/#cluster-level-logging-architectures
> [5] https://issues.apache.org/jira/browse/FLINK-14317
>
