Re: [DISCUSS] Improve history server with log support

Yang Wang Sun, 16 Feb 2020 19:02:50 -0800

Hi Rong Rong,


Thanks for starting this discussion. I think logs are an important part
of improving the user experience of Flink. They are essential for
debugging problems and for checking the expected output. Some users,
especially in machine learning, print global steps or residuals to the
logs. Whether the application finishes successfully or not, the logs
should remain accessible.


Currently, when deploying Flink on YARN, the application logs are
aggregated to a configured path on HDFS, organized by host. The command
`yarn logs -applicationId <application ID>` can be used to fetch the
logs to the local machine.


For K8s deployments, a DaemonSet or a sidecar container can be used to
collect logs into persistent storage (e.g. HDFS, S3, Elasticsearch).


So I am wondering whether we could provide a unified API and storage
format for logs. We could then add a separate implementation for each
storage type and use it in the history server. Users could then get the
logs from the history server just as if the Flink cluster were still
running.
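
As a rough illustration of the unified API idea, here is a minimal
sketch. It is written in Python only for brevity; a real implementation
would live in Flink's Java code base. Every name in it (`LogStore`,
`LocalDirLogStore`, `create_log_store`, and the `<root>/<job id>/<file>`
layout) is a hypothetical placeholder and not an existing Flink API. A
local directory stands in for HDFS or S3 so that the sketch stays
runnable.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List, Type


class LogStore(ABC):
    """Hypothetical unified interface the history server would talk to."""

    @abstractmethod
    def list_logs(self, job_id: str) -> List[str]:
        """Return the names of the log files stored for a job."""

    @abstractmethod
    def read_log(self, job_id: str, name: str) -> str:
        """Return the content of one log file of a job."""


class LocalDirLogStore(LogStore):
    """Stand-in backend using the assumed layout <root>/<job_id>/<file>.

    An HDFS, S3, or Elasticsearch backend would implement the same two
    methods against its own client library.
    """

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def list_logs(self, job_id: str) -> List[str]:
        job_dir = self.root / job_id
        if not job_dir.is_dir():
            return []  # nothing was collected for this job
        return sorted(p.name for p in job_dir.iterdir() if p.is_file())

    def read_log(self, job_id: str, name: str) -> str:
        return (self.root / job_id / name).read_text()


# Registry mapping a configured storage type to its implementation.
BACKENDS: Dict[str, Type[LogStore]] = {"file": LocalDirLogStore}


def create_log_store(kind: str, **options) -> LogStore:
    """Pick the backend by its configured storage type."""
    if kind not in BACKENDS:
        raise ValueError(f"unknown log storage type: {kind}")
    return BACKENDS[kind](**options)
```

With something like this, the history server would only depend on
`LogStore`, and each deployment (YARN, K8s) would select a backend
through configuration, e.g. `create_log_store("file", root="/tmp/logs")`.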



Best,
Yang

Venkata Sanath Muppalla <[email protected]> wrote on Sat, Feb 15, 2020 at 3:19 PM:

> @Xiaogang Could you please share more details about the trace mechanism
> you mentioned? As Rong mentioned, we are also working on something similar.
>
> On Fri, Feb 14, 2020, 9:12 AM Rong Rong <[email protected]> wrote:
>
> > Thank you for the prompt feedback.
> >
> > @Aljoscha. Yes, you are absolutely correct: adding a Hadoop dependency to
> > the cluster runtime components is definitely not what we are proposing.
> > We were trying to see what the community thinks about the idea of adding
> > log support to the history server.
> >   - The reference to this JIRA ticket is more about the intention than
> > the solution. In fact the intention is slightly different: we were
> > trying to put it in the history server, while the original JIRA proposed
> > adding it to the live runtime modules.
> >   - IMO, in order to support different cluster environments, the generic
> > cluster component should only provide an interface, which each cluster
> > implementation module should extend.
> >
> >
> > @Xiaogang, thank you for bringing up the idea of utilizing a trace
> > system.
> >
> > Event tracing would definitely provide additional, and in fact more
> > valuable, information for debugging purposes.
> > We have also been experimenting internally with an idea similar to
> > Spark's ListenerInterface [1] to capture some of the important messages
> > sent via Akka.
> > But we are still at a very early stage, so we haven't
> > included it in this discussion.
> >
> > We would love to hear more about the trace system you proposed. Could
> > you share more information about it,
> > such as how the live events would be listened to, how the traces would
> > be collected and stored, etc.?
> >
> >
> > [1]
> > https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/scheduler/SparkListener.html
> >
> > Thanks,
> > Rong
> >
> >
> > On Thu, Feb 13, 2020 at 7:33 AM Aljoscha Krettek <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > what's the difference in approach to the mentioned related Jira Issue
> > > ([1])? I commented there because I'm skeptical about adding
> > > Hadoop-specific code to the generic cluster components.
> > >
> > > Best,
> > > Aljoscha
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-14317
> > >
> > > On 13.02.20 03:47, SHI Xiaogang wrote:
> > > > Hi Rong Rong,
> > > >
> > > > Thanks for the proposal. We are also suffering from some pain points
> > > > with the history server. To address them, we propose a trace system,
> > > > which is very similar to the metric system, for historical information.
> > > >
> > > > A trace is semi-structured information about events in Flink. Useful
> > > > traces include:
> > > > * job traces: which contain the job graph of submitted jobs.
> > > > * schedule traces: A schedule trace is typically composed of the
> > > > information of task slots. They are generated when a job finishes, fails,
> > > > or is canceled. As a job may restart multiple times, a job typically has
> > > > multiple schedule traces.
> > > > * checkpoint traces: which are generated when a checkpoint completes or
> > > > fails.
> > > > * task manager traces: which are generated when a task manager terminates.
> > > > Users can access the link to aggregated logs in task manager traces.
> > > >
> > > > Users can use TraceReport to collect traces in Flink and export them to
> > > > external storage (e.g., Elasticsearch). By retrieving traces when
> > > > exceptions happen, we can improve the user experience in alerting.
> > > >
> > > > Regards,
> > > > Xiaogang
> > > >
> > > > Rong Rong <[email protected]> wrote on Thu, Feb 13, 2020 at 9:41 AM:
> > > >
> > > >> Hi All,
> > > >>
> > > >> Recently we have been experimenting with using Flink’s history server as a
> > > >> centralized debugging service for completed streaming jobs.
> > > >>
> > > >> Specifically, we dynamically generate links to access log files on the YARN
> > > >> host; in the meantime, we use the Flink history server to show job graphs,
> > > >> exceptions and other info of the completed jobs[2].
> > > >>
> > > >> This causes some pain for our users, namely: it is inconvenient to go to the
> > > >> YARN host to access logs and then go to the Flink history server for the
> > > >> other information.
> > > >>
> > > >> Thus we would like to propose an improvement to the current Flink history
> > > >> server:
> > > >>
> > > >>     - To support dynamic links to residual log files from the host machine
> > > >>       within the retention period [3];
> > > >>     - To support dynamic links to aggregated log files provided by the
> > > >>       cluster, if supported: such as Hadoop HistoryServer[1], or Kubernetes
> > > >>       cluster level logging[4]?
> > > >>        - Similar integration with Hadoop HistoryServer was already proposed
> > > >>          before[5] with a slightly different approach.
> > > >>
> > > >>
> > > >> Any feedback and suggestions are highly appreciated!
> > > >>
> > > >> --
> > > >>
> > > >> Rong
> > > >>
> > > >> [1]
> > > >> https://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html
> > > >>
> > > >> [2]
> > > >> https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/historyserver.html
> > > >>
> > > >> [3]
> > > >> https://hadoop.apache.org/docs/r2.9.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml#yarn.nodemanager.log.retain-seconds
> > > >>
> > > >> [4]
> > > >> https://kubernetes.io/docs/concepts/cluster-administration/logging/#cluster-level-logging-architectures
> > > >>
> > > >> [5] https://issues.apache.org/jira/browse/FLINK-14317
> > > >>
> > > >
> > >
> >
>
