@Xiaogang Could you please share more details about the trace mechanism
you mentioned? As Rong mentioned, we are also working on something similar.

On Fri, Feb 14, 2020, 9:12 AM Rong Rong <walter...@gmail.com> wrote:

> Thank you for the prompt feedback.
>
> @Aljoscha. Yes, you are absolutely correct - adding a Hadoop dependency to
> the cluster runtime components is definitely not what we are proposing.
> We were trying to gauge how the community feels about the idea of adding
> log support to the history server.
>   - The reference to that JIRA ticket reflects the intention rather than
> the solution. In fact the intentions differ slightly: we are trying to put
> log support in the history server, while the original JIRA proposed adding
> it to the live runtime modules.
>   - IMO, in order to support different cluster environments, the generic
> cluster component should only provide an interface that each
> cluster-specific implementation module extends (see the sketch below).
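>
> As a rough illustration of that separation, the generic component could
> expose something like the following (a minimal sketch; all names are
> hypothetical, not an existing Flink API):
>
>     import java.util.Optional;
>
>     /**
>      * Hypothetical interface: the generic history server would depend
>      * only on this abstraction, never on Hadoop- or Kubernetes-specific
>      * classes.
>      */
>     public interface JobLogLocator {
>
>         /**
>          * Returns a link to the logs of the given job, if the underlying
>          * cluster still retains them within its retention period.
>          */
>         Optional<String> locateLogUrl(String jobId);
>     }
>
>     /** Example implementation, living in the YARN-specific module. */
>     class YarnJobLogLocator implements JobLogLocator {
>
>         private final String logServerBaseUrl;
>
>         YarnJobLogLocator(String logServerBaseUrl) {
>             this.logServerBaseUrl = logServerBaseUrl;
>         }
>
>         @Override
>         public Optional<String> locateLogUrl(String jobId) {
>             // Resolve against the cluster's log server; the exact path
>             // is deployment-specific and shown here only as a sketch.
>             return Optional.of(logServerBaseUrl + "/logs/" + jobId);
>         }
>     }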
>
>
> @Xiaogang, thank you for bringing up the idea of utilizing a trace system.
>
> Event tracing would definitely provide additional, and indeed more
> valuable, information for debugging purposes.
> In fact, we have also been experimenting internally with an idea similar
> to Spark's SparkListener interface [1] to capture some of the important
> messages sent via Akka.
> But this work is still at a very early, preliminary stage, so we have not
> included it in this discussion.
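>
> For context, the rough shape we have been experimenting with looks like
> the following (purely illustrative; these names are not part of any
> Flink API):
>
>     /**
>      * Illustrative listener interface, loosely modeled after Spark's
>      * SparkListener [1]. Implementations would be notified of important
>      * messages observed on the control plane.
>      */
>     public interface JobEventListener {
>
>         /** Invoked when a job is submitted to the cluster. */
>         void onJobSubmitted(String jobId, long timestampMillis);
>
>         /** Invoked when a task transitions between execution states. */
>         void onTaskStateChanged(
>                 String taskId, String fromState, String toState);
>
>         /** Invoked when a checkpoint completes or fails. */
>         void onCheckpointFinished(
>                 String jobId, long checkpointId, boolean succeeded);
>     }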
>
> We would love to hear more about the trace system you proposed. Could you
> share more information regarding it? For example, how are live events
> listened to, and how are traces collected and stored?
>
>
> [1]
> https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/scheduler/SparkListener.html
>
> Thanks,
> Rong
>
>
> On Thu, Feb 13, 2020 at 7:33 AM Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > Hi,
> >
> > what's the difference in approach compared to the related JIRA issue
> > ([1])? I commented there because I'm skeptical about adding
> > Hadoop-specific code to the generic cluster components.
> >
> > Best,
> > Aljoscha
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-14317
> >
> > On 13.02.20 03:47, SHI Xiaogang wrote:
> > > Hi Rong Rong,
> > >
> > > Thanks for the proposal. We are also suffering from some pains brought
> > > by the history server. To address them, we propose a trace system for
> > > historical information, which is very similar to the metric system.
> > >
> > > A trace is semi-structured information about events in Flink. Useful
> > > traces include:
> > > * job traces: which contain the job graph of submitted jobs.
> > > * schedule traces: which are typically composed of the information of
> > > task slots. They are generated when a job finishes, fails, or is
> > > canceled. As a job may restart multiple times, a job typically has
> > > multiple schedule traces.
> > > * checkpoint traces: which are generated when a checkpoint completes
> > > or fails.
> > > * task manager traces: which are generated when a task manager
> > > terminates. Users can access the link to aggregated logs in task
> > > manager traces.
> > >
> > > Users can use TraceReporter to collect traces in Flink and export them
> > > to external storage (e.g., Elasticsearch). By retrieving traces when
> > > exceptions happen, we can improve the user experience in alerting.
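> > >
> > > For illustration, such a reporter could look roughly like the
> > > following (a minimal sketch; all names are tentative):
> > >
> > >     import java.util.Map;
> > >
> > >     /**
> > >      * Tentative sketch of a trace reporter, by analogy with metric
> > >      * reporters: it receives every emitted trace and forwards it to
> > >      * external storage (e.g., Elasticsearch).
> > >      */
> > >     public interface TraceReporter {
> > >
> > >         /**
> > >          * Called once per trace, e.g. when a checkpoint completes or
> > >          * a task manager terminates. The attributes carry the
> > >          * semi-structured payload of the trace.
> > >          */
> > >         void report(String traceType, Map<String, String> attributes);
> > >
> > >         /** Flushes any buffered traces to the external storage. */
> > >         void flush();
> > >     }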
> > >
> > > Regards,
> > > Xiaogang
> > >
> > > Rong Rong <walter...@gmail.com> wrote on Thu, Feb 13, 2020, 9:41 AM:
> > >
> > >> Hi All,
> > >>
> > >> Recently we have been experimenting with using Flink’s history server
> > >> as a centralized debugging service for completed streaming jobs.
> > >>
> > >> Specifically, we dynamically generate links to access log files on the
> > >> YARN hosts; meanwhile, we use the Flink history server to show the job
> > >> graphs, exceptions, and other information of the completed jobs [2].
> > >>
> > >> This causes some pain for our users: it is inconvenient to go to the
> > >> YARN host to access logs, and then to the Flink history server for the
> > >> other information.
> > >>
> > >> Thus we would like to propose an improvement to the current Flink
> > >> history server:
> > >>
> > >>     - To support dynamic links to residual log files on the host
> > >>       machine within the retention period [3] (see the sketch after
> > >>       this list);
> > >>     - To support dynamic links to aggregated log files provided by the
> > >>       cluster, where supported: such as the Hadoop HistoryServer [1]
> > >>       or Kubernetes cluster-level logging [4].
> > >>         - A similar integration with the Hadoop HistoryServer was
> > >>           already proposed before [5], with a slightly different
> > >>           approach.
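> > >>
> > >> As a sketch of the first point (illustrative only; the exact URL
> > >> layout is deployment-specific), the history server could build links
> > >> to residual container logs served by the YARN NodeManager web UI,
> > >> valid only within the retention period [3]:
> > >>
> > >>     class LogUrls {
> > >>
> > >>         /** Illustrative helper, not an existing Flink API. */
> > >>         static String residualLogUrl(
> > >>                 String nodeHttpAddress, String containerId, String user) {
> > >>             // NodeManagers serve per-container logs under
> > >>             // /node/containerlogs/<containerId>/<user>.
> > >>             return String.format(
> > >>                     "http://%s/node/containerlogs/%s/%s",
> > >>                     nodeHttpAddress, containerId, user);
> > >>         }
> > >>     }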
> > >>
> > >>
> > >> Any feedback and suggestions are highly appreciated!
> > >>
> > >> --
> > >>
> > >> Rong
> > >>
> > >> [1]
> > >> https://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html
> > >>
> > >> [2]
> > >> https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/historyserver.html
> > >>
> > >> [3]
> > >> https://hadoop.apache.org/docs/r2.9.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml#yarn.nodemanager.log.retain-seconds
> > >>
> > >> [4]
> > >> https://kubernetes.io/docs/concepts/cluster-administration/logging/#cluster-level-logging-architectures
> > >>
> > >> [5]
> > >> https://issues.apache.org/jira/browse/FLINK-14317
> > >>
> > >
> >
>
