@Xiaogang Could you please share more details about the trace mechanism you mentioned? As Rong mentioned, we are also working on something similar.
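To make sure we are comparing the same thing, here is a minimal sketch (plain Java) of the shape we currently have in mind on our side. All names below are hypothetical - none of these types exist in Flink today; TraceReporter is simply modeled after Flink's existing MetricReporter:

import java.util.Map;

// A semi-structured record of a cluster event, e.g. a job finishing,
// a checkpoint failing, or a task manager terminating.
interface Trace {
    // event category, e.g. "job", "schedule", "checkpoint", "taskmanager"
    String type();

    // event time in epoch milliseconds
    long timestamp();

    // semi-structured payload, e.g. job id, slot assignments, log links
    Map<String, String> attributes();
}

// Pluggable exporter that ships traces to external storage such as
// Elasticsearch, analogous to how MetricReporter ships metrics.
interface TraceReporter {
    void open(Map<String, String> config); // e.g. the storage endpoint
    void report(Trace trace);
    void close();
}

Curious whether this is close to what you have.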
On Fri, Feb 14, 2020, 9:12 AM Rong Rong <walter...@gmail.com> wrote:

> Thank you for the prompt feedback.
>
> @Aljoscha. Yes, you are absolutely correct - adding a Hadoop dependency to
> the cluster runtime components is definitely not what we are proposing.
> We were trying to see how the community thinks about the idea of adding
> log support to the History Server.
> - The reference to that JIRA ticket is more about the intention than the
> solution; in fact the intention is slightly different: we are trying to
> put it in the History Server, while the original JIRA proposed to add it
> to the live runtime modules.
> - IMO, in order to support different cluster environments, the generic
> cluster component should only provide an interface, which each cluster
> implementation module should extend.
>
> @Xiaogang, thank you for bringing up the idea of utilizing a trace system.
> The event tracing would definitely provide additional - in fact more
> valuable - information for debugging purposes.
> We were also internally experimenting with an idea similar to Spark's
> SparkListener [1] to capture some of the important messages sent via Akka.
> But we are still at a very early, preliminary stage, so we haven't
> included it in this discussion.
>
> We would love to hear more about the trace system you proposed. Could you
> share more information about it - such as how the live events are listened
> to, and how the traces are collected/stored?
>
> [1]
> https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/scheduler/SparkListener.html
>
> Thanks,
> Rong
>
>
> On Thu, Feb 13, 2020 at 7:33 AM Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > Hi,
> >
> > what's the difference in approach to the mentioned related Jira issue
> > ([1])? I commented there because I'm skeptical about adding
> > Hadoop-specific code to the generic cluster components.
> >
> > Best,
> > Aljoscha
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-14317
> >
> > On 13.02.20 03:47, SHI Xiaogang wrote:
> > > Hi Rong Rong,
> > >
> > > Thanks for the proposal. We are also suffering from some pains brought
> > > by the History Server. To address them, we propose a trace system for
> > > historical information, which is very similar to the metric system.
> > >
> > > A trace is semi-structured information about events in Flink. Useful
> > > traces include:
> > > * job traces: which contain the job graph of submitted jobs.
> > > * schedule traces: a schedule trace is typically composed of the
> > > information of task slots. They are generated when a job finishes,
> > > fails, or is canceled. As a job may restart multiple times, a job
> > > typically has multiple schedule traces.
> > > * checkpoint traces: which are generated when a checkpoint completes
> > > or fails.
> > > * task manager traces: which are generated when a task manager
> > > terminates. Users can access the link to aggregated logs in task
> > > manager traces.
> > >
> > > Users can use TraceReport to collect traces in Flink and export them
> > > to external storage (e.g., Elasticsearch). By retrieving traces when
> > > exceptions happen, we can improve the user experience in alerting.
> > >
> > > Regards,
> > > Xiaogang
> > >
> > > Rong Rong <walter...@gmail.com> wrote on Thu, Feb 13, 2020 at 9:41 AM:
> > >
> > >> Hi All,
> > >>
> > >> Recently we have been experimenting with using Flink’s History Server
> > >> as a centralized debugging service for completed streaming jobs.
> > >>
> > >> Specifically, we dynamically generate links to access log files on
> > >> the YARN host; in the meantime, we use the Flink History Server to
> > >> show job graphs, exceptions, and other info of the completed jobs [2].
> > >>
> > >> This causes some pain for our users: it is inconvenient to go to the
> > >> YARN host to access logs, and then go to the Flink History Server for
> > >> the other information.
> > >>
> > >> Thus we would like to propose an improvement to the current Flink
> > >> History Server:
> > >>
> > >> - To support dynamic links to residual log files from the host
> > >> machine within the retention period [3];
> > >> - To support dynamic links to aggregated log files provided by the
> > >> cluster, if supported: such as the Hadoop HistoryServer [1], or
> > >> Kubernetes cluster-level logging [4].
> > >> - A similar integration with the Hadoop HistoryServer was already
> > >> proposed before [5] with a slightly different approach.
> > >>
> > >> Any feedback and suggestions are highly appreciated!
> > >>
> > >> --
> > >> Rong
> > >>
> > >> [1]
> > >> https://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html
> > >> [2]
> > >> https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/historyserver.html
> > >> [3]
> > >> https://hadoop.apache.org/docs/r2.9.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml#yarn.nodemanager.log.retain-seconds
> > >> [4]
> > >> https://kubernetes.io/docs/concepts/cluster-administration/logging/#cluster-level-logging-architectures
> > >> [5] https://issues.apache.org/jira/browse/FLINK-14317
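P.S. For concreteness, here is a rough sketch (again plain Java, all names hypothetical) of the pluggable log-link interface Rong described above: the generic History Server would only depend on the interface, and each cluster integration (YARN, Kubernetes, ...) would ship its own implementation, so no Hadoop-specific code ends up in the generic cluster components.

import java.util.Optional;

// Implemented by cluster-specific modules; the generic History Server
// code only sees this interface.
interface HistoricalLogProvider {
    // Returns a link to the aggregated logs of a finished task manager,
    // if the underlying cluster still retains them.
    Optional<String> getTaskManagerLogUrl(String jobId, String taskManagerId);
}

// Illustrative YARN implementation pointing at the Hadoop HistoryServer's
// aggregated logs. The URL shape below is purely illustrative and depends
// on the YARN log-aggregation configuration.
class YarnHistoricalLogProvider implements HistoricalLogProvider {
    private final String jobHistoryServerUrl; // e.g. "http://jhs-host:19888"

    YarnHistoricalLogProvider(String jobHistoryServerUrl) {
        this.jobHistoryServerUrl = jobHistoryServerUrl;
    }

    @Override
    public Optional<String> getTaskManagerLogUrl(String jobId, String taskManagerId) {
        return Optional.of(jobHistoryServerUrl + "/jobhistory/logs/" + taskManagerId);
    }
}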