Hi Tony, thanks for troubleshooting this. I have added a commit to https://github.com/apache/flink/pull/4586 that should enable you to use the reporter with 1.3.2 as well.

Best regards,
Max

> Tony Wei <mailto:tony19920...@gmail.com>
> 23 September 2017 at 13:11
>
> Hi Chesnay,
>
> I built another Flink cluster using version 1.4 and set the log level to
> DEBUG. I found that the root cause might be this exception:
> *java.lang.NullPointerException: Value returned by gauge
> lastCheckpointExternalPath was null*.
>
> I updated `CheckpointStatsTracker` to ignore the external path when it is
> null, and this exception didn't happen again. The Prometheus reporter
> now works as well.
>
> I have created a Jira issue for it:
> https://issues.apache.org/jira/browse/FLINK-7675, and I will submit
> the PR after Travis CI passes for my repository.
>
> Best Regards,
> Tony Wei

> Tony Wei <mailto:tony19920...@gmail.com>
> 22 September 2017 at 16:20
>
> Hi Chesnay,
>
> I didn't try it in 1.4, so I have no idea whether this also occurs in 1.4.
> As for my logging setup, it was already set to INFO level, but
> there wasn't any error or warning in the log file either.
>
> Best Regards,
> Tony Wei

> Chesnay Schepler <mailto:ches...@apache.org>
> 22 September 2017 at 16:07
>
> The Prometheus reporter should work with 1.3.2.
>
> Does this also occur with the reporter that currently exists in 1.4?
> (To rule out new bugs from the PR.)
>
> To investigate this further, please set the logging level to WARN and
> try again, as all errors in the metric system are logged at that level.
>
> On 22.09.2017 10:33, Tony Wei wrote:

> Tony Wei <mailto:tony19920...@gmail.com>
> 22 September 2017 at 10:33
>
> Hi,
>
> I have built the Prometheus reporter package from this PR,
> https://github.com/apache/flink/pull/4586, and used it on Flink
> 1.3.2 to record all default metrics and those from `FlinkKafkaConsumer`.
>
> Originally, everything was fine. I could get those metrics for the TM from
> Prometheus just as I saw them in the Flink Web UI.
>
> However, when I turned to the JM, I found that Prometheus gave me this
> error: Get http://localhost:9249/metrics: EOF.
> I checked the log on the JM and saw nothing in it. There was no error
> message and port 9249 was still alive.
>
> To figure out what happened, I created another cluster and found that
> Prometheus could connect to the Flink cluster if there was no running job.
> After the JM triggered or completed the first checkpoint, Prometheus
> started getting ERR_EMPTY_RESPONSE from the JM, but not from the TM. There
> was still no error in the log file and port 9249 was still alive.
>
> I was wondering where the error occurred: in Flink or in the Prometheus
> reporter? Or is it incorrect to use the Prometheus reporter on Flink 1.3.2?
> Thank you.
>
> Best Regards,
> Tony Wei
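
For readers following the thread, here is a minimal Java sketch of the failure mode Tony describes: a gauge that returns null, a reporting step that rejects null values, and a guard that substitutes a fallback. It is not the actual patch for FLINK-7675. The `Gauge` interface below is a simplified stand-in rather than Flink's own class, and `nullSafe`, `render`, and the `"n/a"` fallback are hypothetical names chosen for illustration.

```java
public class NullSafeGaugeSketch {

    /** Simplified stand-in for a metrics gauge (assumption, not Flink's interface). */
    interface Gauge<T> {
        T getValue();
    }

    /** Wraps a gauge so that a null value is replaced by a fallback (hypothetical helper). */
    static <T> Gauge<T> nullSafe(Gauge<T> delegate, T fallback) {
        return () -> {
            T value = delegate.getValue();
            return value != null ? value : fallback;
        };
    }

    /** Hypothetical reporting step that rejects null gauge values, similar to the exception in the thread. */
    static String render(String name, Gauge<?> gauge) {
        Object value = gauge.getValue();
        if (value == null) {
            throw new NullPointerException("Value returned by gauge " + name + " was null");
        }
        return name + " " + value;
    }

    public static void main(String[] args) {
        // No externalized checkpoint exists yet, so the raw gauge yields null.
        Gauge<String> externalPath = () -> null;
        // The guarded gauge reports the fallback instead of failing.
        System.out.println(render("lastCheckpointExternalPath", nullSafe(externalPath, "n/a")));
    }
}
```

Passing the unguarded `externalPath` gauge to `render` throws the same kind of NullPointerException quoted above, while the wrapped gauge is reported normally. Whether to substitute a placeholder or skip the metric entirely is a design choice this sketch does not settle.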