Hi Chesnay,

Thanks for the support! Just for confirmation: running the "problem job" locally as a test in IntelliJ (as Chesnay suggested above) reproduces the described problem:
➜ ~ curl localhost:9200
curl: (52) Empty reply from server

Doing the same with other jobs, metrics are available on localhost:9200. One other thing I noticed yesterday in the cluster: job/task-specific metrics are available for a very short time after the job is started (around a few seconds). E.g.:

# HELP flink_taskmanager_job_task_backPressuredTimeMsPerSecond backPressuredTimeMsPerSecond (scope: taskmanager_job_task)

After all tasks are "green" in the web UI, the "empty reply from server" is back.

1) I changed the Prometheus config in my cluster, but as you said, it does not have any impact.

2) For the logging in a test scenario, I also had to add the following lines in my test class:

SLF4JBridgeHandler.removeHandlersForRootLogger()
SLF4JBridgeHandler.install()

(source: https://www.slf4j.org/api/org/slf4j/bridge/SLF4JBridgeHandler.html)

As well as resetting log levels for JUL in my logback.xml:

<contextListener class="ch.qos.logback.classic.jul.LevelChangePropagator">
  <resetJUL>true</resetJUL>
</contextListener>

This info is just for completeness, in case someone else stumbles upon it.

I set the following loggers to level TRACE:

<logger name="com.sun.net.httpserver" level="TRACE" additivity="false">
  <appender-ref ref="ASYNC_FILE" />
</logger>
<logger name="org.apache.flink.metrics.prometheus" level="TRACE" additivity="false">
  <appender-ref ref="ASYNC_FILE" />
</logger>
<logger name="io.prometheus.client" level="TRACE" additivity="false">
  <appender-ref ref="ASYNC_FILE" />
</logger>

When running the job in a local test as suggested above, I get the following log messages:

12701 INFO [ScalaTest-run] com.sun.net.httpserver - HttpServer created http 0.0.0.0/0.0.0.0:9200
12703 INFO [ScalaTest-run] com.sun.net.httpserver - context created: /
12703 INFO [ScalaTest-run] com.sun.net.httpserver - context created: /metrics
12704 INFO [ScalaTest-run] o.a.f.m.p.PrometheusReporter - Started PrometheusReporter HTTP server on port 9200.
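As a side note, for anyone who wants to see what the handler reset does without pulling in the bridge: the effect of removing the JUL root handlers and lowering the root level can be mimicked with plain java.util.logging alone. This is just an illustration, not from the bridge's source; the class and method names below are made up for the sketch:

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

public class JulBridgeSetup {

    // Roughly what SLF4JBridgeHandler.removeHandlersForRootLogger() achieves:
    // drop the default handlers (typically one ConsoleHandler) from the JUL
    // root logger, so another handler (e.g. the SLF4J bridge) can take over.
    static void resetRootHandlers() {
        Logger root = LogManager.getLogManager().getLogger("");
        for (Handler h : root.getHandlers()) {
            root.removeHandler(h);
        }
        // Lower the root level so FINEST records (TRACE-equivalent) pass through,
        // matching the setLevel(Level.FINEST) call from the test snippet.
        root.setLevel(Level.FINEST);
    }

    public static void main(String[] args) {
        resetRootHandlers();
        Logger root = LogManager.getLogManager().getLogger("");
        System.out.println("handlers=" + root.getHandlers().length
                + " level=" + root.getLevel());
    }
}
```

Without the level change, com.sun.net.httpserver logs at FINE/FINER are dropped at the root logger before any handler sees them, which is why the TRACE output only shows up once both steps are done.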
3) I have not tried to reproduce it in a local cluster yet, as the issue is also reproducible in the test environment. But thanks for the hint - it could be very helpful!

From the observations it does not seem like there is a problem with the HTTP server itself. I am just making assumptions: it feels like there is a problem with reading and providing the metrics. As the issue is reproducible in the local setup, I now have the comfy option of debugging in IntelliJ - I'll spend my day on this if no other hints or ideas arise.

Thanks & Best,
Peter

On Tue, May 3, 2022 at 4:01 PM Chesnay Schepler <ches...@apache.org> wrote:
>
> > I noticed that my config of the PrometheusReporter is different here. I
> > have: `metrics.reporter.prom.class:
> > org.apache.flink.metrics.prometheus.PrometheusReporter`. I will investigate
> > if this is a problem.
>
> That's not a problem.
>
> > Which trace logs are interesting?
>
> The logging config I provided should highlight the relevant bits
> (com.sun.net.httpserver).
> At least in my local tests this is where any interesting things were
> logged.
> Note that this part of the code uses java.util.logging, not slf4j/log4j.
>
> > When running a local flink (start-cluster.sh), I do not have a certain
> > url/port to access the taskmanager, right?
>
> If you configure a port range it should be as simple as curl
> localhost:<port>.
> You can find the used port in the taskmanager logs.
> Or just try the first N ports in the range ;)
>
> On 03/05/2022 14:11, Peter Schrott wrote:
> Hi Chesnay,
>
> Thanks for the code snippet. Which trace logs are interesting? Those of
> "org.apache.flink.metrics.prometheus.PrometheusReporter"?
> I could also add these logger settings in the environment where the problem
> is present.
>
> Other than that, I am not sure how to reproduce this issue in a local
> setup. In the cluster where the metrics are missing I am navigating to the
> taskmanager in question and trying to access the metrics via the configured
> prometheus port.
> When running a local flink (start-cluster.sh), I do not
> have a certain url/port to access the taskmanager, right?
>
> I noticed that my config of the PrometheusReporter is different here. I
> have: `metrics.reporter.prom.class:
> org.apache.flink.metrics.prometheus.PrometheusReporter`. I will investigate
> if this is a problem.
>
> Unfortunately I cannot provide my job at the moment. It
> contains business logic and is tightly coupled with our Kafka systems. I
> will check the option of creating a sample job to reproduce the problem.
>
> Best, Peter
>
> On Tue, May 3, 2022 at 12:48 PM Chesnay Schepler <ches...@apache.org>
> wrote:
>
>> You'd help me out greatly if you could provide me with a sample job that
>> runs into the issue.
>>
>> So far I wasn't able to reproduce the issue,
>> but it should be clear that there is one, given 3 separate reports,
>> although it is strange that so far it was only reported for Prometheus.
>>
>> If one of you is able to reproduce the issue within a test and is feeling
>> adventurous,
>> you might be able to get more information by forwarding java.util.logging
>> to SLF4J. Below is some code to get you started.
>>
>> DebuggingTest.java:
>>
>> class DebuggingTest {
>>
>>     static {
>>         LogManager.getLogManager().getLogger("").setLevel(Level.FINEST);
>>         SLF4JBridgeHandler.removeHandlersForRootLogger();
>>         SLF4JBridgeHandler.install();
>>         miniClusterExtension =
>>                 new MiniClusterExtension(
>>                         new MiniClusterResourceConfiguration.Builder()
>>                                 .setConfiguration(getConfiguration())
>>                                 .setNumberSlotsPerTaskManager(1)
>>                                 .build());
>>     }
>>
>>     @RegisterExtension
>>     private static final MiniClusterExtension miniClusterExtension;
>>
>>     private static Configuration getConfiguration() {
>>         final Configuration configuration = new Configuration();
>>
>>         configuration.setString(
>>                 "metrics.reporter.prom.factory.class",
>>                 PrometheusReporterFactory.class.getName());
>>         configuration.setString("metrics.reporter.prom.port", "9200-9300");
>>
>>         return configuration;
>>     }
>>
>>     @Test
>>     void runJob() throws Exception {
>>         <run job>
>>     }
>> }
>>
>> pom.xml:
>>
>> <dependency>
>>     <groupId>org.slf4j</groupId>
>>     <artifactId>jul-to-slf4j</artifactId>
>>     <version>1.7.32</version>
>> </dependency>
>>
>> log4j2-test.properties:
>>
>> rootLogger.level = off
>> rootLogger.appenderRef.test.ref = TestLogger
>> logger.http.name = com.sun.net.httpserver
>> logger.http.level = trace
>> appender.testlogger.name = TestLogger
>> appender.testlogger.type = CONSOLE
>> appender.testlogger.target = SYSTEM_ERR
>> appender.testlogger.layout.type = PatternLayout
>> appender.testlogger.layout.pattern = %-4r [%t] %-5p %c %x - %m%n
>>
>> On 03/05/2022 10:41, ChangZhuo Chen (陳昌倬) wrote:
>>
>> On Tue, May 03, 2022 at 10:32:03AM +0200, Peter Schrott wrote:
>>
>> Hi!
>>
>> I also discovered problems with the PrometheusReporter on Flink 1.15.0,
>> coming from 1.14.4. I already consulted the mailing list:
>> https://lists.apache.org/thread/m8ohrfkrq1tqgq7lowr9p226z3yc0fgc
>> I have not found the underlying problem or a solution to it.
>>
>> Actually, after re-checking, I see the same log WARNINGS as
>> ChangZhuo described.
>>
>> As I described, it seems to be an issue with my job. If no job, or an
>> example job, runs on the taskmanager, the basic metrics work just fine.
>> Maybe ChangZhuo can confirm this?
>>
>> @ChangZhuo what's your job setup? I am running a streaming SQL job, but
>> also using the data streams API to create the streaming environment, from
>> that the table environment, and finally a StatementSet to execute
>> multiple SQL statements in one job.
>>
>> We are running a streaming application with the low level API, with the
>> Kubernetes operator FlinkDeployment.
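An aside on Chesnay's tip above to "just try the first N ports in the range": that scan can also be done programmatically rather than with repeated curl calls. A stand-alone sketch (class name, timeout, and range handling are illustrative, not from the thread) that finds the first answering port in the configured reporter range:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.OptionalInt;

public class PortProbe {

    // Try a plain TCP connect to each port in [from, to]; return the first
    // port that accepts a connection. A refused or timed-out connect means
    // nothing is listening there, so we keep scanning.
    static OptionalInt findOpenPort(String host, int from, int to, int timeoutMs) {
        for (int port = from; port <= to; port++) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), timeoutMs);
                return OptionalInt.of(port);
            } catch (IOException ignored) {
                // closed or filtered; try the next port
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        // 9200-9300 matches the metrics.reporter.prom.port range from the thread.
        OptionalInt port = findOpenPort("localhost", 9200, 9300, 200);
        System.out.println(port.isPresent()
                ? "reporter likely listening on port " + port.getAsInt()
                : "no open port in range");
    }
}
```

Note that a successful TCP connect only shows a server is listening; the "empty reply from server" symptom in this thread happened on a port that did accept connections, so the curl check against `/metrics` is still needed afterwards.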