Hi Chesnay,

Thanks for the support! Just for confirmation: running the "problem job" locally as a test in IntelliJ (as Chesnay suggested above) reproduces the described problem:
➜ ~ curl localhost:9200
curl: (52) Empty reply from server

Doing the same with other jobs, metrics are available on localhost:9200. One other thing I noticed yesterday in the cluster: job/task-specific metrics are available for a very short time after the job is started (around a few seconds). E.g.:

# HELP flink_taskmanager_job_task_backPressuredTimeMsPerSecond backPressuredTimeMsPerSecond (scope: taskmanager_job_task)

After all tasks are "green" in the web UI, the "empty reply from server" is back.

1) I changed the Prometheus config in my cluster, but as you said, it does not have any impact.

2) For the logging in a test scenario, I also had to add the following lines in my test class:

SLF4JBridgeHandler.removeHandlersForRootLogger()
SLF4JBridgeHandler.install()

(source: https://www.slf4j.org/api/org/slf4j/bridge/SLF4JBridgeHandler.html)

As well as resetting log levels for JUL in my logback.xml:

<contextListener class="ch.qos.logback.classic.jul.LevelChangePropagator">
  <resetJUL>true</resetJUL>
</contextListener>

This info is just for completeness, in case someone else stumbles upon it.

I set the following loggers to level TRACE:

<logger name="com.sun.net.httpserver" level="TRACE" additivity="false">
  <appender-ref ref="ASYNC_FILE" />
</logger>
<logger name="org.apache.flink.metrics.prometheus" level="TRACE" additivity="false">
  <appender-ref ref="ASYNC_FILE" />
</logger>
<logger name="io.prometheus.client" level="TRACE" additivity="false">
  <appender-ref ref="ASYNC_FILE" />
</logger>

When running the job in a local test as suggested above, I get the following log messages:

12701 INFO [ScalaTest-run] com.sun.net.httpserver - HttpServer created http 0.0.0.0/0.0.0.0:9200
12703 INFO [ScalaTest-run] com.sun.net.httpserver - context created: /
12703 INFO [ScalaTest-run] com.sun.net.httpserver - context created: /metrics
12704 INFO [ScalaTest-run] o.a.f.m.p.PrometheusReporter - Started PrometheusReporter HTTP server on port 9200.
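As a side note, for anyone who wants to see what the handler reset does without pulling in the bridge: the effect of removing the JUL root handlers and lowering the root level can be mimicked with plain java.util.logging alone. This is just an illustration, not from the bridge's source; the class and method names below are made up for the sketch:

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

public class JulBridgeSetup {

    // Roughly what SLF4JBridgeHandler.removeHandlersForRootLogger() achieves:
    // drop the default handlers (typically one ConsoleHandler) from the JUL
    // root logger, so another handler (e.g. the SLF4J bridge) can take over.
    static void resetRootHandlers() {
        Logger root = LogManager.getLogManager().getLogger("");
        for (Handler h : root.getHandlers()) {
            root.removeHandler(h);
        }
        // Lower the root level so FINEST records (TRACE-equivalent) pass through,
        // matching the setLevel(Level.FINEST) call from the test snippet.
        root.setLevel(Level.FINEST);
    }

    public static void main(String[] args) {
        resetRootHandlers();
        Logger root = LogManager.getLogManager().getLogger("");
        System.out.println("handlers=" + root.getHandlers().length
                + " level=" + root.getLevel());
    }
}
```

Without the level change, com.sun.net.httpserver logs at FINE/FINER are dropped at the root logger before any handler sees them, which is why the TRACE output only shows up once both steps are done.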
3) I have not tried to reproduce it in a local cluster yet, as the issue is also reproducible in the test environment. But thanks for the hint - it could be very helpful!

From the observations it does not seem like there is a problem with the HTTP server itself. I am just making assumptions: it feels like there is a problem with reading and providing the metrics. As the issue is reproducible in the local setup, I now have the comfy option of debugging in IntelliJ - I'll spend my day on this if no other hints or ideas arise.

Thanks & Best,
Peter

On Tue, May 3, 2022 at 4:01 PM Chesnay Schepler <ches...@apache.org> wrote:
>
> > I noticed that my config of the PrometheusReporter is different here. I
> > have: `metrics.reporter.prom.class:
> > org.apache.flink.metrics.prometheus.PrometheusReporter`. I will investigate
> > if this is a problem.
>
> That's not a problem.
>
> > Which trace logs are interesting?
>
> The logging config I provided should highlight the relevant bits
> (com.sun.net.httpserver).
> At least in my local tests this is where any interesting things were
> logged.
> Note that this part of the code uses java.util.logging, not slf4j/log4j.
>
> > When running a local flink (start-cluster.sh), I do not have a certain
> > url/port to access the taskmanager, right?
>
> If you configure a port range it should be as simple as curl
> localhost:<port>.
> You can find the used port in the taskmanager logs.
> Or just try the first N ports in the range ;)
>
> On 03/05/2022 14:11, Peter Schrott wrote:
> Hi Chesnay,
>
> Thanks for the code snippet. Which trace logs are interesting? Those of
> "org.apache.flink.metrics.prometheus.PrometheusReporter"?
> I could also add these logger settings in the environment where the problem
> is present.
>
> Other than that, I am not sure how to reproduce this issue in a local
> setup. In the cluster where the metrics are missing I am navigating to the
> taskmanager in question and trying to access the metrics via the configured
> prometheus port.
> When running a local flink (start-cluster.sh), I do not
> have a certain url/port to access the taskmanager, right?
>
> I noticed that my config of the PrometheusReporter is different here. I
> have: `metrics.reporter.prom.class:
> org.apache.flink.metrics.prometheus.PrometheusReporter`. I will investigate
> if this is a problem.
>
> Unfortunately I cannot provide my job at the moment. It
> contains business logic and is tightly coupled with our Kafka systems. I
> will check the option of creating a sample job to reproduce the problem.
>
> Best, Peter
>
> On Tue, May 3, 2022 at 12:48 PM Chesnay Schepler <ches...@apache.org>
> wrote:
>
>> You'd help me out greatly if you could provide me with a sample job that
>> runs into the issue.
>>
>> So far I wasn't able to reproduce the issue,
>> but it should be clear that there is one, given 3 separate reports,
>> although it is strange that so far it was only reported for Prometheus.
>>
>> If one of you is able to reproduce the issue within a test and is feeling
>> adventurous,
>> you might be able to get more information by forwarding java.util.logging
>> to SLF4J. Below is some code to get you started.
>>
>> DebuggingTest.java:
>>
>> class DebuggingTest {
>>
>>     static {
>>         LogManager.getLogManager().getLogger("").setLevel(Level.FINEST);
>>         SLF4JBridgeHandler.removeHandlersForRootLogger();
>>         SLF4JBridgeHandler.install();
>>         miniClusterExtension =
>>                 new MiniClusterExtension(
>>                         new MiniClusterResourceConfiguration.Builder()
>>                                 .setConfiguration(getConfiguration())
>>                                 .setNumberSlotsPerTaskManager(1)
>>                                 .build());
>>     }
>>
>>     @RegisterExtension
>>     private static final MiniClusterExtension miniClusterExtension;
>>
>>     private static Configuration getConfiguration() {
>>         final Configuration configuration = new Configuration();
>>
>>         configuration.setString(
>>                 "metrics.reporter.prom.factory.class",
>>                 PrometheusReporterFactory.class.getName());
>>         configuration.setString("metrics.reporter.prom.port", "9200-9300");
>>
>>         return configuration;
>>     }
>>
>>     @Test
>>     void runJob() throws Exception {
>>         <run job>
>>     }
>> }
>>
>> pom.xml:
>>
>> <dependency>
>>     <groupId>org.slf4j</groupId>
>>     <artifactId>jul-to-slf4j</artifactId>
>>     <version>1.7.32</version>
>> </dependency>
>>
>> log4j2-test.properties:
>>
>> rootLogger.level = off
>> rootLogger.appenderRef.test.ref = TestLogger
>> logger.http.name = com.sun.net.httpserver
>> logger.http.level = trace
>> appender.testlogger.name = TestLogger
>> appender.testlogger.type = CONSOLE
>> appender.testlogger.target = SYSTEM_ERR
>> appender.testlogger.layout.type = PatternLayout
>> appender.testlogger.layout.pattern = %-4r [%t] %-5p %c %x - %m%n
>>
>> On 03/05/2022 10:41, ChangZhuo Chen (陳昌倬) wrote:
>>
>> On Tue, May 03, 2022 at 10:32:03AM +0200, Peter Schrott wrote:
>>
>> Hi!
>>
>> I also discovered problems with the PrometheusReporter on Flink 1.15.0,
>> coming from 1.14.4. I already consulted the mailing list:
>> https://lists.apache.org/thread/m8ohrfkrq1tqgq7lowr9p226z3yc0fgc
>> I have not found the underlying problem or a solution to it.
>>
>> Actually, after re-checking, I see the same log WARNINGS as
>> ChangZhuo described.
>>
>> As I described, it seems to be an issue with my job. If no job, or an
>> example job, runs on the taskmanager, the basic metrics work just fine.
>> Maybe ChangZhuo can confirm this?
>>
>> @ChangZhuo what's your job setup? I am running a streaming SQL job, but
>> also using the data streams API to create the streaming environment, from
>> that the table environment, and finally a StatementSet to execute
>> multiple SQL statements in one job.
>>
>> We are running a streaming application with the low level API, with the
>> Kubernetes operator FlinkDeployment.
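An aside on Chesnay's tip above to "just try the first N ports in the range": that scan can also be done programmatically rather than with repeated curl calls. A stand-alone sketch (class name, timeout, and range handling are illustrative, not from the thread) that finds the first answering port in the configured reporter range:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.OptionalInt;

public class PortProbe {

    // Try a plain TCP connect to each port in [from, to]; return the first
    // port that accepts a connection. A refused or timed-out connect means
    // nothing is listening there, so we keep scanning.
    static OptionalInt findOpenPort(String host, int from, int to, int timeoutMs) {
        for (int port = from; port <= to; port++) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), timeoutMs);
                return OptionalInt.of(port);
            } catch (IOException ignored) {
                // closed or filtered; try the next port
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        // 9200-9300 matches the metrics.reporter.prom.port range from the thread.
        OptionalInt port = findOpenPort("localhost", 9200, 9300, 200);
        System.out.println(port.isPresent()
                ? "reporter likely listening on port " + port.getAsInt()
                : "no open port in range");
    }
}
```

Note that a successful TCP connect only shows a server is listening; the "empty reply from server" symptom in this thread happened on a port that did accept connections, so the curl check against `/metrics` is still needed afterwards.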