> I noticed that my config of the PrometheusReporter is different here. I have: `metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter`. I will investigate if this is a problem.

That's not a problem.

> Which trace logs are interesting?

The logging config I provided should highlight the relevant bits (com.sun.net.httpserver). At least in my local tests, this is where the interesting things were logged.
Note that this part of the code uses java.util.logging, not slf4j/log4j.
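Since that part of the stack logs via java.util.logging, its level has to be raised through JUL rather than through log4j. A minimal sketch (the class and method names here are just illustrative):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class JulLevelDemo {
    // Raise the JUL level for the JDK HTTP server package so that
    // FINE/FINEST records are emitted at all (a handler is still
    // needed to actually see them somewhere).
    static Logger enableHttpServerTrace() {
        Logger logger = Logger.getLogger("com.sun.net.httpserver");
        logger.setLevel(Level.FINEST);
        return logger;
    }

    public static void main(String[] args) {
        Logger logger = enableHttpServerTrace();
        System.out.println(logger.getLevel()); // FINEST
    }
}
```

One caveat: keep a reference to the returned logger. The LogManager holds loggers weakly, so a logger that is configured and then dropped can be garbage collected together with its level setting.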

> When running a local flink (start-cluster.sh), I do not have a certain url/port to access the taskmanager, right?

If you configure a port range, it should be as simple as curl localhost:<port>.
You can find the used port in the taskmanager logs.
Or just try the first N ports in the range ;)
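Trying the first N ports can also be scripted; a hedged sketch of probing a port range for one that accepts connections (the range 9250-9260 in main is just an example, use whatever you configured):

```java
import java.io.IOException;
import java.net.Socket;

public class PortProbe {
    // Returns the first port in [from, to] that accepts a TCP
    // connection, or -1 if none does. The open port is the one to
    // hit with `curl localhost:<port>` to read the reporter output.
    static int firstOpenPort(String host, int from, int to) {
        for (int port = from; port <= to; port++) {
            try (Socket ignored = new Socket(host, port)) {
                return port;
            } catch (IOException e) {
                // port closed or not listening; try the next one
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstOpenPort("localhost", 9250, 9260));
    }
}
```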

On 03/05/2022 14:11, Peter Schrott wrote:
Hi Chesnay,

Thanks for the code snippet. Which trace logs are interesting? Those of "org.apache.flink.metrics.prometheus.PrometheusReporter"? I could also add these logger settings in the environment where the problem is present.

Other than that, I am not sure how to reproduce this issue in a local setup. In the cluster where the metrics are missing, I navigate to the affected taskmanager and try to access the metrics via the configured Prometheus port. When running a local Flink (start-cluster.sh), I do not have a certain url/port to access the taskmanager, right?

I noticed that my config of the PrometheusReporter is different here. I have: `metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter`. I will investigate if this is a problem.

Unfortunately I cannot provide my job at the moment. It contains business logic and is tightly coupled with our Kafka systems. I will check the option of creating a sample job to reproduce the problem.

Best, Peter

On Tue, May 3, 2022 at 12:48 PM Chesnay Schepler <ches...@apache.org> wrote:

    You'd help me out greatly if you could provide me with a sample
    job that runs into the issue.

    So far I wasn't able to reproduce the issue, but it should be
    clear that there is some issue, given 3 separate reports,
    although it is strange that so far it has only been reported for
    Prometheus.

    If one of you is able to reproduce the issue within a test and
    is feeling adventurous, then you might be able to get more
    information by forwarding the java.util.logging output to SLF4J.
    Below is some code to get you started.

    DebuggingTest.java:

    // Imports assumed for this snippet (JUnit 5, flink-test-utils,
    // jul-to-slf4j); exact packages may vary with your Flink version.
    import java.util.logging.Level;
    import java.util.logging.LogManager;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.metrics.prometheus.PrometheusReporterFactory;
    import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration;
    import org.apache.flink.test.junit5.MiniClusterExtension;

    import org.junit.jupiter.api.Test;
    import org.junit.jupiter.api.extension.RegisterExtension;
    import org.slf4j.bridge.SLF4JBridgeHandler;

    class DebuggingTest {

        @RegisterExtension private static final MiniClusterExtension miniClusterExtension;

        static {
            // Route java.util.logging through SLF4J before the MiniCluster starts.
            LogManager.getLogManager().getLogger("").setLevel(Level.FINEST);
            SLF4JBridgeHandler.removeHandlersForRootLogger();
            SLF4JBridgeHandler.install();
            miniClusterExtension =
                    new MiniClusterExtension(
                            new MiniClusterResourceConfiguration.Builder()
                                    .setConfiguration(getConfiguration())
                                    .setNumberSlotsPerTaskManager(1)
                                    .build());
        }

        private static Configuration getConfiguration() {
            final Configuration configuration = new Configuration();

            configuration.setString(
                    "metrics.reporter.prom.factory.class",
                    PrometheusReporterFactory.class.getName());
            configuration.setString("metrics.reporter.prom.port", "9200-9300");

            return configuration;
        }

        @Test
        void runJob() throws Exception {
            <run job>
        }
    }


    pom.xml:

    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>jul-to-slf4j</artifactId>
        <version>1.7.32</version>
    </dependency>
    log4j2-test.properties:

    rootLogger.level = off
    rootLogger.appenderRef.test.ref = TestLogger

    logger.http.name = com.sun.net.httpserver
    logger.http.level = trace

    appender.testlogger.name = TestLogger
    appender.testlogger.type = CONSOLE
    appender.testlogger.target = SYSTEM_ERR
    appender.testlogger.layout.type = PatternLayout
    appender.testlogger.layout.pattern = %-4r [%t] %-5p %c %x - %m%n

    On 03/05/2022 10:41, ChangZhuo Chen (陳昌倬) wrote:
    On Tue, May 03, 2022 at 10:32:03AM +0200, Peter Schrott wrote:
    Hi!

    I also discovered problems with the PrometheusReporter on Flink 1.15.0,
    coming from 1.14.4. I already consulted the mailing list:
    https://lists.apache.org/thread/m8ohrfkrq1tqgq7lowr9p226z3yc0fgc
    I have not found the underlying problem or a solution to it.

    Actually, after re-checking, I see the same log WARNINGs as
    ChangZhuo described.

    As I described, it seems to be an issue with my job. If no job, or only an
    example job, runs on the taskmanager, the basic metrics work just fine. Maybe
    ChangZhuo can confirm this?

    @ChangZhuo what's your job setup? I am running a streaming SQL job, but I am
    also using the DataStream API to create the streaming environment, from that
    the table environment, and finally a StatementSet to execute multiple SQL
    statements in one job.

    We are running a streaming application with the low-level API with the
    Kubernetes operator FlinkDeployment.


