> I noticed that my config of the PrometheusReporter is different here. I have: `metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter`. I will investigate if this is a problem.

That's not a problem.

> Which trace logs are interesting?

The logging config I provided should highlight the relevant bits (com.sun.net.httpserver). At least in my local tests, this is where the interesting things were logged.
Note that this part of the code uses java.util.logging, not slf4j/log4j.
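Since that part of the stack logs via java.util.logging, its level has to be raised through JUL rather than through log4j. A minimal sketch (the class and method names here are just illustrative):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class JulLevelDemo {
    // Raise the JUL level for the JDK HTTP server package so that
    // FINE/FINEST records are emitted at all (a handler is still
    // needed to actually see them somewhere).
    static Logger enableHttpServerTrace() {
        Logger logger = Logger.getLogger("com.sun.net.httpserver");
        logger.setLevel(Level.FINEST);
        return logger;
    }

    public static void main(String[] args) {
        Logger logger = enableHttpServerTrace();
        System.out.println(logger.getLevel()); // FINEST
    }
}
```

One caveat: keep a reference to the returned logger. The LogManager holds loggers weakly, so a logger that is configured and then dropped can be garbage collected together with its level setting.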

> When running a local flink (start-cluster.sh), I do not have a certain url/port to access the taskmanager, right?

If you configure a port range, it should be as simple as curl localhost:<port>.
You can find the used port in the taskmanager logs.
Or just try the first N ports in the range ;)
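Trying the first N ports can also be scripted; a hedged sketch of probing a port range for one that accepts connections (the range 9250-9260 in main is just an example, use whatever you configured):

```java
import java.io.IOException;
import java.net.Socket;

public class PortProbe {
    // Returns the first port in [from, to] that accepts a TCP
    // connection, or -1 if none does. The open port is the one to
    // hit with `curl localhost:<port>` to read the reporter output.
    static int firstOpenPort(String host, int from, int to) {
        for (int port = from; port <= to; port++) {
            try (Socket ignored = new Socket(host, port)) {
                return port;
            } catch (IOException e) {
                // port closed or not listening; try the next one
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstOpenPort("localhost", 9250, 9260));
    }
}
```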

On 03/05/2022 14:11, Peter Schrott wrote:
Hi Chesnay,

Thanks for the code snippet. Which trace logs are interesting? Those of "org.apache.flink.metrics.prometheus.PrometheusReporter"? I could also add these logger settings in the environment where the problem is present.

Other than that, I am not sure how to reproduce this issue in a local setup. In the cluster where the metrics are missing, I navigate to the affected taskmanager and try to access the metrics via the configured Prometheus port. When running a local Flink (start-cluster.sh), I do not have a certain url/port to access the taskmanager, right?

I noticed that my config of the PrometheusReporter is different here. I have: `metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter`. I will investigate if this is a problem.

Unfortunately I cannot provide my job at the moment. It contains business logic and is tightly coupled with our Kafka systems. I will check the option of creating a sample job to reproduce the problem.

Best, Peter

On Tue, May 3, 2022 at 12:48 PM Chesnay Schepler <ches...@apache.org> wrote:

    You'd help me out greatly if you could provide me with a sample
    job that runs into the issue.

    So far I wasn't able to reproduce the issue, but it should be
    clear that there is some issue, given 3 separate reports,
    although it is strange that so far it has only been reported for
    Prometheus.

    If one of you is able to reproduce the issue within a test and
    is feeling adventurous, then you might be able to get more
    information by forwarding the java.util.logging output to SLF4J.
    Below is some code to get you started.

    DebuggingTest.java:

    // Imports assumed for this snippet (JUnit 5, flink-test-utils,
    // jul-to-slf4j); exact packages may vary with your Flink version.
    import java.util.logging.Level;
    import java.util.logging.LogManager;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.metrics.prometheus.PrometheusReporterFactory;
    import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration;
    import org.apache.flink.test.junit5.MiniClusterExtension;

    import org.junit.jupiter.api.Test;
    import org.junit.jupiter.api.extension.RegisterExtension;
    import org.slf4j.bridge.SLF4JBridgeHandler;

    class DebuggingTest {

        @RegisterExtension private static final MiniClusterExtension miniClusterExtension;

        static {
            // Route java.util.logging through SLF4J before the MiniCluster starts.
            LogManager.getLogManager().getLogger("").setLevel(Level.FINEST);
            SLF4JBridgeHandler.removeHandlersForRootLogger();
            SLF4JBridgeHandler.install();
            miniClusterExtension =
                    new MiniClusterExtension(
                            new MiniClusterResourceConfiguration.Builder()
                                    .setConfiguration(getConfiguration())
                                    .setNumberSlotsPerTaskManager(1)
                                    .build());
        }

        private static Configuration getConfiguration() {
            final Configuration configuration = new Configuration();

            configuration.setString(
                    "metrics.reporter.prom.factory.class",
                    PrometheusReporterFactory.class.getName());
            configuration.setString("metrics.reporter.prom.port", "9200-9300");

            return configuration;
        }

        @Test
        void runJob() throws Exception {
            <run job>
        }
    }


    pom.xml:

    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>jul-to-slf4j</artifactId>
        <version>1.7.32</version>
    </dependency>
    log4j2-test.properties:

    rootLogger.level = off
    rootLogger.appenderRef.test.ref = TestLogger

    logger.http.name = com.sun.net.httpserver
    logger.http.level = trace

    appender.testlogger.name = TestLogger
    appender.testlogger.type = CONSOLE
    appender.testlogger.target = SYSTEM_ERR
    appender.testlogger.layout.type = PatternLayout
    appender.testlogger.layout.pattern = %-4r [%t] %-5p %c %x - %m%n

    On 03/05/2022 10:41, ChangZhuo Chen (陳昌倬) wrote:
    On Tue, May 03, 2022 at 10:32:03AM +0200, Peter Schrott wrote:
    Hi!

    I also discovered problems with the PrometheusReporter on Flink 1.15.0,
    coming from 1.14.4. I already consulted the mailing list:
    https://lists.apache.org/thread/m8ohrfkrq1tqgq7lowr9p226z3yc0fgc
    I have not found the underlying problem or a solution to it.

    Actually, after re-checking, I see the same log WARNINGs as
    ChangZhuo described.

    As I described, it seems to be an issue with my job. If no job, or only an
    example job, runs on the taskmanager, the basic metrics work just fine. Maybe
    ChangZhuo can confirm this?

    @ChangZhuo what's your job setup? I am running a streaming SQL job, but I am
    also using the DataStream API to create the streaming environment, from that
    the table environment, and finally a StatementSet to execute multiple SQL
    statements in one job.

    We are running a streaming application with the low-level API with the
    Kubernetes operator FlinkDeployment.


