The granularity setting isn't relevant here: it only matters when latency metrics are enabled, and those are opt-in while the default config is being used.

Selectively enabling/disabling specific metrics is only possible in the upcoming 1.16.0.
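For illustration only, per-reporter filtering could end up looking roughly like the sketch below in flink-conf.yaml; double-check the option keys and pattern syntax against the 1.16 docs once they are published, and treat this as a sketch rather than a recipe:

    # hypothetical sketch of 1.16-style per-reporter metric filters
    metrics.reporter.prom.filter.includes: <patterns for metrics to keep>
    metrics.reporter.prom.filter.excludes: <patterns for metrics to drop>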

@Yuriy: You said you had 270k Strings in the StreamConfig; is that accurate? How many StreamConfig instances are there, anyhow? I'm asking since that is a strange number to have. I wouldn't conclude that metrics are the problem; it could just be that you're already running close to the memory budget limit, and the additional memory required by metrics ever so slightly pushes you over it.

On 14/08/2022 10:41, yu'an huang wrote:
You can follow the ticket https://issues.apache.org/jira/browse/FLINK-10243, as mentioned in that Stack Overflow question, to set this parameter:

"metrics.latency.granularity": https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#metrics-latency-granularity
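
For example, in flink-conf.yaml (or in the FLINK_PROPERTIES block of your docker-compose.yml) it would be something like the sketch below. Note that this setting only has an effect if latency tracking is enabled via metrics.latency.interval in the first place:

    # sketch: coarsest latency granularity, producing the fewest latency histograms
    metrics.latency.granularity: single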


You only have 1.688gb for your TaskManager. I also suggest you increase the memory configuration; otherwise the test may still fail.
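
For example (just a sketch, the 4g value is illustrative and depends on what your host can spare), the taskmanager service in your docker-compose.yml could get an explicit process size:

    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 2
        # illustrative value: raise the total TaskManager process memory
        taskmanager.memory.process.size: 4g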




On 12 Aug 2022, at 10:52 PM, Yuriy Kutlunin <yuriy.kutlu...@glowbyteconsulting.com> wrote:

Hello Yuan,

I don't override any default settings; here is my docker-compose.yml:
services:
  jobmanager:
    image: flink:1.15.1-java11
    ports:
      - "8081:8081"
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager

  taskmanager:
    image: flink:1.15.1-java11
    depends_on:
      - jobmanager
    command: taskmanager
    ports:
      - "8084:8084"
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 2
        metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
        env.java.opts: -XX:+HeapDumpOnOutOfMemoryError
From the TaskManager log:
INFO  [] - Final TaskExecutor Memory configuration:
INFO  [] -   Total Process Memory:          1.688gb (1811939328 bytes)
INFO  [] -     Total Flink Memory:          1.250gb (1342177280 bytes)
INFO  [] -       Total JVM Heap Memory:     512.000mb (536870902 bytes)
INFO  [] -         Framework:               128.000mb (134217728 bytes)
INFO  [] -         Task:                    384.000mb (402653174 bytes)
INFO  [] -       Total Off-heap Memory:     768.000mb (805306378 bytes)
INFO  [] -         Managed:                 512.000mb (536870920 bytes)
INFO  [] -         Total JVM Direct Memory: 256.000mb (268435458 bytes)
INFO  [] -           Framework:             128.000mb (134217728 bytes)
INFO  [] -           Task:                  0 bytes
INFO  [] -           Network:               128.000mb (134217730 bytes)
INFO  [] -     JVM Metaspace:               256.000mb (268435456 bytes)
INFO  [] -     JVM Overhead:                192.000mb (201326592 bytes)

I would prefer not to configure memory (at this point), because memory consumption depends on the job structure, so it can always exceed whatever values are configured.

My next guess is that the problem is not the metrics' content but their number, which grows with the number of operators. So the next question is whether there is a way to exclude metric generation at the operator level.
I found the same question, without an accepted answer, on Stack Overflow:
https://stackoverflow.com/questions/54215245/apache-flink-limit-the-amount-of-metrics-exposed
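
The only related knob I have found so far is the metric scope format, e.g. something like the lines below in flink-conf.yaml. If I read the docs correctly, this would only replace the long chained names inside each metric identifier with IDs; it would not reduce the number of metrics, so I am not sure it helps with the heap at all (untested on my side):

    # untested idea: use IDs instead of the long chained names in metric identifiers
    metrics.scope.task: <host>.taskmanager.<tm_id>.<job_id>.<task_id>.<subtask_index>
    metrics.scope.operator: <host>.taskmanager.<tm_id>.<job_id>.<operator_id>.<subtask_index>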

On Fri, Aug 12, 2022 at 4:05 AM yu'an huang <h.yuan...@gmail.com> wrote:
Hi Yuriy,

How do you set your TaskManager memory? I think 40MB is not significantly high for Flink. It's also normal to see memory increase if you have more parallelism or turn more metrics on. You can try setting a larger memory size for Flink as explained in the following documentation.

https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/mem_setup/

Best
Yuan



On 12 Aug 2022, at 12:51 AM, Yuriy Kutlunin <yuriy.kutlu...@glowbyteconsulting.com> wrote:

Hi all,

I'm running a Flink Cluster in Session Mode via docker-compose as described in the docs:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/docker/#session-cluster-yml

After submitting a test job with many intermediate SQL operations (~500 select * from ...) and metrics turned on (JMX or Prometheus), I got an OOM (Java heap space) during the initialization stage.

Turning metrics off allows the job to reach the Running state.
Heap consumption also depends on parallelism: the same job succeeds when submitted with parallelism 1 instead of 2.

TaskManager logs for 4 cases are attached:
JMX parallelism 1 (succeeded)
JMX parallelism 2 (failed)
Prometheus parallelism 2 (failed)
No metrics parallelism 2 (succeeded)

The post-OOM heap dump (JMX, parallelism 2) shows 2 main consumption points:
1. A big value (40MB) for some task configuration
2. Many instances (~270k) of some heavy (20KB) value in StreamConfig

It seems like all these heavy values are related to the weird task names, which include all the operations: Received task Source: source1 -> SourceConversion[2001] -> mapping1 -> SourceConversion[2003] -> mapping2 -> SourceConversion[2005] -> ... -> mapping500 -> Sink: sink1 (1/1)#0 (1e089cf3b1581ea7c8fb1cd7b159e66b)

Looking for some way to overcome this heap issue.

--
Best regards,
Yuriy Kutlunin
<many_operators_parallelism_1_with_jmx.txt><many_operators_parallelism_2_with_jmx.txt><many_operators_parallelism_2_no_jmx.txt><many_operators_parallelism_2_with_prom.txt><heap_total.png><heap_task2_conf.png><heap_many_string_instances.png><heap_task1_conf.png>



--
Best regards,
Yuriy Kutlunin
