The granularity setting isn't relevant here: it only matters when latency metrics are enabled, and those are opt-in while the default config is being used.

Selectively enabling/disabling specific metrics is only possible in the upcoming 1.16.0.
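For illustration only, per-reporter filtering could end up looking roughly like the sketch below in flink-conf.yaml; double-check the option keys and pattern syntax against the 1.16 docs once they are published, and treat this as a sketch rather than a recipe:

    # hypothetical sketch of 1.16-style per-reporter metric filters
    metrics.reporter.prom.filter.includes: <patterns for metrics to keep>
    metrics.reporter.prom.filter.excludes: <patterns for metrics to drop>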

@Yuriy: You said you had 270k Strings in the StreamConfig; is that accurate? How many StreamConfig instances are there, anyhow? I'm asking since that is a strange number to have. I wouldn't conclude that metrics are the problem; it could just be that you're already running close to the memory budget limit, and the additional memory required by metrics ever so slightly pushes you over it.

On 14/08/2022 10:41, yu'an huang wrote:
You can follow the ticket https://issues.apache.org/jira/browse/FLINK-10243, as mentioned in that Stack Overflow question, to set this parameter:

"metrics.latency.granularity": https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#metrics-latency-granularity
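
For example, in flink-conf.yaml (or in the FLINK_PROPERTIES block of your docker-compose.yml) it would be something like the sketch below. Note that this setting only has an effect if latency tracking is enabled via metrics.latency.interval in the first place:

    # sketch: coarsest latency granularity, producing the fewest latency histograms
    metrics.latency.granularity: single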


You only have 1.688gb for your TaskManager. I also suggest you increase the memory configuration; otherwise the test may still fail.
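
For example (just a sketch, the 4g value is illustrative and depends on what your host can spare), the taskmanager service in your docker-compose.yml could get an explicit process size:

    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 2
        # illustrative value: raise the total TaskManager process memory
        taskmanager.memory.process.size: 4g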




On 12 Aug 2022, at 10:52 PM, Yuriy Kutlunin <yuriy.kutlu...@glowbyteconsulting.com> wrote:

Hello Yuan,

I don't override any default settings; here is my docker-compose.yml:
services:
  jobmanager:
    image: flink:1.15.1-java11
    ports:
      - "8081:8081"
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager

  taskmanager:
    image: flink:1.15.1-java11
    depends_on:
      - jobmanager
    command: taskmanager
    ports:
      - "8084:8084"
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 2
        metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
        env.java.opts: -XX:+HeapDumpOnOutOfMemoryError
From the TaskManager log:
INFO  [] - Final TaskExecutor Memory configuration:
INFO  [] -   Total Process Memory:          1.688gb (1811939328 bytes)
INFO  [] -     Total Flink Memory:          1.250gb (1342177280 bytes)
INFO  [] -       Total JVM Heap Memory:     512.000mb (536870902 bytes)
INFO  [] -         Framework:               128.000mb (134217728 bytes)
INFO  [] -         Task:                    384.000mb (402653174 bytes)
INFO  [] -       Total Off-heap Memory:     768.000mb (805306378 bytes)
INFO  [] -         Managed:                 512.000mb (536870920 bytes)
INFO  [] -         Total JVM Direct Memory: 256.000mb (268435458 bytes)
INFO  [] -           Framework:             128.000mb (134217728 bytes)
INFO  [] -           Task:                  0 bytes
INFO  [] -           Network:               128.000mb (134217730 bytes)
INFO  [] -     JVM Metaspace:               256.000mb (268435456 bytes)
INFO  [] -     JVM Overhead:                192.000mb (201326592 bytes)

I would prefer not to configure memory (at this point), because memory consumption depends on the job structure, so it can always exceed whatever values are configured.

My next guess is that the problem is not the metrics' content but their number, which grows with the number of operators. So the next question is whether there is a way to exclude metric generation at the operator level.
I found the same question, without an accepted answer, on Stack Overflow:
https://stackoverflow.com/questions/54215245/apache-flink-limit-the-amount-of-metrics-exposed
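
The only related knob I have found so far is the metric scope format, e.g. something like the lines below in flink-conf.yaml. If I read the docs correctly, this would only replace the long chained names inside each metric identifier with IDs; it would not reduce the number of metrics, so I am not sure it helps with the heap at all (untested on my side):

    # untested idea: use IDs instead of the long chained names in metric identifiers
    metrics.scope.task: <host>.taskmanager.<tm_id>.<job_id>.<task_id>.<subtask_index>
    metrics.scope.operator: <host>.taskmanager.<tm_id>.<job_id>.<operator_id>.<subtask_index>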

On Fri, Aug 12, 2022 at 4:05 AM yu'an huang <h.yuan...@gmail.com> wrote:
Hi Yuriy,

How do you set your TaskManager memory? I think 40MB is not significantly high for Flink. It's also normal to see memory increase if you have more parallelism or turn more metrics on. You can try setting a larger memory size for Flink as explained in the following documentation.

https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/mem_setup/

Best
Yuan



On 12 Aug 2022, at 12:51 AM, Yuriy Kutlunin <yuriy.kutlu...@glowbyteconsulting.com> wrote:

Hi all,

I'm running a Flink Cluster in Session Mode via docker-compose as described in the docs:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/docker/#session-cluster-yml

After submitting a test job with many intermediate SQL operations (~500 select * from ...) and metrics turned on (JMX or Prometheus), I got an OOM (Java heap space) during the initialization stage.

Turning metrics off allows the job to reach the Running state.
Heap consumption also depends on parallelism: the same job succeeds when submitted with parallelism 1 instead of 2.

TaskManager logs for 4 cases are attached:
JMX parallelism 1 (succeeded)
JMX parallelism 2 (failed)
Prometheus parallelism 2 (failed)
No metrics parallelism 2 (succeeded)

The post-OOM heap dump (JMX, parallelism 2) shows 2 main consumption points:
1. A big value (40MB) for some task configuration
2. Many instances (~270k) of some heavy (20KB) value in StreamConfig

It seems like all these heavy values are related to the weird task names, which include all the operations: Received task Source: source1 -> SourceConversion[2001] -> mapping1 -> SourceConversion[2003] -> mapping2 -> SourceConversion[2005] -> ... -> mapping500 -> Sink: sink1 (1/1)#0 (1e089cf3b1581ea7c8fb1cd7b159e66b)

Looking for some way to overcome this heap issue.

--
Best regards,
Yuriy Kutlunin
<many_operators_parallelism_1_with_jmx.txt><many_operators_parallelism_2_with_jmx.txt><many_operators_parallelism_2_no_jmx.txt><many_operators_parallelism_2_with_prom.txt><heap_total.png><heap_task2_conf.png><heap_many_string_instances.png><heap_task1_conf.png>



--
Best regards,
Yuriy Kutlunin
