Hi,

The pushgateway collects metrics in push mode, so a single instance can run into performance problems under heavy load. A simple workaround is to deploy multiple pushgateways and push the metrics to different instances based on task groups.
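For example, assuming two pushgateway instances (the host names and jobName below are placeholders), the jobs of each task group could point their reporter at a different instance in flink-conf.yaml. This is only a rough sketch based on the Flink 1.13 reporter options; please check the option names against the metric reporter documentation for your Flink version:

# flink-conf.yaml for the jobs of task group A
# (host name and jobName are placeholders)
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: pushgateway-a.example.com
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: group-a
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: true
metrics.reporter.promgateway.interval: 60 SECONDS

# the jobs of task group B use the same settings,
# but with metrics.reporter.promgateway.host: pushgateway-b.example.com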
There are other push-based metric reporters available, such as InfluxDB [1]. Running in cluster mode, InfluxDB may offer better performance than the pushgateway, so you could try it as an alternative and evaluate its performance (a minimal configuration sketch is appended after the quoted message below). I suspect the pushgateway is being used because, when running Flink in YARN application or per-job mode, the task ports are assigned randomly, which makes it difficult for Prometheus to know which endpoints to scrape. By the way, if you deploy jobs with the Flink Kubernetes Operator, you can use the Prometheus metrics reporter directly, without a pushgateway [2].

Best,
Jiabao

[1] https://nightlies.apache.org/flink/flink-docs-master/zh/docs/deployment/metric_reporters/#influxdb
[2] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#how-to-enable-prometheus-example

On 2023/12/12 08:23:22 李琳 wrote:
> hello,
> we configured Flink to report metrics to a Prometheus pushgateway. After the program had been running for a period of time, with a large amount of data reported to the pushgateway, the pushgateway responded with socket timeout exceptions and much of the metrics data failed to be reported. The exception is as follows:
>
> 2023-12-12 04:13:07,812 WARN  org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter [] - Failed to push metrics to PushGateway with jobName 00034937_20231211200917_54ede15602bb8704c3a98ec481bea96, groupingKey {}.
> java.net.SocketTimeoutException: Read timed out
>     at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_281]
>     at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_281]
>     at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_281]
>     at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_281]
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_281]
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_281]
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_281]
>     at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) ~[?:1.8.0_281]
>     at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) ~[?:1.8.0_281]
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593) ~[?:1.8.0_281]
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) ~[?:1.8.0_281]
>     at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) ~[?:1.8.0_281]
>     at io.prometheus.client.exporter.PushGateway.doRequest(PushGateway.java:315) ~[flink-metrics-prometheus-1.13.5.jar:1.13.5]
>     at io.prometheus.client.exporter.PushGateway.push(PushGateway.java:138) ~[flink-metrics-prometheus-1.13.5.jar:1.13.5]
>     at org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter.report(PrometheusPushGatewayReporter.java:63) [flink-metrics-prometheus-1.13.5.jar:1.13.5]
>     at org.apache.flink.runtime.metrics.MetricRegistryImpl$ReporterTask.run(MetricRegistryImpl.java:494) [flink-dist_2.11-1.13.5.jar:1.13.5]
>
> After testing, we found the exception was caused by the amount of data reported to the pushgateway. We restarted the pushgateway server and the exception disappeared, but after several hours it re-emerged.
>
> So I want to know how to configure Flink or the pushgateway to avoid this exception?
>
> best regards,
> leilinee
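Here is the InfluxDB reporter sketch mentioned above. The host, database name, and credentials are placeholders, and the exact option set should be checked against the documentation in [1]:

metrics.reporter.influxdb.factory.class: org.apache.flink.metrics.influxdb.InfluxdbReporterFactory
metrics.reporter.influxdb.scheme: http
# placeholder host and database
metrics.reporter.influxdb.host: influxdb.example.com
metrics.reporter.influxdb.port: 8086
metrics.reporter.influxdb.db: flink_metrics
# placeholder credentials
metrics.reporter.influxdb.username: flink
metrics.reporter.influxdb.password: secret
metrics.reporter.influxdb.consistency: ANY
metrics.reporter.influxdb.interval: 60 SECONDS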