Hi,

The pushgateway collects metrics in push mode, so a single instance can run into performance problems under heavy load. A simple workaround is to deploy multiple pushgateways and push the metrics to different instances based on task groups.
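For example, assuming two pushgateway instances (the host names and jobName below are placeholders), the jobs of each task group could point their reporter at a different instance in flink-conf.yaml. This is only a rough sketch based on the Flink 1.13 reporter options; please check the option names against the metric reporter documentation for your Flink version:

# flink-conf.yaml for the jobs of task group A
# (host name and jobName are placeholders)
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: pushgateway-a.example.com
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: group-a
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: true
metrics.reporter.promgateway.interval: 60 SECONDS

# the jobs of task group B use the same settings,
# but with metrics.reporter.promgateway.host: pushgateway-b.example.com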
There are other push-based metric reporters available, such as InfluxDB [1]. Running in cluster mode, InfluxDB may offer better performance than the pushgateway, so you could try it as an alternative and evaluate its performance (a minimal configuration sketch is appended after the quoted message below). I suspect the pushgateway is being used because, when running Flink in YARN application or per-job mode, the task ports are assigned randomly, which makes it difficult for Prometheus to know which endpoints to scrape. By the way, if you deploy jobs with the Flink Kubernetes Operator, you can use the Prometheus metrics reporter directly, without a pushgateway [2].

Best,
Jiabao

[1] https://nightlies.apache.org/flink/flink-docs-master/zh/docs/deployment/metric_reporters/#influxdb
[2] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#how-to-enable-prometheus-example

On 2023/12/12 08:23:22 李琳 wrote:
> hello,
> we configured Flink to report metrics to a Prometheus pushgateway. After the program had been running for a period of time, with a large amount of data reported to the pushgateway, the pushgateway responded with socket timeout exceptions and much of the metrics data failed to be reported. The exception is as follows:
>
> 2023-12-12 04:13:07,812 WARN  org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter [] - Failed to push metrics to PushGateway with jobName 00034937_20231211200917_54ede15602bb8704c3a98ec481bea96, groupingKey {}.
> java.net.SocketTimeoutException: Read timed out
>     at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_281]
>     at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_281]
>     at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_281]
>     at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_281]
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_281]
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_281]
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_281]
>     at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) ~[?:1.8.0_281]
>     at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) ~[?:1.8.0_281]
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593) ~[?:1.8.0_281]
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) ~[?:1.8.0_281]
>     at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) ~[?:1.8.0_281]
>     at io.prometheus.client.exporter.PushGateway.doRequest(PushGateway.java:315) ~[flink-metrics-prometheus-1.13.5.jar:1.13.5]
>     at io.prometheus.client.exporter.PushGateway.push(PushGateway.java:138) ~[flink-metrics-prometheus-1.13.5.jar:1.13.5]
>     at org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter.report(PrometheusPushGatewayReporter.java:63) [flink-metrics-prometheus-1.13.5.jar:1.13.5]
>     at org.apache.flink.runtime.metrics.MetricRegistryImpl$ReporterTask.run(MetricRegistryImpl.java:494) [flink-dist_2.11-1.13.5.jar:1.13.5]
>
> After testing, we found the exception was caused by the amount of data reported to the pushgateway. We restarted the pushgateway server and the exception disappeared, but after several hours it re-emerged.
>
> So I want to know how to configure Flink or the pushgateway to avoid this exception?
>
> best regards,
> leilinee
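Here is the InfluxDB reporter sketch mentioned above. The host, database name, and credentials are placeholders, and the exact option set should be checked against the documentation in [1]:

metrics.reporter.influxdb.factory.class: org.apache.flink.metrics.influxdb.InfluxdbReporterFactory
metrics.reporter.influxdb.scheme: http
# placeholder host and database
metrics.reporter.influxdb.host: influxdb.example.com
metrics.reporter.influxdb.port: 8086
metrics.reporter.influxdb.db: flink_metrics
# placeholder credentials
metrics.reporter.influxdb.username: flink
metrics.reporter.influxdb.password: secret
metrics.reporter.influxdb.consistency: ANY
metrics.reporter.influxdb.interval: 60 SECONDS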