Thanks a lot, Brian. I think I should create an issue.

On Monday, May 20, 2024 at 1:37:51 PM UTC+8 Brian Candler wrote:
> > server returned HTTP status 500 Internal Server Error: too old sample
>
> This is not the server failing to process the data; it's the client supplying invalid data. You found that this has been fixed to a 400.
>
> > server returned HTTP status 500 Internal Server Error: label name \"prometheus\" is not unique: invalid sample
>
> I can't speak for the authors, but it looks to me like that should be a 400 as well.
>
> On Monday 20 May 2024 at 04:52:03 UTC+1 koly li wrote:
>
>> Sorry for my poor description. Here is the story:
>>
>> 1) At first, we were running prometheus v2.47.
>>
>> Then we found that all metrics were missing, so we checked the prometheus log and the prometheus agent log.
>>
>> prometheus log (lots of lines):
>> ts=2024-04-19T20:33:26.485Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:26.539Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:26.626Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:26.775Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:27.042Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:27.552Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ....
>> ts=2024-04-22T03:00:03.327Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-22T03:00:08.394Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>>
>> prometheus agent log:
>> ts=2024-04-19T20:33:26.517Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-19T20:34:29.714Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-19T20:35:30.113Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-19T20:36:30.478Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ....
>> ts=2024-04-22T02:56:57.281Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-22T02:57:57.624Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-22T02:58:57.943Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-22T02:59:58.267Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-22T03:00:58.733Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>>
>> Then we checked the code:
>> https://github.com/prometheus/prometheus/blob/release-2.47/storage/remote/write_handler.go#L77
>>
>> The "too old sample" error is treated as a 500, and the agent keeps retrying; it exits only when the error is not Recoverable, and a 500 is considered Recoverable (a simplified sketch of this loop is included below):
>> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/queue_manager.go#L1670
>>
>> > You may have come across a bug where a *particular* piece of data being sent by the agent was causing a *particular* version of prometheus to fail with a 5xx internal error every time. The logs should make it clear if this was happening.
>> We guess that one or more samples with too old timestamps caused the problem: they made the prometheus agent retry forever (the agent keeps receiving a 500), which prevented new samples from being sent.
>>
>> Because the offending samples are not logged, we cannot figure out what they are. The server logs the samples only for a few kinds of error:
>> https://github.com/prometheus/prometheus/blob/release-2.47/storage/remote/write_handler.go#L132
>>
>> > The fundamental issue here is, why should restarting the *agent* cause the prometheus *server* to stop returning 500 errors?
>> We restarted the agent 1~2 days after the problem occurred. The new data does not contain too old samples; that's why the 500 errors disappeared.
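>>
>> To make that behavior concrete, here is a minimal Go sketch of the retry logic as we understand it. This is our own paraphrase for illustration only, not the actual Prometheus source (the real code is in the queue_manager.go and client.go files linked above); the names sendWithBackoff and attempt are hypothetical stand-ins. The point is that the loop only gives up on non-recoverable errors, so a server that answers 500 every time keeps the loop alive until the agent is restarted:
>>
>> package main
>>
>> import (
>>     "errors"
>>     "fmt"
>>     "time"
>> )
>>
>> // RecoverableError marks an error as retryable; the remote-write client
>> // treats 5xx responses this way (simplified here for illustration).
>> type RecoverableError struct{ error }
>>
>> // sendWithBackoff retries attempt() with exponential backoff for as long
>> // as it keeps returning a RecoverableError. There is no retry limit: if
>> // the server answers 500 forever, this loop never returns.
>> func sendWithBackoff(attempt func() error, minBackoff, maxBackoff time.Duration) error {
>>     backoff := minBackoff
>>     for {
>>         err := attempt()
>>         if err == nil {
>>             return nil
>>         }
>>         var re RecoverableError
>>         if !errors.As(err, &re) {
>>             return err // non-recoverable (e.g. a 4xx): give up and drop the batch
>>         }
>>         time.Sleep(backoff) // recoverable (e.g. a 5xx): wait and try again
>>         backoff *= 2
>>         if backoff > maxBackoff {
>>             backoff = maxBackoff
>>         }
>>     }
>> }
>>
>> func main() {
>>     // Hypothetical server that returns 500 for the first three attempts.
>>     // In our incident it returned 500 on every attempt, so the loop would
>>     // never have exited until we restarted the agent.
>>     failures := 3
>>     attempt := func() error {
>>         if failures > 0 {
>>             failures--
>>             return RecoverableError{errors.New("server returned HTTP status 500 Internal Server Error: too old sample")}
>>         }
>>         return nil
>>     }
>>     fmt.Println(sendWithBackoff(attempt, 30*time.Millisecond, 5*time.Second)) // <nil>
>> }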
>>
>> 2) Then we upgraded to v2.51.
>>
>> The new version returns 400 for "too old sample":
>> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/write_handler.go#L72
>>
>> However, we encountered another 500.
>>
>> prometheus agent log:
>> ts=2024-05-11T08:42:01.235Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"prometheus\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:02.749Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"service\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:02.798Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"resourceType\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:02.851Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"namespace\" is not unique: invalid sample"
>>
>> We modified the code to log the samples, and got this in the prometheus log:
>> ts=2024-05-11T08:42:26.603Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"resourceId\" is not unique: invalid sample" series="{__name__=\"ovs_vswitchd_interface_resets_total\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-rbac-proxy\", endpoint=\"ovs-metrics\", hostname=\"20230428-wangbo-dev16\", if_name=\"veth99fa6555\", instance=\"10.253.58.238:9983\", job=\"net-monitor-vnet-ovs\", namespace=\"net-monitor\", pod=\"net-monitor-vnet-ovs-66bdz\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceId=\"port-naqoi5tmkg5lrt0ubw\", resourceId=\"blb-74se39mqa9k3\", resourceType=\"Port\", resourceType=\"BLB\", rs_ip=\"10.0.0.3\", service=\"net-monitor-vnet-ovs\", service=\"net-monitor-vnet-ovs\", subnet_Id=\"snet-ztojflwrnd08xf5idw\", vip=\"11.4.2.64\", vpc_Id=\"vpc-6ss1uz29ctpfv0eqbj\", vpcid=\"11.4.2.64\"}" timestamp=1715349156000
>> ts=2024-05-11T08:42:26.603Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"resourceId\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:26.967Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"service\" is not unique: invalid sample" series="{__name__=\"rest_client_request_size_bytes_bucket\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-scheduler\", endpoint=\"https\", host=\"127.0.0.1:6443\", instance=\"10.253.58.236:10259\", job=\"scheduler\", le=\"262144\", namespace=\"kube-scheduler\", pod=\"kube-scheduler-20230428-wangbo-dev14\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceType=\"NETWORK-HOST\", service=\"scheduler\", service=\"net-monitor-vnet-ovs\", verb=\"POST\"}" timestamp=1715349164522
>> ts=2024-05-11T08:42:26.967Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"service\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:27.091Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"prometheus_replica\" is not unique: invalid sample" series="{__name__=\"workqueue_work_duration_seconds_sum\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", endpoint=\"https\", instance=\"21.100.10.52:8443\", job=\"metrics\", name=\"ResourceSyncController\", namespace=\"service-ca-operator\", pod=\"service-ca-operator-645cfdbfb6-rjr4z\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", prometheus_replica=\"prometheus-agent-0-0\", service=\"metrics\"}" timestamp=1715349271085
>> ts=2024-05-11T08:42:27.091Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"prometheus_replica\" is not unique: invalid sample"
>>
>> Currently we don't know why there are duplicated labels (a sketch of the uniqueness check that rejects them is included after our questions below). But when the server encounters duplicated labels, it returns a 500, and then the agent keeps retrying, which means new samples cannot be handled.
>>
>> We set external_labels in the prometheus-agent config:
>> global:
>>   evaluation_interval: 30s
>>   scrape_interval: 5m
>>   scrape_timeout: 1m
>>   external_labels:
>>     clusterName: clustertest150
>>     clusterRegion: region0
>>     clusterZone: zone1
>>     prometheus: ccos-monitoring/agent-0
>>     prometheus_replica: prometheus-agent-0-0
>>   keep_dropped_targets: 1
>>
>> and the remote write config:
>> remote_write:
>> - url: https://prometheus-k8s-0.monitoring:9091/api/v1/write
>>   remote_timeout: 30s
>>   name: prometheus-k8s-0
>>   write_relabel_configs:
>>   - target_label: __tmp_cluster_id__
>>     replacement: 713c30cb-81c3-411d-b4dc-0c775a0f9564
>>     action: replace
>>   - regex: __tmp_cluster_id__
>>     action: labeldrop
>>   bearer_token: XDFSDF...
>>   tls_config:
>>     insecure_skip_verify: true
>>   queue_config:
>>     capacity: 10000
>>     min_shards: 1
>>     max_shards: 500
>>     max_samples_per_send: 2000
>>     batch_send_deadline: 10s
>>     min_backoff: 30ms
>>     max_backoff: 5s
>>     sample_age_limit: 5m
>>
>> > You are saying that you would prefer the agent to throw away data, rather than hold onto the data and try again later when it may succeed. In this situation, retrying is normally the correct thing to do.
>> Yes, retrying is the normal solution, but there should be a maximum number of retries. We noticed that the prometheus agent puts the retry count into a request header, but it seems that header is not used by the server:
>> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/client.go#L214
>>
>> Besides, if some samples in a request are incorrect and others are correct, why doesn't the prometheus server save the correct part and drop the wrong part? It is more complicated because retries have to be considered, but would it be possible to save the partial data and return a 206 once the maximum number of retries is reached?
>>
>> And should the prometheus server log the offending samples for all kinds of error?
>> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/write_handler.go#L133
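>>
>> Coming back to the duplicated labels: the Go sketch below is our own illustration, not the Prometheus source, of the kind of uniqueness check that produces the "label name ... is not unique: invalid sample" error. With the label names sorted, any adjacent pair with the same name is a duplicate; the series data is a cut-down version of the first rejected series from our log above:
>>
>> package main
>>
>> import (
>>     "fmt"
>>     "sort"
>> )
>>
>> // Label is a minimal stand-in for one Prometheus label pair.
>> type Label struct {
>>     Name, Value string
>> }
>>
>> // checkLabelNamesUnique sorts the labels by name and rejects the series
>> // if any name appears more than once, mirroring the error in our logs.
>> func checkLabelNamesUnique(lbls []Label) error {
>>     sort.Slice(lbls, func(i, j int) bool { return lbls[i].Name < lbls[j].Name })
>>     for i := 1; i < len(lbls); i++ {
>>         if lbls[i].Name == lbls[i-1].Name {
>>             return fmt.Errorf("label name %q is not unique: invalid sample", lbls[i].Name)
>>         }
>>     }
>>     return nil
>> }
>>
>> func main() {
>>     // Cut-down version of the rejected series: resourceId and resourceType
>>     // each appear twice with different values.
>>     series := []Label{
>>         {"__name__", "ovs_vswitchd_interface_resets_total"},
>>         {"resourceId", "port-naqoi5tmkg5lrt0ubw"},
>>         {"resourceId", "blb-74se39mqa9k3"},
>>         {"resourceType", "Port"},
>>         {"resourceType", "BLB"},
>>     }
>>     // Prints: label name "resourceId" is not unique: invalid sample
>>     fmt.Println(checkLabelNamesUnique(series))
>> }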
>>
>> On Friday, May 17, 2024 at 8:15:04 PM UTC+8 Brian Candler wrote:
>>
>>> It's difficult to make sense of what you're saying. Without seeing logs from both the agent and the server while this problem was occurring (e.g. `journalctl -eu prometheus`), it's hard to know what was really happening. Also you need to say what exact versions of prometheus and the agent were running.
>>>
>>> The fundamental issue here is, why should restarting the *agent* cause the prometheus *server* to stop returning 500 errors?
>>>
>>> > So my question is why a 5xx from the prometheus server is considered Recoverable?
>>>
>>> It is by definition of the HTTP protocol:
>>> https://datatracker.ietf.org/doc/html/rfc2616#section-10.5
>>>
>>> Actually it depends on exactly which 5xx error code you're talking about, but common 500 and 503 errors are generally transient, meaning there was a problem at the server and the request may succeed if tried again later. If the prometheus server wanted to tell the client that the request was invalid and could never possibly succeed, then it would return a 4xx error.
>>>
>>> > And I believe there should be a way to exit the loop, for example a maximum number of retries.
>>>
>>> You are saying that you would prefer the agent to throw away data, rather than hold onto the data and try again later when it may succeed. In this situation, retrying is normally the correct thing to do.
>>>
>>> You may have come across a bug where a *particular* piece of data being sent by the agent was causing a *particular* version of prometheus to fail with a 5xx internal error every time. The logs should make it clear if this was happening.
>>>
>>> On Friday 17 May 2024 at 10:02:49 UTC+1 koly li wrote:
>>>
>>>> Hello all,
>>>>
>>>> Recently we found that all of our samples were lost. After some investigation, we found:
>>>> 1. We are using the prometheus agent to send all data to the prometheus server via remote write.
>>>> 2. The agent's sample-sending code is in storage/remote/queue_manager.go, in the function sendWriteRequestWithBackoff().
>>>> 3. Inside that function, if attempt() (the function where the request is made to the prometheus server) returns a Recoverable error, it retries sending the request.
>>>> 4. When is a Recoverable error returned? One scenario is when the prometheus server returns a 5xx error.
>>>> 5. I think not every 5xx error is recoverable, and there is no other way to exit the for loop in sendWriteRequestWithBackoff(). The agent kept retrying but received a 5xx from the server every time, so we lost all samples for hours until we restarted the agent.
>>>>
>>>> So my question is: why is a 5xx from the prometheus server considered Recoverable? And I believe there should be a way to exit the loop, for example a maximum number of retries.
>>>>
>>>> It seems that the agent mode is not mature enough to work in production.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/5661de1d-21c5-486c-9177-fa346ebdc922n%40googlegroups.com.