Thank you very much for your support.

> are you still deleting the WAL? 

No, I did not delete the WAL at all. What I did was restart a pod that 
exposes 500K time series.
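For reference, here is how I confirmed the WAL directory is still intact. 
The data path below is an assumption from our deployment; it depends on 
--storage.tsdb.path:

```shell
# Confirm the WAL was not deleted: numbered segment files (00000000, ...)
# and a checkpoint.* directory should still be present under the data dir.
# /prometheus is an assumed --storage.tsdb.path; adjust for your setup.
ls -l /prometheus/wal
```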

> So either you picked very unlucky times to grab the profiles, or 
something else is inconsistent. 
Do you have any suggestions on when the heap profiles should be captured? 
E.g. one right before restarting the target, and a second one 6h after the 
target restarts?

Do you need any logs, metrics, or anything else that could help spot the 
issue more easily?
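For what it's worth, this is roughly how I have been grabbing the profiles, 
in case the method itself is the problem. localhost:9090 is an assumption 
for where Prometheus listens (--web.listen-address):

```shell
# Grab a heap profile from Prometheus's built-in pprof endpoint.
curl -s http://localhost:9090/debug/pprof/heap > heap_before.pprof
# ...restart the target, wait a few head-compaction cycles (~6h), then:
curl -s http://localhost:9090/debug/pprof/heap > heap_after.pprof
# Inspect with, e.g.: go tool pprof -top heap_after.pprof
```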

Regards, Vu
On Thursday, November 30, 2023 at 1:13:13 AM UTC+7 Bryan Boreham wrote:

> Thanks for sending more details and profiles.
>
> 'heap_before.pprof' shows 1264MB in use and 'heap_after.pprof' shows 
> 1273MB.
> There are no material differences; the 'after' one has more memory used to 
> track series removed after head compaction.
> There are about 500,000 series objects in both profiles.
>
> I am confused why nothing shows up as allocated during WAL reading - are 
> you still deleting the WAL?
>
> The memory visible in heap profiles is after garbage collection, while 
> go_memstats_next_gc_bytes is measured after heap growth, which defaults 
> to 100% growth, i.e. that metric should be roughly twice the amount in 
> the profile.
> So either you picked very unlucky times to grab the profiles, or something 
> else is inconsistent.
>
> So, sorry but I cannot tie back what these profiles say to the symptom you 
> described.
>
> Regards,
>
> Bryan
>
> On Friday, 24 November 2023 at 03:40:52 UTC [email protected] wrote:
>
>> Hi Bryan,
>>
>> I managed to reproduce the problem and captured the data as you suggested.
>>
>> First, here are the graphs in UTC timezone:
>>
>> [image: prometheus_latest_memory_increase_after_upgrade_v3.png]
>>
>> and for heap profiles, please have a look at the attachments.
>>
>> Thank you for your support.
>> On Monday, November 6, 2023 at 10:45:32 PM UTC+7 Bryan Boreham wrote:
>>
>>> I think this issue is relevant: 
>>> https://github.com/prometheus/prometheus/issues/12286
>>>
>>> I didn't follow your description of the symptoms; 
>>>
>>> > the memory goes up to 3.7Gi comparing to 2.5Gi 
>>>
>>> In your picture I see spikes at over 5Gi.   The spikes are every 2 hours 
>>> which would tie in to head compactions.  
>>> If you state what timezone your charts are in, or better show them in 
>>> UTC, we could be more sure.
>>>
>>> Note that working set and RSS are Linux's estimate of what the process 
>>> is using; they are not concrete enough to reason from.
>>> Suggest you add go_memstats_next_gc_bytes to your chart; this is tied to 
>>> what the program is actually referencing.
>>>
>>> A Go heap profile is even more concrete and detailed. See here 
>>> <https://github.com/prometheus/prometheus/issues/6934#issuecomment-1708499430>
>>> .
>>>
>>> Bryan
>>>
>>>
>>> On Friday, 3 November 2023 at 06:22:57 UTC-5 Vu Nguyen wrote:
>>>
>>>> If we clean all the data under /wal and then restart Prometheus, the 
>>>> memory comes back to the low point it was at before triggering the 
>>>> restart. 
>>>> But we don't want to apply that trick, as we could lose a 3h time span 
>>>> of data.
>>>>
>>>> On Wednesday, November 1, 2023 at 11:24:23 AM UTC+7 Vu Nguyen wrote:
>>>>
>>>>> Leaving the deployment running for a while after the 3rd restart of 
>>>>> the target (6 rounds of WAL truncation), the memory goes up to 3.7Gi 
>>>>> compared to 2.5Gi before the restart. I guess there must be something 
>>>>> Prometheus holds on to in this upgrade/restart scenario.
>>>>>
>>>>> [image: 
>>>>> ask_prometheus_user_memory_increase_after_target_restart_upgrade_v2.jpg]
>>>>>
>>>>> On Tuesday, October 31, 2023 at 10:07:24 PM UTC+7 Vu Nguyen wrote:
>>>>>
>>>>>> We have Prometheus v2.47.1 deployed on k8s, scraping 500K time 
>>>>>> series from a single target (*).
>>>>>>
>>>>>> When we restart the target, the number of time series in the HEAD 
>>>>>> block jumps to 1M [1], and Prometheus memory increases from the 
>>>>>> average of 2.5Gi to 3Gi. Leaving Prometheus running for a few WAL 
>>>>>> truncation cycles, the memory still does not go back to the point 
>>>>>> before restarting the target, even though the number of time series 
>>>>>> in the HEAD block is back to 500K.
>>>>>>
>>>>>> If I trigger another target restart, the memory keeps going up. 
>>>>>> Here is the graph:
>>>>>>
>>>>>> Could you please help us understand why the memory does not fall 
>>>>>> back to the initial point (*) before we restarted/upgraded the target?
>>>>>>
>>>>>> [1] A k8s pod restart comes up with a new IP, i.e. a new instance 
>>>>>> label value; therefore, a new set of 500K time series is generated.
>>>>>>
>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/ecc0da5b-777e-485c-94f6-b6797c9f86dbn%40googlegroups.com.
