Hi guys,

Current pprof stats: https://pastebin.com/0UnhvWpH

@Brian: I do have a metric called node_systemd_unit_state (a systemd metric), 
and installed_software{package_name="GConf2.x86_64",version="3.2.6-8.el7"} 
(our custom metric) also has high cardinality.
Full TSDB status here: https://pastebin.com/DFJ2k3Q1
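As a stopgap, I'm considering dropping those two metrics at scrape time with metric_relabel_configs. A sketch only; the metric names are taken from the TSDB status above, and the regex would need checking against our actual scrape config:

```yaml
# Inside each relevant scrape_config: drop the two high-cardinality
# metrics before they are ingested into the TSDB.
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'node_systemd_unit_state|installed_software'
    action: drop
```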

I'll probably also consider Loki if this problem turns out not to be resolvable.

Let me post the TSDB stats; our monitoring team keeps clearing targets out.
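For what it's worth, I re-did the back-of-the-envelope math from the reply below, assuming the ~8KiB-per-active-series rule of thumb quoted there:

```python
# Rough memory estimate for our setup.
# Assumption: ~8 KiB per active series (rule of thumb from the reply below).
nodes = 100
series_per_node = 8257        # our measured average metrics per node
bytes_per_series = 8 * 1024   # assumed 8 KiB per active series

total_series = nodes * series_per_node
estimated_bytes = total_series * bytes_per_series
print(total_series)                       # 825700 series
print(round(estimated_bytes / 2**30, 1))  # ~6.3 GiB
```

So the ~7GiB figure below roughly matches, which makes our observed 15 GiB look about 2x higher than expected.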

TIA,
Shubham Shrivastava

On Thursday, 10 February 2022 at 05:48:37 UTC-8 [email protected] wrote:

> Your *8257* metrics per node mean you have *825,700* active series on the 
> server across 100 nodes. Typical Prometheus usage is around 8KiB per active 
> series, so this is expected to need roughly *7GiB* of memory.
>
> The problem is that the pprof and memory usage you have posted do not match 
> your claims. Without *data taken while under your real load*, it's 
> impossible to tell you what is wrong.
>
> It's also useful to include the *prometheus_tsdb_head_series* metric, but 
> again, only while you have all 100 servers configured.
>
> You need to post more information about your actual setup. Please include 
> your full configuration file and version information.
>
> On Thu, Feb 10, 2022 at 3:45 AM Shubham Shrivastav <[email protected]> 
> wrote:
>
>> Hi all, 
>>
>> I've been investigating Prometheus memory utilization over the last 
>> couple of days.
>>
>> Based on *pprof* output, I do see a lot of memory used by the 
>> *getOrSet* function, but according to the docs it's just for creating new 
>> series, so I'm not sure what I can do about it.
>>
>>
>> Pprof "top" output: 
>> https://pastebin.com/bAF3fGpN
>>
>> Also, to figure out whether I have any metrics I can remove, I ran ./tsdb 
>> analyze on memory *(output here: https://pastebin.com/twsFiuRk)*
>>
>> I did find some metrics with higher cardinality than others, but the 
>> difference was not dramatic.
>>
>> With ~100 nodes, our RAM usage is around 15 GiB.
>>
>> We're getting an average of *8257* metrics per node.
>> We estimate growing to around 200 nodes, which will send our RAM usage 
>> through the roof.
>>
>> Present Situation:
>> The Prometheus containers got restarted due to OOM, so I have fewer 
>> targets now (~6). That's probably why the numbers look low, but the 
>> metrics pulled per target are the same. I was trying to recognize the 
>> pattern.
>>
>> Some metrics: 
>>
>> *process_resident_memory_bytes{instance="localhost:9090", job="prometheus"} 
>> 1536786432*
>> *go_memstats_alloc_bytes{instance="localhost:9090", job="prometheus"} 
>> 908149496*
>>
>> Apart from distributing our load over multiple Prometheus nodes, are 
>> there any alternatives?
>>
>>
>>
>> TIA,
>> Shubham
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/255cc740-065b-4da4-8eb8-d595b8bc67e1n%40googlegroups.com.
