> installed_software{package_name="GConf2.x86_64",version="3.2.6-8.el7"}
> (our custom metric) that has high cardinality
Cardinality refers to the number of distinct label combinations, and those
labels are not necessarily high cardinality, as long as their values come
from a relatively small set and don't keep changing dynamically. You'll see
something similar in standard node_exporter metrics like node_uname_info.
Think of it like this: after the scrape, Prometheus adds its own "job"
and "instance" labels. So if each target exposes 1,000
'installed_software' metrics and you have 100 targets, then you'll have a
total of 100,000 timeseries which look like this:
installed_software{instance="server1",job="node",package_name="GConf2.x86_64",version="3.2.6-8.el7"}
1
installed_software{instance="server1",job="node",package_name="blah.x86_64",version="a.b.c.d-el7"}
1
installed_software{instance="server2",job="node",package_name="GConf2.x86_64",version="3.2.1-4.el7"}
1
... etc
However, as long as those labels aren't *changing*, you will just have
the same 100,000 timeseries over time, and that's not a large number of
timeseries for Prometheus to deal with. (Aside: they all have the static
value "1", so the delta between adjacent scrapes is 0, which means they
compress extremely well on disk.) Furthermore, the distinct label values
like "GConf2.x86_64" are shared between many timeseries.
Now, looking at your TSDB stats, you have about 1 million of these,
followed by 1 million of node_systemd_unit_state, followed by a whole bunch
of artemis_XXX metrics with about 200K series each. That *is* a lot of
series, and I can see how this could easily add up to 8 million. You could
count all the artemis ones like this:
count({__name__=~"artemis_.*"})
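To see which metric names contribute the most series overall, a query along
these lines is a standard cardinality-exploration pattern (it can be heavy
on a large server, so consider narrowing the time range first):

```promql
topk(10, count by (__name__)({__name__=~".+"}))
```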
If you genuinely have 8 million timeseries then you're going to have to
make a decision:
1. throw money at the problem and scale up Prometheus to handle the load
2. decide which of these metrics have low business value and stop
collecting them
3. reduce the number of timeseries, e.g. by aggregating them at the exporter
4. collect, store and query this information in a different way
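As a sketch of option 2: metrics can be dropped at scrape time with
metric_relabel_configs, so they never enter the TSDB at all. The job name,
target and regex below are placeholders; only you can decide which metric
families are low-value:

```yaml
scrape_configs:
  - job_name: 'artemis'            # hypothetical job name
    static_configs:
      - targets: ['broker1:9404']  # placeholder target
    metric_relabel_configs:
      # Drop whole metric families by name before ingestion;
      # adjust the regex to match the low-value artemis_* metrics
      # you identify in your TSDB stats.
      - source_labels: [__name__]
        regex: 'artemis_some_low_value_metric_.*'
        action: drop
```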
Only you can decide the tradeoff. It seems to me that many of these
metrics, like "artemis_message_count", could be really valuable. But do you
really have 200K separate message brokers, or are these metrics giving you
too much detail (e.g. separate stats per queue)? Could you aggregate
them down to a single value per node, whilst maintaining their usefulness?
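For instance, to judge whether option 3 loses anything you need, you could
first try the node-level aggregation at query time and compare it against
the per-queue data (the metric name is taken from your stats; the
aggregation label is an assumption):

```promql
sum by (instance) (artemis_message_count)
```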
On Friday, 11 February 2022 at 08:28:55 UTC [email protected] wrote:
> Hi guys,
>
> Current pprof stats: https://pastebin.com/0UnhvWpH
>
> @Brian: I do have a metric called node_systemd_unit_state (systemd metric)
> and installed_software{package_name="GConf2.x86_64",version="3.2.6-8.el7"}
> (our custom metric) that has high cardinality
> Full TSDB Status here: https://pastebin.com/DFJ2k3Q1
>
> I think I'll probably also consider Loki if this problem is not resolvable
>
> Let me post the tsdb stats, our monitoring team keeps clearing targets out.
>
> TIA,
> Shubham Shrivastava
>
> On Thursday, 10 February 2022 at 05:48:37 UTC-8 [email protected] wrote:
>
>> Your *8257* metrics per node mean you have *825,700* metrics on the
>> server. The typical usage of Prometheus is around 8KiB per active series.
>> This is expected to need *7GiB* of memory.
>>
>> The problem is you have posted pprof and memory usage that do not match
>> what your claims are. Without *data while under your real load*, it's
>> impossible to tell you what is wrong.
>>
>> It's also useful to include the *prometheus_tsdb_head_series* metric.
>> But again, while you have your 100 servers configured.
>>
>> You need to post more information about your actual setup. Please include
>> your full configuration file and version information.
>>
>> On Thu, Feb 10, 2022 at 3:45 AM Shubham Shrivastav <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I've been investigating Prometheus memory utilization over the last
>>> couple of days.
>>>
>>> Based on *pprof* command outputs, I do see a lot of memory utilized by
>>> the *getOrSet* function, but according to the docs, it's just for creating
>>> new series, so I'm not sure what I can do about it.
>>>
>>>
>>> Pprof "top" output:
>>> https://pastebin.com/bAF3fGpN
>>>
>>> Also, to figure out if I have any metrics that I can remove I ran ./tsdb
>>> analyze on memory *(output here: https://pastebin.com/twsFiuRk
>>> <https://pastebin.com/twsFiuRk>)*
>>>
>>> I did find some metrics having more cardinality than others but the
>>> difference was not very massive.
>>>
>>> With ~100 nodes our RAM takes around 15 Gigs.
>>>
>>> We're getting *average Metrics Per node: 8257*
>>> Our estimation is around 200 nodes, which will make our RAM go through
>>> the roof.
>>>
>>> Present Situation:
>>> Prometheus Containers got restarted due to OOM and I have fewer targets
>>> now (~6). That's probably why numbers seem low, but the metrics pulled will
>>> be the same.
>>> I was trying to recognize the pattern
>>>
>>> Some metrics:
>>>
>>> *process_resident_memory_bytes{instance="localhost:9090", job="prometheus"} 1536786432*
>>> *go_memstats_alloc_bytes{instance="localhost:9090", job="prometheus"} 908149496*
>>>
>>> Apart from distributing our load over multiple Prometheus nodes, are
>>> there any alternatives?
>>>
>>>
>>>
>>> TIA,
>>> Shubham
>>>
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/dd5d7dba-4615-4d1a-8268-ace8e4ab7415n%40googlegroups.com.