I’m going to type a lot of extra words, mostly for people only barely familiar with this part of the codebase. It may or may not be useful to passive observers (it wasn’t familiar to me, so I’m mostly echoing the things I just went and learned this morning):
The HLL cardinality is used for basically 2-3 things:

- It was introduced in https://issues.apache.org/jira/browse/CASSANDRA-5906 to try to make bloom filter sizing more accurate, especially when compacting two SSTables each with N keys, because you don’t know whether you have N keys or 2N keys after they merge.
- It’s still used for that. It’s also used for other times we may want to “count” the number of partitions, for example in per-table metrics.
- The advantage of HLL over just a single `long` per SSTable is the ability to merge the HLL structures across SSTables, so that you count the actual number of distinct partitions and (try to) avoid overcounting when the same partition lives in many SSTables.
- If there is no HLL data for a given SSTable, we return an alternative but similar metric: a non-deduplicated long, estimated as the number of partition index entries for the SSTable times the minimum index interval.

There’s basically nothing ever calling the key cardinality code in the hot path or in a tight loop (it’s not in the read path, and I’ve personally never seen it show up in a compaction profile). The biggest impact is almost certainly in observability codepaths, where people are querying with nodetool in a loop, JMX scraping, or exporting various prom metrics, and unique partitions is something people may actually want graphed on a dashboard. But the impact depends a ton on (how many SSTables you have * how often you’re scraping metrics).

That 1.5MB of HLL data itself isn’t a big deal to keep in memory, and “just” keeping it in memory probably speeds things up quite a bit for people who have a ton of SSTables, scrape metrics often, and/or have slow disks with that data somehow evicted from page cache.

When we talk about allocation rates, though, there are going to be “allocations caused by materializing the HLL off disk into heap” and “allocations caused by merging HLL data across SSTables”. You can remove the first by caching, and that seems fine and uncontroversial. You can potentially even remove the second by caching the end result and invalidating it when new SSTables show up. Also fine and uncontroversial (to me, at least). Rough sketches of both ideas follow right below; there’s also a sketch of the version-gating idea at the very bottom, underneath the quoted thread.

To Chris’ point, the benefit here probably depends heavily on cache size, and if that cache size is < 100% you’re potentially still going to see the same disk pattern, but it looks like you were proposing caching “all of them” (which is what I’d do, too).

I don’t know why we’d tie those optimizations to the library replacement, though. The library replacement has a ton of extra logic around the upgrade handling you’ve already tackled, and the caching is just caching.
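To make the merge-vs-plain-counter point concrete, here is a minimal, self-contained demo using the DataSketches API that is being proposed. The key names, counts, and lgK value are made up for the example; this is not code from the patch:

    import org.apache.datasketches.hll.HllSketch;
    import org.apache.datasketches.hll.Union;

    public class HllMergeDemo
    {
        public static void main(String[] args)
        {
            // Two hypothetical sstables whose partition sets overlap by half.
            HllSketch a = new HllSketch(13); // lgK = 13, roughly the ~2KB-per-sketch regime
            HllSketch b = new HllSketch(13);
            for (int i = 0; i < 10_000; i++)
                a.update("partition-" + i);
            for (int i = 5_000; i < 15_000; i++)
                b.update("partition-" + i);

            // A plain long per sstable would say 10k + 10k = 20k partitions.
            // Merging the sketches deduplicates the shared keys: ~15k.
            Union union = new Union(13);
            union.update(a);
            union.update(b);
            System.out.printf("naive sum: 20000, merged HLL estimate: %.0f%n",
                              union.getResult().getEstimate());

            // If an sstable has no HLL data at all, the fallback is the
            // non-deduplicated estimate: index entries * min index interval,
            // e.g. 79 index entries * 128 (the default interval) ~= 10k keys.
        }
    }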
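And a rough sketch of what the caching could look like. Everything here (the class, keying by a string id, the listener hooks) is hypothetical rather than existing Cassandra API; it is only meant to show where the two allocation sources go away:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.datasketches.hll.HllSketch;
    import org.apache.datasketches.hll.Union;

    // Hypothetical per-table cache. Assumes something (e.g. a tracker
    // notification) calls onAdded/onRemoved when the live sstable set changes.
    final class PartitionCountCache
    {
        private final Map<String, HllSketch> perSSTable = new ConcurrentHashMap<>();
        private volatile double merged = -1; // < 0 means "needs recompute"

        // Removes allocation source #1: each sketch is deserialized off disk once.
        void onAdded(String sstableId, HllSketch sketch)
        {
            perSSTable.put(sstableId, sketch);
            merged = -1;
        }

        void onRemoved(String sstableId)
        {
            perSSTable.remove(sstableId);
            merged = -1;
        }

        // Removes allocation source #2: the union is only rebuilt after the
        // sstable set has changed, not on every metrics scrape.
        double estimatedPartitions()
        {
            double m = merged;
            if (m < 0)
            {
                Union union = new Union(13);
                for (HllSketch s : perSSTable.values())
                    union.update(s);
                m = union.getResult().getEstimate();
                merged = m; // benign race: worst case is a redundant recompute
            }
            return m;
        }
    }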
> On Jan 2, 2025, at 11:47 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
> 
> -> about 800 live SSTables
> 
> Well, that would occupy 1.5MB of hyperloglogs, each being 2000 bytes. That's peanuts, compared to going 800 times to the disk every minute.
> 
> On Thu, Jan 2, 2025 at 8:18 PM Chris Lohfink <clohfin...@gmail.com> wrote:
>> > Regarding allocation details. The DB host had the following stats at that time: 5K/sec local reads, 3K/sec local writes, about 800 live SSTables; the profile was collected with duration = 5 minutes. I do not have allocation rate info for that time period.
>> 
>> What was the allocation rate on heap, I mean. If there's 100mb/s of heap allocation it doesn't really matter if it's 5%. If there's 5gb/s it does.
>> 
>> Assuming the replacement takes 0% isn't always safe to assume either. We could alternatively provide a different "partition count" metric that just uses the index summary; it would be cheap/free-ish for metric collection when accuracy doesn't matter, while things like `nodetool tablestats` can still use the accurate, more expensive path.
>> 
>> I can be convinced with benchmarks, but I've never seen this so much as a blip on the radar. We are still running 4.1, though, so maybe things have improved so much it's become relevant. Just want to make sure that whatever we replace it with is benchmarked to not make things worse.
>> 
>> Chris
>> 
>> On Thu, Jan 2, 2025 at 10:42 AM Dmitry Konstantinov <netud...@gmail.com> wrote:
>>> Let me clarify my comment regarding allocation. I am not saying that switching to another implementation will make it better and that we need to do it right now :-), any such switch is a subject for pros/cons analysis (and memory allocation I think should be one of the criteria). What I wanted to say: this place attracted my attention as potentially suspicious from another perspective, so it makes sense to analyse it in more detail, nothing more as of now.
>>> 
>>> Regarding allocation details. The DB host had the following stats at that time: 5K/sec local reads, 3K/sec local writes, about 800 live SSTables; the profile was collected with duration = 5 minutes. I do not have allocation rate info for that time period.
>>> 
>>> On Thu, 2 Jan 2025 at 16:02, Benedict <bened...@apache.org> wrote:
>>>> I am strongly opposed to updating libraries simply for the sake of it. Something like HLL does not need much ongoing maintenance if it works. We're simply asking for extra work and bugs by switching, and taking some risk without understanding the quality control for the new library project's releases.
>>>> 
>>>> That said, I was not very impressed with the Clearspring library when I looked at it, so I would be open to a stronger argument that DataSketches is otherwise superior in a way that matters to us.
>>>> 
>>>> If we are to replace the library, we should at the very least do proper due diligence by reviewing the new library's implementation(s) ourselves. We cannot simply assume the new library behaves well for our use cases, or is well maintained.
>>>> 
>>>> We should also not use the fallback intersection method, as this would represent a regression to compaction on upgrade. We should really convert from one HLL to another.
>>>> 
>>>> The proposal to reduce allocations appears to be orthogonal to this library, so let's separate out that discussion? If there's evidence this library alone improves the memory profile, let's discuss that.
>>>> 
>>>>> On 2 Jan 2025, at 15:26, Chris Lohfink <clohfin...@gmail.com> wrote:
>>>>> 
>>>>> I think switching to DataSketches is a good idea, first off simply because of the lack of maintenance and improvements from Clearspring. I am, however, not sold that it will actually improve anything significantly. Caches might help in small cases, but those small cases probably are not actually impacted. In the large cases the caches cost more in complexity and memory, and ultimately won't matter when there are 50k sstables and the cache holds 1k, so everything's hitting disk anyway.
>>>>> 
>>>>> The 5% figure is missing some relevant information, like what the allocation rate was, how many tables there are, etc. On an idle system that's meaningless; if there were 5gb/s of allocations from reads/writes happening at the time, that's huge.
>>>>> 
>>>>> On Thu, Jan 2, 2025 at 8:42 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>> Interesting, thanks for this. Well ... 5% here, 5% there ... it compounds. I think it is worth trying to do something with this. Would be great if you were part of this effort!
>>>>>> 
>>>>>> On Thu, Jan 2, 2025 at 3:38 PM Dmitry Konstantinov <netud...@gmail.com> wrote:
>>>>>>> I have seen this place in an async-profiler memory allocation profile on one of our production environments some time ago; it was visible but not dramatic, about 5% of allocations:
>>>>>>> [attached: allocation profile screenshot]
>>>>>>> 
>>>>>>> The amount of overhead also depends on the metric collection frequency (in my case it was once per 60 seconds).
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Dmitry
>>>>>>> 
>>>>>>> On Thu, 2 Jan 2025 at 14:21, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>>>> Indeed, I plan to measure it and compare; maybe some bench test would be cool to add.
>>>>>>>> 
>>>>>>>> I strongly suspect that the primary reason for the slowness (if it is verified to be true) is us going to the disk every time and reading stats for every SSTable all over again.
>>>>>>>> 
>>>>>>>> While DataSketches says that it is way faster to update (1), we are living in a realm of nanoseconds here and I don't think that by itself would make any meaningful difference when merging one hyperloglog with another as part of the partition count estimation. The only place we are updating is SortedTableWriter#endPartition, which calls metadataCollector.addKey(key.getKey()), which eventually updates the estimator via cardinality#offerHashed.
>>>>>>>> 
>>>>>>>> In other words, I think that going to the disk and reading it repeatedly is disproportionately more IO / time intensive than switching the hyperloglog implementation.
>>>>>>>> 
>>>>>>>> However, I still consider the replacement of the library important. I feel uneasy about staying with an abandoned library when there is clearly a well-maintained replacement.
>>>>>>>> 
>>>>>>>> What we could do is cache all the cardinality estimators and just merge them when asked upon metric resolution. That is different from going to disk and deserializing StatsComponents all over again.
>>>>>>>> 
>>>>>>>> Then on SSTable removal, we would remove it from the cache too. I think there is some kind of an observer for when an SSTable is removed ...
>>>>>>>> 
>>>>>>>> However, I am not sure we can just hold it all in memory. It works for laptop testing, but if we have thousands of SSTables with a non-trivial number of rows, things start to get interesting.
>>>>>>>> 
>>>>>>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html - section "HllSketch vs. HyperLogLogPlus Update Speed Behavior"
>>>>>>>> 
>>>>>>>> On Thu, Jan 2, 2025 at 2:46 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>>>>> Sounds interesting. I took a look at the issue but I'm not seeing any data to back up "expensive".
>>>>>>>>> Can this be quantified a bit more?
>>>>>>>>> 
>>>>>>>>> Anytime we have a performance-related issue, there should be some data to back it up, even if it seems obvious.
>>>>>>>>> 
>>>>>>>>> Jon
>>>>>>>>> 
>>>>>>>>> On Thu, Jan 2, 2025 at 8:20 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>>>>>> Hello,
>>>>>>>>>> 
>>>>>>>>>> I just stumbled upon the library we are using for estimating the number of partitions in an SSTable, which is used e.g. in the EstimatedPartitionCount metric. (1)
>>>>>>>>>> 
>>>>>>>>>> A user reported in (1) that it is an expensive operation. When one looks into what it is doing, it calls SSTableReader.getApproximateKeyCount() (6), which basically goes to disk every single time: it loads all Stats components and looks into CompactionMetadata, where the cardinality estimator is located.
>>>>>>>>>> 
>>>>>>>>>> We serialize the hyperloglog to disk as part of an SSTable, and at runtime we deserialize it back for every SSTable in a CF and merge them all into one cardinality again.
>>>>>>>>>> 
>>>>>>>>>> I do not think there is a way around this because of the nature of how a cardinality estimator (hyperloglog) works. We cannot "cache it"; that would work only if we were adding SSTables only - then we would just merge again - but if we remove an SSTable as part of compaction, we cannot "unmerge" it.
>>>>>>>>>> 
>>>>>>>>>> That being said, we are currently using a library for hyperloglog (1) which was archived in summer 2020 and has had no contributions for 6 years. That lib is dead.
>>>>>>>>>> 
>>>>>>>>>> There is a very nice replacement (2), directly from Apache (!!!), and they even give a detailed, in-depth comparison with the hyperloglog implementation found in stream-lib, which we happen to use (3) (stream-lib = Clearspring), where they say that updating is way faster and it also gives better estimations in general.
>>>>>>>>>> 
>>>>>>>>>> I have implemented the usage of both cardinality estimators (4), (5). The reason we need to keep the old one around is that we may have old SSTables around and we need to work with them too. That translates to a new SSTable version (ob) which uses the new implementation; for versions < ob, the old one is used. When SSTables are upgraded from oa to ob, the old estimator is not used anymore.
>>>>>>>>>> 
>>>>>>>>>> There is also the case of a user not upgrading his oa SSTables, turning a node on, and creating new SSTables with the ob version. When this happens and we ask for the cardinality (e.g. via nodetool tablestats), I check whether all SSTables are on the same version or not. If they are, they will use either the old or the new estimator (we cannot merge estimations from two different hyperloglog implementations). If they are not, it will compute the count from index summaries. (The computation from index summaries was already in place (6) as a failover in case the estimation computation failed / was not present.)
>>>>>>>>>> 
>>>>>>>>>> Does this all make sense to drive further to completion and eventually merge this work to trunk?
>>>>>>>>>> 
>>>>>>>>>> Worth adding that Apache DataSketches means just two dependencies for us, and it has zero external dependencies itself.
>>>>>>>>>> 
>>>>>>>>>> (1) https://github.com/addthis/stream-lib
>>>>>>>>>> (2) https://datasketches.apache.org/
>>>>>>>>>> (3) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
>>>>>>>>>> (4) https://issues.apache.org/jira/browse/CASSANDRA-13338
>>>>>>>>>> (5) https://github.com/apache/cassandra/pull/3767
>>>>>>>>>> (6) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L284-L338
>>>>>>>>>> 
>>>>>>>>>> Regards
>>>>>>> 
>>>>>>> -- 
>>>>>>> Dmitry Konstantinov
>>> 
>>> -- 
>>> Dmitry Konstantinov
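The version-gating sketch I promised up top, for anyone skimming. This is only the shape of the logic Štefan describes (merge only when every SSTable carries the same HLL implementation, otherwise fall back to the index-summary estimate); every name here is made up for illustration, and it is not the actual patch:

    import java.util.List;
    import java.util.function.ToLongFunction;

    // Hypothetical, compilable shape of the version-gated logic described above.
    public class VersionGatedCount
    {
        // Stand-in for the bits of SSTableReader this logic needs.
        interface SSTable
        {
            String version();     // e.g. "oa" (Clearspring HLL) or "ob" (DataSketches HLL)
            long indexEntries();
            int minIndexInterval();
        }

        // hllMerge is whichever real merge applies: a Clearspring merge when every
        // sstable is "oa", a DataSketches union when every sstable is "ob" or later.
        static long approximateKeyCount(List<SSTable> sstables, ToLongFunction<List<SSTable>> hllMerge)
        {
            // Simplified check: the real code would group versions by which
            // estimator they carry, not by exact version string.
            boolean mixed = sstables.stream().map(SSTable::version).distinct().count() > 1;
            if (!mixed)
                return hllMerge.applyAsLong(sstables);

            // Mixed oa/ob: sketches from two different HLL implementations cannot
            // be merged, so fall back to the non-deduplicated index-summary
            // estimate that already existed as a failover path.
            long count = 0;
            for (SSTable t : sstables)
                count += (long) t.indexEntries() * t.minIndexInterval();
            return count;
        }
    }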