Hi Benedict, all,
As part of the ZSTD dictionary compression implementation (CEP-54), there is
an SSTable version bump being introduced ("pa" SSTable format). I think this is
a good opportunity to introduce the Datasketches implementation of HLL and have
new-format SSTables use it while the old format stays with stream-lib. This is
the most straightforward way to handle the upgrade here.
Would the community be interested in replacing it? It was already mentioned
that stream-lib has been abandoned for years.
On Thu, Mar 13, 2025 at 10:18 AM Benedict <[email protected]> wrote:
> Is there no way to use your own hash function?
>
> Because we’re already supplying murmur64 values from the most common
> partitioner (I assume we’re not rehashing the keys themselves, as that would
> be pointlessly costly)
>
> On 13 Mar 2025, at 07:12, Štefan Miklošovič <[email protected]>
> wrote:
>
>
> For the record, they got back to me telling me it is definitely not
> possible to merge / convert.
>
> https://lists.apache.org/thread/8xscy9q0s8rqhsfovj03656zdh29qlnp
>
> On Wed, Mar 12, 2025 at 12:55 PM Benedict Elliott Smith <
> [email protected]> wrote:
>
>> Hi Stefan,
>>
>> My reading of this mailing list thread is that they think clearspring is
>> junk (probably fair) and so you shouldn’t use it or convert it. I am not
>> sure this actually means it cannot be done.
>>
>> That said, a simpler option might be to produce both sketches until we
>> can “upgrade” all of the legacy sstables to the new sketches. This would be
>> fine in my book, and probably much simpler.
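>>
>> (Roughly: on each key we would feed both estimators, something like
>>
>>     legacyCardinality.offerHashed(hash);   // Clearspring, for old readers
>>     newHllSketch.update(hash);             // Datasketches, for the new format
>>
>> and then serialize whichever one the sstable version calls for. Just a
>> sketch; the variable names are illustrative.)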
>>
>> On 12 Mar 2025, at 11:37, Štefan Miklošovič <[email protected]>
>> wrote:
>>
>> Benedict,
>>
>> I have reached out to the Datasketches community (1) and asked what they
>> think about Clearspring and whether it is convertible to Datasketches, as you
>> earlier suggested we might try converting one to the other.
>>
>> Based on what they wrote, I do not think that is possible (2); they say that
>> Clearspring has "serious error problems" and does not implement Google's
>> HLL++ paper correctly.
>>
>> As I see it, when we have SSTables in both the old and the new format, we
>> might compute keys as in (3). This code would be exercised only as long as
>> mixed formats are present. Once SSTables are upgraded to the new format, or
>> old SSTables are compacted away into new-format SSTables, we would not do it
>> like in (3) anymore.
>>
>> If we are OK with this, then I would spend more time on finishing the PR and
>> run some perf tests so we can compare before / after.
>>
>> How does that sound?
>>
>> Regards
>>
>> (1) https://lists.apache.org/thread/4rhbqzqyh1cn0pmbst8som4kvvko8gqp
>> (2) https://lists.apache.org/thread/l00yv67wwtztgl5lopdtbw3z9s7fng5b
>> (3) https://github.com/apache/cassandra/pull/3767/files#r1989136062
>>
>> On Fri, Jan 3, 2025 at 1:47 PM Benedict <[email protected]> wrote:
>>
>>> I’ve had a quick skim of the data sketches library, and it does seem to
>>> have made some more efficient decisions in its design than clearspring,
>>> appears to maybe support off-heap representations, and has reasonably good
>>> documentation about the theoretical properties of the sketches. The chair
>>> of the project is a published author on the topic, and the library has
>>> newer algorithms for cardinality estimation than HLL.
>>>
>>> So, honestly, it might not be a bad idea to (carefully) consider a
>>> migration, even if the current library isn’t broken for our needs.
>>>
>>> It would not be high up my priority list for the project, but I would
>>> support it if it scratches someone’s itch.
>>>
>>> On 3 Jan 2025, at 12:16, Štefan Miklošovič <[email protected]>
>>> wrote:
>>>
>>>
>>> Okay ... first problems.
>>>
>>> The 2000 bytes I mentioned in my response to Chris were indeed correct, but
>>> that was with Datasketches and the main parameter for HllSketch
>>> (DEFAULT_LG_K) set to 12. When I changed that to 13 to match what we
>>> currently have in Cassandra with Clearspring, the size doubled to
>>> ~4000 bytes.
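>>>
>>> A quick way to see the lgK / size relationship locally (just a sketch using
>>> the org.apache.datasketches.hll package; the loop simulating distinct keys
>>> is illustrative):
>>>
>>>     import org.apache.datasketches.hll.HllSketch;
>>>
>>>     for (int lgK : new int[] { 12, 13 }) {
>>>         HllSketch sketch = new HllSketch(lgK);
>>>         for (long i = 0; i < 1_000_000; i++)
>>>             sketch.update(i);               // simulate many distinct keys
>>>         // serialized size roughly doubles for each +1 of lgK once dense
>>>         System.out.println(lgK + " -> "
>>>             + sketch.toCompactByteArray().length + " bytes");
>>>     }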
>>>
>>> Without Datasketches, what Clearspring generates is about ~5000 bytes for
>>> the array itself, but that array is wrapped in a Clearspring ICardinality
>>> object, and we need that object in order to merge another ICardinality into
>>> it. So we would need to cache this ICardinality object instead of just the
>>> array. If we don't cache the whole ICardinality, we would then need to do
>>> basically what CompactionMetadata.CompactionMetadataSerializer.deserialize
>>> does, which would allocate a lot and often (ICardinality cardinality =
>>> HyperLogLogPlus.Builder.build(that_cached_array)).
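>>>
>>> In code, the difference is roughly this (Clearspring API; the variable names
>>> are illustrative and exception handling is omitted):
>>>
>>>     // rebuild on every metric read: allocates a fresh ~11-12 KB object each time
>>>     ICardinality rebuilt = HyperLogLogPlus.Builder.build(cachedArray);
>>>
>>>     // versus keeping the ICardinality itself cached and merging it directly
>>>     ICardinality merged = cachedEstimator.merge(otherEstimator);
>>>     long estimate = merged.cardinality();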
>>>
>>> To avoid those allocations every time we compute, we would cache the whole
>>> Clearspring ICardinality, but that object measures around 11-12 KB. So even
>>> 10k SSTables would occupy around 100 MB, and 50k SSTables around 500 MB.
>>> That is becoming quite a problem.
>>>
>>> On the other hand, the HllSketch of Datasketches adds minimal overhead on
>>> top of the array: the array is about 5000 bytes and the whole object about
>>> 5500. You get the idea ...
>>>
>>> If we are still OK with these sizes, sure ... I am just being
>>> transparent about the consequences here.
>>>
>>> A user would just opt in to this (by default it would be turned off).
>>>
>>> On the other hand, if we have 10k SSTables and reading those 10+ KB from
>>> disk takes around 2-3 ms each, we would spend 20-30 seconds on disk reads
>>> every time we hit that metric (and we haven't even started merging the
>>> hyperloglogs).
>>>
>>> If this is still not something that would sell Datasketches as a viable
>>> alternative, then I guess we need to stick with these numbers and cache it
>>> all with Clearspring, occupying far more memory.
>>>
>>> On Thu, Jan 2, 2025 at 10:15 PM Benedict <[email protected]> wrote:
>>>
>>>> I would like to see somebody who has some experience writing data
>>>> structures, preferably someone we trust as a community to be competent at
>>>> this (i.e. having some experience contributing at this level within the
>>>> project), look at the code as if they were at least lightly reviewing the
>>>> feature as a contribution to this project.
>>>>
>>>> This should be the bar for any new library really, but triply so for
>>>> replacing a library that works fine.
>>>>
>>>> On 2 Jan 2025, at 21:02, Štefan Miklošovič <[email protected]>
>>>> wrote:
>>>>
>>>>
>>>> Point 2) is pretty hard to fulfil; I cannot imagine what would be "enough"
>>>> to persuade you. What should concretely happen? Whoever comes and says
>>>> "yeah, this is a good lib, it works" is probably not going to be enough
>>>> given the vague requirements you put under 2). What exactly would you like
>>>> to see?
>>>>
>>>> The way it looks to me, this is being shut down because of the perceived
>>>> churn it would cause, and there will always be some argument against it.
>>>>
>>>> Based on (1), I don't think what we have is bug-free.
>>>>
>>>> Jeff:
>>>>
>>>> Thank you for that answer. I think we are on the same page that caching it
>>>> is just fine; that is what I took from your last two paragraphs.
>>>>
>>>> So the path from here is
>>>>
>>>> 1) add datasketches and cache
>>>> 2) don't add datasketches and cache it anyway
>>>>
>>>> Introducing the datasketches lib is not an absolute must in order to
>>>> achieve that; we can cache / compute it in parallel with Clearspring as
>>>> well, it is just a bittersweet solution which doesn't feel right.
>>>>
>>>> (1) https://github.com/addthis/stream-lib/issues
>>>>
>>>> On Thu, Jan 2, 2025 at 9:26 PM Benedict <[email protected]> wrote:
>>>>
>>>>> Your message seemed to be all about the caching proposal, which I have
>>>>> proposed we separate, hence my confusion.
>>>>>
>>>>> To restate my answer to your question, I think that unless the new
>>>>> library actually offers us concrete benefits we can point to that we
>>>>> actually care about then yes it’s a bad idea to incur the churn of
>>>>> migration.
>>>>>
>>>>> I’m not inherently opposed to a migration but simply “new is better”
>>>>> is just plain wrong. Nothing you’ve presented yet convinces me this
>>>>> library
>>>>> is worth the effort of vetting given our current solution works fine.
>>>>>
>>>>> My position is that for any new library we should:
>>>>>
>>>>> 1) Point to something it solves that we actually want and is worth the
>>>>> time investment
>>>>> 2) Solicit folk in the community competent in the relevant data
>>>>> structures to vet the library for the proposed functionality
>>>>>
>>>>> The existing solution never went through (2) because it dates from the
>>>>> dark ages when we just threw dependencies in willy-nilly. But it has the
>>>>> benefit of having been used for a very long time without incident.
>>>>>
>>>>>
>>>>>
>>>>> On 2 Jan 2025, at 20:12, Štefan Miklošovič <[email protected]>
>>>>> wrote:
>>>>>
>>>>>
>>>>> Hi Benedict,
>>>>>
>>>>> you wrote:
>>>>>
>>>>> I am strongly opposed to updating libraries simply for the sake of it.
>>>>> Something like HLL does not need much ongoing maintenance if it works.
>>>>> We’re simply asking for extra work and bugs by switching, and some risk
>>>>> without understanding the quality control for the new library project’s
>>>>> releases.
>>>>>
>>>>> I understand this. But really, do you think that it is a bad idea to
>>>>> switch to a well-maintained library which is already used quite widely
>>>>> (the website mentions extensions for sketches in Apache Druid, Hive, Pig,
>>>>> Pinot and PostgreSQL), rather than keep using a library which has been
>>>>> abandoned for 6 years?
>>>>>
>>>>> As I mentioned, there is also an extensive comparison with Clearspring (1)
>>>>> where all the performance benefits / speedups etc. are presented in detail
>>>>> with charts attached.
>>>>>
>>>>> I think this is a mature project, under Apache, so if we think that a
>>>>> 6-year-old, abandoned library is better than what Apache Datasketches
>>>>> provides, then the question is what are we doing here? Do we not trust
>>>>> what Apache itself offers, and need to rely on a 6-year-old, dead library
>>>>> instead? That lib has 3k commits, releases often, it's a pretty active
>>>>> project ...
>>>>>
>>>>> I am not saying we should not test more deeply how it behaves; we might
>>>>> even reconsider the hyperloglog parameters as we do that. But I don't
>>>>> think that introducing this library would cause some kind of widespread /
>>>>> systemic risk.
>>>>>
>>>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
>>>>>
>>>>> On Thu, Jan 2, 2025 at 5:03 PM Benedict <[email protected]> wrote:
>>>>>
>>>>>> I am strongly opposed to updating libraries simply for the sake of
>>>>>> it. Something like HLL does not need much ongoing maintenance if it
>>>>>> works.
>>>>>> We’re simply asking for extra work and bugs by switching, and some risk
>>>>>> without understanding the quality control for the new library project’s
>>>>>> releases.
>>>>>>
>>>>>> That said, I was not very impressed with the Clearspring library when I
>>>>>> looked at it, so I would be open to a stronger argument about Datasketches
>>>>>> being otherwise superior in a way that matters to us.
>>>>>>
>>>>>> If we are to replace the library, we should at the very least do
>>>>>> proper due diligence by reviewing the new library’s implementation(s)
>>>>>> ourselves. We cannot simply assume the new library behaves well for our
>>>>>> use
>>>>>> cases, or is well maintained.
>>>>>>
>>>>>> We should also not use the fallback intersection method, as this would
>>>>>> represent a regression for compaction on upgrade. We should really
>>>>>> convert from one HLL to another.
>>>>>>
>>>>>> The proposal to reduce allocations appears to be orthogonal to this
>>>>>> library, so let’s separate out that discussion? If there’s evidence this
>>>>>> library alone improves the memory profile let’s discuss that.
>>>>>>
>>>>>>
>>>>>> On 2 Jan 2025, at 15:26, Chris Lohfink <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> I think switching to datasketches is a good idea first off simply because
>>>>>> of the lack of maintenance and improvements from clearspring. I am,
>>>>>> however, not sold that it will actually improve anything significantly.
>>>>>> Caches might help in small cases, but those small cases probably are not
>>>>>> actually impacted. In the large cases the caches cost more in complexity
>>>>>> and memory, and ultimately won't matter when there are 50k sstables and
>>>>>> the cache holds 1k, so everything is hitting disk anyway.
>>>>>>
>>>>>> The 5% is missing some relevant information, like what the allocation
>>>>>> rate was, how many tables there are, etc. On an idle system that's
>>>>>> meaningless; if there were 5 GB/s of allocations from reads/writes
>>>>>> happening at the time, that's huge.
>>>>>>
>>>>>> On Thu, Jan 2, 2025 at 8:42 AM Štefan Miklošovič <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Interesting, thanks for this. Well ... 5% here, 5% there ... it
>>>>>>> compounds. I think it is worth trying to do something with this. Would
>>>>>>> be
>>>>>>> great if you were part of this effort!
>>>>>>>
>>>>>>> On Thu, Jan 2, 2025 at 3:38 PM Dmitry Konstantinov <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I saw this place in an async-profiler memory allocation profile on one
>>>>>>>> of our production environments some time ago; it was visible but not
>>>>>>>> dramatic, about 5% of allocations:
>>>>>>>> <image.png>
>>>>>>>>
>>>>>>>> The amount of overhead also depends on the metric collection frequency
>>>>>>>> (in my case it was once per 60 seconds).
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Dmitry
>>>>>>>>
>>>>>>>> On Thu, 2 Jan 2025 at 14:21, Štefan Miklošovič <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Indeed, I plan to measure it and compare; maybe some benchmark would
>>>>>>>>> be good to add ...
>>>>>>>>>
>>>>>>>>> I strongly suspect that the primary reason for the slowness (if it
>>>>>>>>> is verified to be true) is us going to the disk every time and reading
>>>>>>>>> stats for every SSTable all over again.
>>>>>>>>>
>>>>>>>>> While Datasketches says that it is way faster to update (1), we are
>>>>>>>>> living in a realm of nanoseconds here and I don't think that by itself
>>>>>>>>> would make any meaningful difference when merging one hyperloglog with
>>>>>>>>> another as part of the partition count estimation. The only place we
>>>>>>>>> update is SortedTableWriter#endPartition, which calls
>>>>>>>>> metadataCollector.addKey(key.getKey()), which eventually updates the
>>>>>>>>> estimator via cardinality#offerHashed.
>>>>>>>>>
>>>>>>>>> In other words, I think that going to disk and reading it repeatedly
>>>>>>>>> is disproportionately more IO / time intensive than the choice of
>>>>>>>>> hyperloglog implementation.
>>>>>>>>>
>>>>>>>>> However, I still consider the replacement of the library important. I
>>>>>>>>> feel uneasy about staying with an abandoned library when there is
>>>>>>>>> clearly a well-maintained replacement.
>>>>>>>>>
>>>>>>>>> What we could do is cache all the cardinality estimators and just
>>>>>>>>> merge them when the metric is asked for. That is different from going
>>>>>>>>> to disk and deserializing the StatsComponents all over again.
>>>>>>>>>
>>>>>>>>> Then on SSTable removal, we would remove it from the cache too. I
>>>>>>>>> think there is some kind of observer for when an SSTable is removed ...
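>>>>>>>>>
>>>>>>>>> A minimal sketch of what I mean (Datasketches hll package; the map and
>>>>>>>>> method names are illustrative, not actual Cassandra code):
>>>>>>>>>
>>>>>>>>>     import java.util.Map;
>>>>>>>>>     import java.util.concurrent.ConcurrentHashMap;
>>>>>>>>>     import org.apache.datasketches.hll.HllSketch;
>>>>>>>>>     import org.apache.datasketches.hll.Union;
>>>>>>>>>
>>>>>>>>>     // one deserialized sketch per live sstable, evicted on removal
>>>>>>>>>     Map<SSTableReader, HllSketch> cache = new ConcurrentHashMap<>();
>>>>>>>>>
>>>>>>>>>     long estimatePartitions() {
>>>>>>>>>         Union union = new Union(13);            // lgMaxK matching the sketches
>>>>>>>>>         cache.values().forEach(union::update);  // merge in memory, no disk reads
>>>>>>>>>         return Math.round(union.getResult().getEstimate());
>>>>>>>>>     }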
>>>>>>>>>
>>>>>>>>> However, I am not sure we can just hold it all in memory; it works for
>>>>>>>>> laptop testing, but if we have thousands of SSTables with a non-trivial
>>>>>>>>> number of rows, things start to get interesting.
>>>>>>>>>
>>>>>>>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
>>>>>>>>> - section HllSketch vs. HyperLogLogPlus Update Speed Behavior
>>>>>>>>>
>>>>>>>>> On Thu, Jan 2, 2025 at 2:46 PM Jon Haddad <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Sounds interesting. I took a look at the issue but I'm not
>>>>>>>>>> seeing any data to back up "expensive". Can this be quantified a bit
>>>>>>>>>> more?
>>>>>>>>>>
>>>>>>>>>> Anytime we have a performance related issue, there should be some
>>>>>>>>>> data to back it up, even if it seems obvious.
>>>>>>>>>>
>>>>>>>>>> Jon
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 2, 2025 at 8:20 AM Štefan Miklošovič <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I just stumbled upon this library we are using for estimating the
>>>>>>>>>>> number of partitions in an SSTable, which is used e.g. in the
>>>>>>>>>>> EstimatedPartitionCount metric (1).
>>>>>>>>>>>
>>>>>>>>>>> A user reported in (1) that it is an expensive operation. When one
>>>>>>>>>>> looks into what it is doing, it calls
>>>>>>>>>>> SSTableReader.getApproximateKeyCount() (6), which basically goes to
>>>>>>>>>>> disk every single time: it loads all Stats components and looks into
>>>>>>>>>>> CompactionMetadata, where the cardinality estimator is located.
>>>>>>>>>>>
>>>>>>>>>>> We serialize the hyperloglog to disk as part of an SSTable, we
>>>>>>>>>>> deserialize it back at runtime for every SSTable in a CF, and we
>>>>>>>>>>> merge them all into one cardinality again.
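>>>>>>>>>>>
>>>>>>>>>>> Conceptually, each metric read does something like this (simplified
>>>>>>>>>>> Clearspring-based sketch; readStatsComponent is a stand-in for the
>>>>>>>>>>> actual component loading, exception handling omitted):
>>>>>>>>>>>
>>>>>>>>>>>     ICardinality merged = new HyperLogLogPlus(13, 25);
>>>>>>>>>>>     for (SSTableReader sstable : sstables) {
>>>>>>>>>>>         byte[] bytes = readStatsComponent(sstable);  // disk read per sstable
>>>>>>>>>>>         merged = merged.merge(HyperLogLogPlus.Builder.build(bytes));
>>>>>>>>>>>     }
>>>>>>>>>>>     long estimatedPartitions = merged.cardinality();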
>>>>>>>>>>>
>>>>>>>>>>> I do not think there is a way around this because of the nature of
>>>>>>>>>>> how a cardinality estimator (hyperloglog) works. We cannot "cache"
>>>>>>>>>>> the merged result: that would only work if we were only ever adding
>>>>>>>>>>> SSTables - then we would just merge the new one in - but if we remove
>>>>>>>>>>> an SSTable as part of compaction, we cannot "unmerge" it.
>>>>>>>>>>>
>>>>>>>>>>> That being said, we are currently using this library (1) for
>>>>>>>>>>> hyperloglog, which was archived in summer 2020 and has seen no
>>>>>>>>>>> contributions for 6 years. That lib is dead.
>>>>>>>>>>>
>>>>>>>>>>> There is a very nice replacement for it (2), directly from Apache
>>>>>>>>>>> (!!!), and they even provide a detailed, in-depth comparison (3) with
>>>>>>>>>>> the hyperloglog implementation found in stream-lib, which we happen
>>>>>>>>>>> to use (stream-lib = Clearspring), where they say that updating is
>>>>>>>>>>> way faster and that it also gives better estimations in general.
>>>>>>>>>>>
>>>>>>>>>>> I have implemented the usage of both cardinality estimators (4), (5).
>>>>>>>>>>> The reason we need to keep the old one around is that we may still
>>>>>>>>>>> have old SSTables and we need to work with them too. That translates
>>>>>>>>>>> into a new SSTable version (ob) which uses the new implementation;
>>>>>>>>>>> for versions < ob, the old one is used. When SSTables are upgraded
>>>>>>>>>>> from oa to ob, the old estimator will not be used anymore.
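>>>>>>>>>>>
>>>>>>>>>>> The gate is roughly this shape (illustrative only; hasDatasketchesHll
>>>>>>>>>>> and the wrapper are placeholders, not the exact names in the PR):
>>>>>>>>>>>
>>>>>>>>>>>     // pick the estimator implementation based on the sstable format version
>>>>>>>>>>>     static ICardinality newEstimator(Version version) {
>>>>>>>>>>>         return version.hasDatasketchesHll()      // true for "ob" and newer
>>>>>>>>>>>                ? new HllSketchCardinality(13)    // Datasketches-backed wrapper
>>>>>>>>>>>                : new HyperLogLogPlus(13, 25);    // legacy Clearspring
>>>>>>>>>>>     }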
>>>>>>>>>>>
>>>>>>>>>>> There is also the case of a user not upgrading their oa SSTables,
>>>>>>>>>>> turning a node on and creating new SSTables with the ob version. When
>>>>>>>>>>> this happens and we ask for the cardinality (e.g. via nodetool
>>>>>>>>>>> tablestats), I check whether all SSTables are on the same version. If
>>>>>>>>>>> they are, they will use either the old or the new estimator (we
>>>>>>>>>>> cannot merge estimations from two different hyperloglog
>>>>>>>>>>> implementations). If they are not, it will compute the count from
>>>>>>>>>>> index summaries. (The computation from index summaries was already in
>>>>>>>>>>> place (6) as a fallback in case the estimation computation failed or
>>>>>>>>>>> was not present.)
>>>>>>>>>>>
>>>>>>>>>>> Does it make sense to drive this further to completion and
>>>>>>>>>>> eventually merge this work to trunk?
>>>>>>>>>>>
>>>>>>>>>>> Worth adding that Apache Datasketches means just two dependencies
>>>>>>>>>>> for us, and it has zero external dependencies itself.
>>>>>>>>>>>
>>>>>>>>>>> (1) https://github.com/addthis/stream-lib
>>>>>>>>>>> (2) https://datasketches.apache.org/
>>>>>>>>>>> (3)
>>>>>>>>>>> https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
>>>>>>>>>>> (4) https://issues.apache.org/jira/browse/CASSANDRA-13338
>>>>>>>>>>> (5) https://github.com/apache/cassandra/pull/3767
>>>>>>>>>>> (6)
>>>>>>>>>>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L284-L338
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Dmitry Konstantinov
>>>>>>>>
>>>>>>>
>>