Benedict,

I have reached out to the Datasketches community (1) and asked what they
think about Clearspring and whether it is convertible to Datasketches, as
you earlier suggested we might try to convert one to the other.

Based on what they wrote, I do not think that conversion is possible (2).
They say that Clearspring has "serious error problems" and does not
implement Google's HLL++ paper correctly, etc.

As I see it, in case we have SSTables in both old and new formats, we
might compute keys like here (3). This code would be exercised only as long
as there are mixed formats. If we upgrade SSTables to the new format, or if
old SSTables are compacted away into SSTables of the new format, we would
no longer do it like in (3).

If we are OK with this, then I would spend more time finishing the PR and
run some perf tests etc. so we can compare before / after.

How does that sound?

Regards

(1) https://lists.apache.org/thread/4rhbqzqyh1cn0pmbst8som4kvvko8gqp
(2) https://lists.apache.org/thread/l00yv67wwtztgl5lopdtbw3z9s7fng5b
(3) https://github.com/apache/cassandra/pull/3767/files#r1989136062

On Fri, Jan 3, 2025 at 1:47 PM Benedict <bened...@apache.org> wrote:

> I’ve had a quick skim of the data sketches library, and it does seem to
> have made some more efficient decisions in its design than clearspring,
> appears to maybe support off-heap representations, and has reasonably good
> documentation about the theoretical properties of the sketches. The chair
> of the project is a published author on the topic, and the library has
> newer algorithms for cardinality estimation than HLL.
>
> So, honestly, it might not be a bad idea to (carefully) consider a
> migration, even if the current library isn’t broken for our needs.
>
> It would not be high up my priority list for the project, but I would
> support it if it scratches someone’s itch.
>
> On 3 Jan 2025, at 12:16, Štefan Miklošovič <smikloso...@apache.org> wrote:
>
> 
> Okay ... first problems.
>
> These 2000 bytes I mentioned in my response to Chris were indeed
> correct, but that was with Datasketches and the main parameter for
> HllSketch (DEFAULT_LG_K) was 12. When I changed that to 13 to match what we
> currently have in Cassandra with Clearspring, that doubled the size to
> ~4000 bytes.
>
> When we do not use Datasketches, what Clearspring generates is about ~5000
> bytes for the array itself, but that array is wrapped in an ICardinality
> object of Clearspring, and we need that object in order to merge another
> ICardinality into it. So we would need to cache this ICardinality object
> instead of just the array itself. If we don't cache the whole ICardinality,
> we would then need to do basically what
> CompactionMetadata.CompactionMetadataSerializer.deserialize is doing, which
> would allocate a lot / often (ICardinality cardinality =
> HyperLogLogPlus.Builder.build(that_cached_array)).
>
> To avoid those allocations every time we compute, we would just cache the
> whole Clearspring ICardinality, but that whole object measures around
> 11-12 KB. So even 10k tables would occupy around 110 MB, and 50k tables
> around 550 MB. That is becoming quite a problem.
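The arithmetic behind those figures (assuming ~11 KB per cached ICardinality, per the estimate above) works out like this:

```python
# Rough footprint of caching one ~11 KB Clearspring ICardinality per SSTable.
PER_OBJECT_KB = 11
for n in (10_000, 50_000):
    mb = n * PER_OBJECT_KB / 1024
    print(f"{n} sstables -> ~{mb:.0f} MB")  # ~107 MB and ~537 MB
```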
>
> On the other hand, HllSketch of Datasketches, array included, adds minimal
> overhead: an array of ~5000 bytes gives a whole object of ~5500. You
> get the idea ...
>
> If we are still OK with these sizes, sure ... I am just being transparent
> about the consequences here.
>
> A user would just opt in to this (by default it would be turned off).
>
> On the other hand, if we have 10k SSTables, reading those 10+ KB from disk
> takes around 2-3 ms per SSTable, so we would spend 20-30 seconds on disk
> every time we hit that metric (and we haven't even started to merge the
> hyperloglogs).
>
> If this is still not something which would sell Datasketches as a viable
> alternative, then I guess we need to stick with these numbers and cache it
> all with Clearspring, occupying way more memory.
>
> On Thu, Jan 2, 2025 at 10:15 PM Benedict <bened...@apache.org> wrote:
>
>> I would like to see somebody who has some experience writing data
>> structures, preferably someone we trust as a community to be competent at
>> this (ie having some experience within the project contributing at this
>> level), look at the code like they were at least lightly reviewing the
>> feature as a contribution to this project.
>>
>> This should be the bar for any new library really, but triply so for
>> replacing a library that works fine.
>>
>> On 2 Jan 2025, at 21:02, Štefan Miklošovič <smikloso...@apache.org>
>> wrote:
>>
>> 
>> Point 2) is pretty hard to fulfil; I cannot imagine what would be
>> "enough" for you to be persuaded. What should concretely happen? Because
>> whoever comes and says "yeah, this is a good lib, it works" is probably not
>> going to be enough given the vague requirements you put under 2). What
>> exactly would you like to see?
>>
>> The way it looks to me is that this is just being shut down because of the
>> perceived churn it would cause, and there will always be some argument
>> against it.
>>
>> Based on (1) I don't think what we have is bug free.
>>
>> Jeff:
>>
>> Thank you for that answer. I think we are on the same page that caching
>> it is just fine; that's what I got from your last two paragraphs.
>>
>> So the path from here is either:
>>
>> 1) add Datasketches and cache
>> 2) don't add Datasketches and cache it anyway
>>
>> Introducing the Datasketches lib is not an absolute must in order to
>> achieve that; we can cache / compute in parallel with Clearspring as well.
>> It is just a bittersweet solution which doesn't feel right.
>>
>> (1) https://github.com/addthis/stream-lib/issues
>>
>> On Thu, Jan 2, 2025 at 9:26 PM Benedict <bened...@apache.org> wrote:
>>
>>> Your message seemed to be all about the caching proposal, which I have
>>> proposed we separate, hence my confusion.
>>>
>>> To restate my answer to your question, I think that unless the new
>>> library actually offers us concrete benefits we can point to that we
>>> actually care about then yes it’s a bad idea to incur the churn of
>>> migration.
>>>
>>> I’m not inherently opposed to a migration but simply “new is better” is
>>> just plain wrong. Nothing you’ve presented yet convinces me this library is
>>> worth the effort of vetting given our current solution works fine.
>>>
>>> My position is that for any new library we should:
>>>
>>> 1) Point to something it solves that we actually want and is worth the
>>> time investment
>>> 2) Solicit folk in the community competent in the relevant data
>>> structures to vet the library for the proposed functionality
>>>
>>> The existing solution never went through (2) because it dates from the
>>> dark ages where we just threw dependencies in willy-nilly. But it has the
>>> benefit of having been used for a very long time without incident.
>>>
>>>
>>>
>>> On 2 Jan 2025, at 20:12, Štefan Miklošovič <smikloso...@apache.org>
>>> wrote:
>>>
>>> 
>>> Hi Benedict,
>>>
>>> you wrote:
>>>
>>> I am strongly opposed to updating libraries simply for the sake of it.
>>> Something like HLL does not need much ongoing maintenance if it works.
>>> We’re simply asking for extra work and bugs by switching, and some risk
>>> without understanding the quality control for the new library project’s
>>> releases.
>>>
>>> I understand this. But really, do you think it is a bad idea to
>>> switch to a well-maintained library which is already quite widely used (the
>>> website mentions extensions for sketches in Apache Druid, Hive, Pig, Pinot
>>> and PostgreSQL), rather than keep using a library which has been abandoned
>>> for 6 years?
>>>
>>> As I mentioned, there is also an extensive comparison with Clearspring (1)
>>> where all the performance benefits / speedups etc. are presented in detail,
>>> with charts attached.
>>>
>>> I think this is a mature project, under Apache, so if we think that a
>>> library abandoned for 6 years is better than what Apache Datasketches
>>> provides, then the question is: what are we doing here? Do we not trust
>>> what Apache itself offers, such that we need to rely on a dead, 6-year-old
>>> library instead? That lib has 3k commits and releases often; it's a
>>> pretty active project ...
>>>
>>> I am not saying that we should not test more deeply how it behaves; we
>>> might even reconsider the parameters of hyperloglog as we do that. But I
>>> don't think that introducing this library would cause some kind of
>>> widespread / systemic risk.
>>>
>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
>>>
>>> On Thu, Jan 2, 2025 at 5:03 PM Benedict <bened...@apache.org> wrote:
>>>
>>>> I am strongly opposed to updating libraries simply for the sake of it.
>>>> Something like HLL does not need much ongoing maintenance if it works.
>>>> We’re simply asking for extra work and bugs by switching, and some risk
>>>> without understanding the quality control for the new library project’s
>>>> releases.
>>>>
>>>> That said, I was not very impressed with the clear spring library when
>>>> I looked at it, so I would be open to a stronger argument about data
>>>> sketches being superior otherwise in a way that matters to us.
>>>>
>>>> If we are to replace the library, we should at the very least do proper
>>>> due diligence by reviewing the new library’s implementation(s) ourselves.
>>>> We cannot simply assume the new library behaves well for our use cases, or
>>>> is well maintained.
>>>>
>>>> We should also not use the fallback intersection method, as this would
>>>> represent a regression to compaction on upgrade. We should really convert
>>>> from one HLL to another.
>>>>
>>>> The proposal to reduce allocations appears to be orthogonal to this
>>>> library, so let’s separate out that discussion? If there’s evidence this
>>>> library alone improves the memory profile let’s discuss that.
>>>>
>>>>
>>>> On 2 Jan 2025, at 15:26, Chris Lohfink <clohfin...@gmail.com> wrote:
>>>>
>>>> 
>>>> I think switching to Datasketches is a good idea, first off simply
>>>> because of the lack of maintenance and improvements from Clearspring. I
>>>> am, however, not sold that it will actually improve anything significantly.
>>>> Caches might help in small cases, but those small cases probably are not
>>>> actually impacted. In the large cases the caches cost more in complexity
>>>> and memory, and ultimately won't matter when there are 50k SSTables and
>>>> the cache holds 1k, so everything is hitting disk anyway.
>>>>
>>>> The 5% figure is missing some relevant information, like what the
>>>> allocation rate was, how many tables there are, etc. On an idle system
>>>> that's meaningless; if there were 5 GB/s of allocations from reads/writes
>>>> happening at the time, that's huge.
>>>>
>>>> On Thu, Jan 2, 2025 at 8:42 AM Štefan Miklošovič <
>>>> smikloso...@apache.org> wrote:
>>>>
>>>>> Interesting, thanks for this. Well ... 5% here, 5% there ... it
>>>>> compounds. I think it is worth trying to do something with this. It would
>>>>> be great if you were part of this effort!
>>>>>
>>>>> On Thu, Jan 2, 2025 at 3:38 PM Dmitry Konstantinov <netud...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I have seen this place in an async-profiler memory allocation profile
>>>>>> in one of our production environments some time ago; it was visible but
>>>>>> not dramatic, about 5% of allocations:
>>>>>> <image.png>
>>>>>>
>>>>>> The amount of overhead also depends on the metric collection frequency
>>>>>> (in my case it was once per 60 seconds).
>>>>>>
>>>>>> Regards,
>>>>>> Dmitry
>>>>>>
>>>>>> On Thu, 2 Jan 2025 at 14:21, Štefan Miklošovič <
>>>>>> smikloso...@apache.org> wrote:
>>>>>>
>>>>>>> Indeed, I plan to measure it and compare; maybe some bench test
>>>>>>> would be cool to add ...
>>>>>>>
>>>>>>> I strongly suspect that the primary reason for the slowness (if it
>>>>>>> is verified to be true) is that we go to disk every time and read the
>>>>>>> stats for every SSTable all over again.
>>>>>>>
>>>>>>> While Datasketches says that it is way faster to update (1), we are
>>>>>>> living in a realm of nanoseconds here, and I don't think that by itself
>>>>>>> would make any meaningful difference when merging one hyperloglog with
>>>>>>> another as part of the partition rows estimation computation. The only
>>>>>>> place we update is SortedTableWriter#endPartition, which calls
>>>>>>> metadatacollector.addKey(key.getKey()), which eventually updates the
>>>>>>> estimator via cardinality#offerHashed.
>>>>>>>
>>>>>>> In other words, I think that going to disk and reading it
>>>>>>> repeatedly is disproportionately more IO / time intensive than switching
>>>>>>> the hyperloglog implementation.
>>>>>>>
>>>>>>> However, I still consider the replacement of the library important.
>>>>>>> I feel uneasy about staying on an abandoned library when there is
>>>>>>> clearly a well-maintained replacement.
>>>>>>>
>>>>>>> What we could do is cache all the cardinality estimators and just
>>>>>>> merge them when the metric is resolved. That is different from
>>>>>>> going to disk and deserializing the StatsComponents all over again.
>>>>>>>
>>>>>>> Then on SSTable removal, we would remove it from the cache too. I
>>>>>>> think there is some kind of an observer for when an SSTable is
>>>>>>> removed ...
>>>>>>>
>>>>>>> However, I am not sure I can just hold it all in memory. It
>>>>>>> works for laptop testing, but if we have thousands of SSTables with a
>>>>>>> non-trivial number of rows, things start to get interesting.
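A minimal sketch of that cache idea (my own illustrative code, not the PR's; Python sets stand in for HLL sketches): keep each SSTable's estimator in memory, merge only when the metric is read, and evict on SSTable removal, so no "unmerge" is ever needed because the merged result is never stored.

```python
# Hypothetical estimator cache: one entry per live SSTable, merged on demand.
class EstimatorCache:
    def __init__(self):
        self._by_sstable = {}  # sstable id -> estimator (set as HLL stand-in)

    def on_sstable_added(self, sstable_id, estimator):
        self._by_sstable[sstable_id] = estimator

    def on_sstable_removed(self, sstable_id):
        # No "unmerge" needed: we keep only the parts, never a merged result.
        self._by_sstable.pop(sstable_id, None)

    def estimated_partition_count(self):
        # Merge all cached estimators at metric-resolution time.
        merged = set()
        for est in self._by_sstable.values():
            merged |= est
        return len(merged)
```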
>>>>>>>
>>>>>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html -
>>>>>>> section HllSketch vs. HyperLogLogPlus Update Speed Behavior
>>>>>>>
>>>>>>> On Thu, Jan 2, 2025 at 2:46 PM Jon Haddad <j...@rustyrazorblade.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sounds interesting.  I took a look at the issue but I'm not seeing
>>>>>>>> any data to back up "expensive".  Can this be quantified a bit more?
>>>>>>>>
>>>>>>>> Anytime we have a performance related issue, there should be some
>>>>>>>> data to back it up, even if it seems obvious.
>>>>>>>>
>>>>>>>> Jon
>>>>>>>>
>>>>>>>> On Thu, Jan 2, 2025 at 8:20 AM Štefan Miklošovič <
>>>>>>>> smikloso...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I just stumbled upon this library we are using to estimate the
>>>>>>>>> number of partitions in an SSTable, which is used e.g. in the
>>>>>>>>> EstimatedPartitionCount metric. (1)
>>>>>>>>>
>>>>>>>>> A user reported in (1) that it is an expensive operation. When one
>>>>>>>>> looks into what it is doing: it calls
>>>>>>>>> SSTableReader.getApproximateKeyCount() (6), which basically goes to
>>>>>>>>> disk every single time, loads all the Stats components, and looks into
>>>>>>>>> CompactionMetadata, where the cardinality estimator is located.
>>>>>>>>>
>>>>>>>>> We serialize the hyperloglog to disk as part of an SSTable,
>>>>>>>>> deserialize it back at runtime for every SSTable in a CF, and
>>>>>>>>> merge them all into one cardinality again.
>>>>>>>>>
>>>>>>>>> I do not think there is a way around this because of the nature of
>>>>>>>>> how a cardinality estimator (hyperloglog) works. We cannot "cache it";
>>>>>>>>> that would work only in case we were adding SSTables, where we would
>>>>>>>>> just merge again, but if we remove an SSTable as part of compaction,
>>>>>>>>> we cannot "unmerge" it.
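A toy illustration of the "no unmerge" point (my own sketch, with HLL registers simplified to plain ints): an HLL union is a register-wise max, and max discards the information needed to subtract one operand back out.

```python
# HLL merge is a register-wise max; two different inputs can produce the
# same merged result, so no inverse ("unmerge") operation can exist.
def merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]

m1 = merge([3, 0, 2, 5], [3, 1, 2, 0])
m2 = merge([1, 0, 2, 5], [3, 1, 2, 0])
assert m1 == m2 == [3, 1, 2, 5]  # distinct first operands, identical union
```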
>>>>>>>>>
>>>>>>>>> That being said, we are currently using this library for
>>>>>>>>> hyperloglog (1), which was archived in summer 2020 and has seen no
>>>>>>>>> contributions for 6 years. That lib is dead.
>>>>>>>>>
>>>>>>>>> There is a very nice replacement for it (2) directly from Apache
>>>>>>>>> (!!!), and they even give a detailed, in-depth comparison with the
>>>>>>>>> hyperloglog implementation found in stream-lib, which we happen to use
>>>>>>>>> (3) (stream-lib = Clearspring), where they say that updating is way
>>>>>>>>> faster and the estimations are better in general.
>>>>>>>>>
>>>>>>>>> I have implemented the usage of both cardinality estimators (4),
>>>>>>>>> (5). The reason we need to keep the old one around is that we may
>>>>>>>>> have old SSTables around and we need to work with them too. That
>>>>>>>>> translates to a new SSTable version (ob) which uses the new
>>>>>>>>> implementation; for versions < ob, the old one is used. When SSTables
>>>>>>>>> are upgraded from oa to ob, the old estimator will not be used
>>>>>>>>> anymore.
>>>>>>>>>
>>>>>>>>> There is also the case of a user not upgrading their oa SSTables,
>>>>>>>>> turning a node on, and creating new SSTables with the ob version. When
>>>>>>>>> this happens and we ask for the cardinality (e.g. via nodetool
>>>>>>>>> tablestats), I check whether all SSTables are on the same version. If
>>>>>>>>> they are, they will use either the old or the new estimator (we cannot
>>>>>>>>> merge estimations from two different hyperloglog implementations). If
>>>>>>>>> they are not, it will compute the count from index summaries. (The
>>>>>>>>> computation from index summaries was already in place (6) as a
>>>>>>>>> fail-over in case the estimation computation failed / was not
>>>>>>>>> present.)
>>>>>>>>>
>>>>>>>>> Does this all make sense to drive further to completion and
>>>>>>>>> eventually merge this work to trunk?
>>>>>>>>>
>>>>>>>>> Worth adding that Apache Datasketches is just two dependencies
>>>>>>>>> for us, and it has zero external dependencies itself.
>>>>>>>>>
>>>>>>>>> (1) https://github.com/addthis/stream-lib
>>>>>>>>> (2) https://datasketches.apache.org/
>>>>>>>>> (3) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
>>>>>>>>> (4) https://issues.apache.org/jira/browse/CASSANDRA-13338
>>>>>>>>> (5) https://github.com/apache/cassandra/pull/3767
>>>>>>>>> (6)
>>>>>>>>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L284-L338
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dmitry Konstantinov
>>>>>>
>>>>>
