Right ... that sounds reasonable. Let's "sleep on it" for a while. It is not urgent to deal with right now, but I quite often find myself identifying functionality where we go to the disk more often than necessary, and this was next on the list to look at after reading CASSANDRA-13338. So I took a look ... and here we are.
If you guys are going to bump the SSTable version in 5.1 / 6.0, this change might just ship with that too.

On Fri, Jan 3, 2025 at 1:47 PM Benedict <bened...@apache.org> wrote:

> I’ve had a quick skim of the data sketches library, and it does seem to have made some more efficient decisions in its design than clearspring, appears to maybe support off-heap representations, and has reasonably good documentation about the theoretical properties of the sketches. The chair of the project is a published author on the topic, and the library has newer algorithms for cardinality estimation than HLL.

> So, honestly, it might not be a bad idea to (carefully) consider a migration, even if the current library isn’t broken for our needs.

> It would not be high up my priority list for the project, but I would support it if it scratches someone’s itch.

> On 3 Jan 2025, at 12:16, Štefan Miklošovič <smikloso...@apache.org> wrote:

>> Okay ... first problems.

>> These 2000 bytes I have mentioned in my response to Chris were indeed correct, but that was with Datasketches and the main parameter for HllSketch (DEFAULT_LG_K) was 12. When I changed that to 13 to match what we currently have in Cassandra with Clearspring, that doubled the size to ~4000 bytes.

>> When we do not use Datasketches, what Clearspring generates is about ~5000 bytes for the array itself, but that array is wrapped into an ICardinality object of Clearspring, and we need that object in order to merge another ICardinality into it. So we would need to cache this ICardinality object instead of just the array itself. If we don't cache the whole ICardinality, we would then need to do basically what CompactionMetadata.CompactionMetadataSerializer.deserialize is doing, which would allocate a lot / often (ICardinality cardinality = HyperLogLogPlus.Builder.build(that_cached_array)).
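(Aside: the doubling described above is just register arithmetic. Below is a back-of-envelope sketch, assuming DataSketches' dense HLL_4 layout of 4 bits per register and stream-lib's RegisterSet layout of six 5-bit registers per 32-bit word, and ignoring the small serialization headers, so real on-disk sizes are slightly larger.)

```java
// Back-of-envelope register-array sizes for the two libraries discussed
// in this thread. Assumptions (not taken from the thread itself): dense
// DataSketches HLL_4 packs 4 bits per register; Clearspring's RegisterSet
// packs six 5-bit registers into each 32-bit word.
public class HllSizeMath {
    static long hll4RegisterBytes(int lgK) {
        long registers = 1L << lgK;  // 2^lgK registers
        return registers * 4 / 8;    // 4 bits per register
    }

    static long clearspringRegisterSetBytes(int p) {
        long registers = 1L << p;          // 2^p registers
        long words = (registers + 5) / 6;  // six 5-bit registers per int word
        return words * 4;                  // 4 bytes per word
    }

    public static void main(String[] args) {
        System.out.println(hll4RegisterBytes(12));          // 2048 -> the "~2000 bytes"
        System.out.println(hll4RegisterBytes(13));          // 4096 -> the "~4000 bytes"
        System.out.println(clearspringRegisterSetBytes(13)); // 5464 -> near the "~5000 bytes"
    }
}
```

The numbers line up with the sizes quoted above for lgK / p = 12 and 13.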
>> To avoid the allocations every time we compute, we would just cache the whole ICardinality of Clearspring, but that whole object measures like 11/12 KB. So even 10k tables would occupy like 100MB; 50k tables, 500MB. That is becoming quite a problem.

>> On the other hand, HllSketch of Datasketches, array included, adds minimal overhead. Like, an array of 5000 bytes becomes a whole object of about 5500. You get the idea ...

>> If we are still OK with these sizes, sure ... I am just being transparent about the consequences here.

>> A user would just opt in to this (by default it would be turned off).

>> On the other hand, if we have 10k SSTables, reading that 10+KB from disk takes around 2-3ms per SSTable, so we would spend 20/30 seconds reading the disk every time we hit that metric (and we haven't even started to merge the logs).

>> If this is still not something which would sell Datasketches as a viable alternative, then I guess we need to stick to these numbers and cache it all with Clearspring, occupying way more memory.

>> On Thu, Jan 2, 2025 at 10:15 PM Benedict <bened...@apache.org> wrote:

>> I would like to see somebody who has some experience writing data structures, preferably someone we trust as a community to be competent at this (i.e. having some experience contributing to the project at this level), look at the code as if they were at least lightly reviewing the feature as a contribution to this project.

>> This should be the bar for any new library really, but triply so for replacing a library that works fine.

>> On 2 Jan 2025, at 21:02, Štefan Miklošovič <smikloso...@apache.org> wrote:

>> Point 2) is pretty hard to fulfil; I can not imagine what would be "enough" to persuade you. What should concretely happen? Whoever comes and says "yeah, this is a good lib, it works" is probably not going to be enough given the vague requirements you put under 2). What exactly would you like to see?
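(Aside: the memory and IO trade-off quoted earlier in the thread, caching a ~11 KB Clearspring ICardinality per table versus re-reading ~10 KB of stats from disk at ~2-3 ms per SSTable, is simple arithmetic. A small illustrative sketch using those quoted figures as inputs:)

```java
// Rough arithmetic behind the numbers quoted in this thread. The inputs
// (~11 KB per cached estimator, ~2-3 ms per on-disk stats read) are the
// figures mentioned above, not measurements of my own.
public class CacheVsDiskMath {
    static long cacheFootprintMb(int sstables, int bytesPerEstimator) {
        return (long) sstables * bytesPerEstimator / (1024 * 1024);
    }

    static double diskSecondsPerMetricRead(int sstables, double msPerRead) {
        return sstables * msPerRead / 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(cacheFootprintMb(10_000, 11 * 1024));   // 107 -> "like 100MB"
        System.out.println(cacheFootprintMb(50_000, 11 * 1024));   // 537 -> "500MB"
        System.out.println(diskSecondsPerMetricRead(10_000, 2.5)); // 25.0 -> "20/30 seconds"
    }
}
```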
>> The way it looks to me, this is just being shut down because of the perceived churn it would cause, and there will always be some argument against it.

>> Based on (1) I don't think what we have is bug free.

>> Jeff:

>> Thank you for that answer. I think we are on the same page that caching it is just fine; that's what I got from your last two paragraphs.

>> So the path from here is

>> 1) add datasketches and cache
>> 2) don't add datasketches and cache it anyway

>> The introduction of the datasketches lib is not an absolute must in order to achieve that; we can cache / compute it in parallel with Clearspring as well. It is just a bittersweet solution which doesn't feel right.

>> (1) https://github.com/addthis/stream-lib/issues

>> On Thu, Jan 2, 2025 at 9:26 PM Benedict <bened...@apache.org> wrote:

>>> Your message seemed to be all about the caching proposal, which I have proposed we separate, hence my confusion.

>>> To restate my answer to your question: unless the new library actually offers concrete benefits that we can point to and actually care about, then yes, it’s a bad idea to incur the churn of migration.

>>> I’m not inherently opposed to a migration, but simply “new is better” is just plain wrong. Nothing you’ve presented yet convinces me this library is worth the effort of vetting given our current solution works fine.

>>> My position is that for any new library we should:

>>> 1) Point to something it solves that we actually want and is worth the time investment
>>> 2) Solicit folk in the community competent in the relevant data structures to vet the library for the proposed functionality

>>> The existing solution never went through (2) because it dates from the dark ages when we just threw dependencies in willy-nilly. But it has the benefit of having been used for a very long time without incident.
>>> On 2 Jan 2025, at 20:12, Štefan Miklošovič <smikloso...@apache.org> wrote:

>>> Hi Benedict,

>>> you wrote:

>>> I am strongly opposed to updating libraries simply for the sake of it. Something like HLL does not need much ongoing maintenance if it works. We’re simply asking for extra work and bugs by switching, and some risk without understanding the quality control for the new library project’s releases.

>>> I understand this. But really, do you think it is a bad idea to switch to a well-maintained library which is already used quite widely (the website mentions extensions for sketches in Apache Druid, Hive, Pig, Pinot and PostgreSQL), rather than keep using a library which has been abandoned for 6 years?

>>> As I mentioned, there is also an extensive comparison with Clearspring (1) where all the performance benefits / speedups etc. are presented in detail, with charts attached.

>>> I think this is a mature project, under Apache, so if we think that a 6-year-old, abandoned library is better than what Apache Datasketches provides, then the question is what we are doing here. Do we not trust what Apache itself offers, so that we need to rely on a 6-year-old, dead library instead? That lib has 3k commits and releases often; it's a pretty active project ...

>>> I don't say that we should not test more deeply how it behaves; we might even re-consider the parameters of hyperloglog as we do that. But I don't think that introducing this library would cause some kind of widespread / systemic risk.

>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html

>>> On Thu, Jan 2, 2025 at 5:03 PM Benedict <bened...@apache.org> wrote:

>>>> I am strongly opposed to updating libraries simply for the sake of it. Something like HLL does not need much ongoing maintenance if it works.
>>>> We’re simply asking for extra work and bugs by switching, and some risk without understanding the quality control for the new library project’s releases.

>>>> That said, I was not very impressed with the Clearspring library when I looked at it, so I would be open to a stronger argument about Datasketches being otherwise superior in a way that matters to us.

>>>> If we are to replace the library, we should at the very least do proper due diligence by reviewing the new library’s implementation(s) ourselves. We cannot simply assume the new library behaves well for our use cases, or is well maintained.

>>>> We should also not use the fallback intersection method, as this would represent a regression for compaction on upgrade. We should really convert from one HLL to another.

>>>> The proposal to reduce allocations appears to be orthogonal to this library, so let’s separate out that discussion. If there’s evidence this library alone improves the memory profile, let’s discuss that.

>>>> On 2 Jan 2025, at 15:26, Chris Lohfink <clohfin...@gmail.com> wrote:

>>>> I think switching to datasketches is a good idea, first off simply because of the lack of maintenance and improvements from clearspring. I am, however, not sold that it will actually improve anything significantly. Caches might help in small cases, but those small cases are probably not actually impacted. In the large cases the caches cost more in complexity and memory, and ultimately won't matter when there are 50k sstables and the cache holds 1k, so everything is hitting disk anyway.

>>>> The 5% is missing some relevant information, like what the allocation rate was, how many tables there are, etc. On an idle system that's meaningless; if there were 5GB/s of allocations from reads/writes happening at the time, that's huge.
>>>> On Thu, Jan 2, 2025 at 8:42 AM Štefan Miklošovič <smikloso...@apache.org> wrote:

>>>>> Interesting, thanks for this. Well ... 5% here, 5% there ... it compounds. I think it is worth trying to do something with this. It would be great if you were part of this effort!

>>>>> On Thu, Jan 2, 2025 at 3:38 PM Dmitry Konstantinov <netud...@gmail.com> wrote:

>>>>>> I have seen this place in an async-profiler memory allocation profile in one of our production environments some time ago; it was visible but not dramatic, about 5% of allocations:
>>>>>> <image.png>

>>>>>> The amount of overhead also depends on the metric collection frequency (in my case it was once per 60 seconds).

>>>>>> Regards,
>>>>>> Dmitry

>>>>>> On Thu, 2 Jan 2025 at 14:21, Štefan Miklošovič <smikloso...@apache.org> wrote:

>>>>>>> Indeed, I plan to measure it and compare; maybe some bench test would be cool to add ...

>>>>>>> I strongly suspect that the primary reason for the slowness (if it is verified to be true) is us going to the disk every time and reading stats for every SSTable all over again.

>>>>>>> While the Datasketches docs say that it is way faster to update (1), we are living in a realm of nanoseconds here, and I don't think that by itself would make any meaningful difference when merging one hyperloglog with another as part of the partition rows estimation computation. The only place we are updating is SortableTableWriter#endPartition, which calls metadatacollector.addKey(key.getKey()), which eventually updates the estimator via cardinality#offerHashed.

>>>>>>> In other words, I think that going to the disk and reading it repeatedly is disproportionately more IO / time intensive than switching the hyperloglog implementation.
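(Aside: the cache-and-merge idea circling in this thread can be sketched in a few lines. The class and method names below are invented for illustration, not Cassandra code, and a real implementation would merge via the libraries' own union operations. But per-register max is the operation underneath, and the sketch also shows why eviction on SSTable removal suffices: nothing is ever "unmerged", removed tables simply stop contributing to the next merge.)

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (invented names, not Cassandra code) of caching
// one register array per SSTable in memory and merging on demand when
// the metric is read, instead of re-reading stats components from disk.
public class CardinalityCache {
    // one HLL register array per SSTable, keyed by a hypothetical id
    private final Map<String, int[]> registersBySstable = new ConcurrentHashMap<>();

    void onSstableAdded(String id, int[] registers) {
        registersBySstable.put(id, registers);
    }

    // SSTables removed by compaction just drop out of the cache;
    // there is no need to "unmerge" anything.
    void onSstableRemoved(String id) {
        registersBySstable.remove(id);
    }

    // Merge all cached sketches: HLL merge is a per-register max,
    // which is cheap relative to a disk read per SSTable.
    int[] mergedRegisters(int numRegisters) {
        int[] merged = new int[numRegisters];
        for (int[] regs : registersBySstable.values())
            for (int i = 0; i < numRegisters; i++)
                merged[i] = Math.max(merged[i], regs[i]);
        return merged;
    }
}
```

The per-register max also makes the "can not unmerge" point later in the thread concrete: once two sketches are combined with max, the contribution of either one can no longer be subtracted back out.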
>>>>>>> However, I still consider the replacement of the library important. I feel uneasy about staying with an abandoned library when there is clearly a well-maintained replacement.

>>>>>>> What we could do is cache all the cardinality estimators and just merge them when the metric is resolved. That is different from going to disk and deserializing the StatsComponents all over again.

>>>>>>> Then on SSTable removal, we would remove it from the cache too. I think there is some kind of an observer for when an SSTable is removed ...

>>>>>>> However, I am not sure I can just hold it all in memory; it works for laptop testing, but if we have thousands of SSTables with a non-trivial number of rows, things start to get interesting.

>>>>>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html - section HllSketch vs. HyperLogLogPlus Update Speed Behavior

>>>>>>> On Thu, Jan 2, 2025 at 2:46 PM Jon Haddad <j...@rustyrazorblade.com> wrote:

>>>>>>>> Sounds interesting. I took a look at the issue but I'm not seeing any data to back up "expensive". Can this be quantified a bit more?

>>>>>>>> Anytime we have a performance-related issue, there should be some data to back it up, even if it seems obvious.

>>>>>>>> Jon

>>>>>>>> On Thu, Jan 2, 2025 at 8:20 AM Štefan Miklošovič <smikloso...@apache.org> wrote:

>>>>>>>>> Hello,

>>>>>>>>> I just stumbled upon this library we are using for getting estimations of the number of partitions in an SSTable, which are used e.g. in the EstimatedPartitionCount metric. (1)

>>>>>>>>> A user reported in (1) that it is an expensive operation.
>>>>>>>>> When one looks into what it is doing, it calls SSTableReader.getApproximateKeyCount() (6), which basically goes to disk every single time: it loads all Stats components and looks into CompactionMetadata, where the cardinality estimator is located.

>>>>>>>>> We serialize the hyperloglog to disk as part of an SSTable, we deserialize it back at runtime for every SSTable in a CF, and we merge them all into one cardinality again.

>>>>>>>>> I do not think there is a way around this because of the nature of how a cardinality estimator (hyperloglog) works. We can not "cache it"; that would work only if we were just adding SSTables (we would simply merge again), but if we remove an SSTable as part of compaction, we can not "unmerge" it.

>>>>>>>>> That being said, we are currently using this library for hyperloglog (1), which was archived in summer 2020 and has seen no contributions for 6 years. That lib is dead.

>>>>>>>>> There is a very nice replacement for it (2), directly from Apache (!!!), and they even give a detailed and in-depth comparison (3) with the hyperloglog implementation found in stream-lib, which we happen to use (stream-lib = Clearspring), where they say that updating is way faster and that it also gives better estimations in general.

>>>>>>>>> I have implemented the usage of both cardinality estimators (4), (5). The reason we need to keep the old one around is that we may have old SSTables and we need to work with them too. That translates to a new SSTable version (ob) which uses the new implementation; for versions < ob, the old one is used.
>>>>>>>>> When SSTables are upgraded from oa to ob, the old estimator will not be used anymore.

>>>>>>>>> There is also the case of a user not upgrading his oa SSTables, turning a node on, and creating new SSTables with the ob version. When this happens and we ask for the cardinality (e.g. via nodetool tablestats), I check whether all SSTables are on the same version or not. If they are, either the old or the new estimator is used (we can not merge estimations from two different hyperloglog implementations). If they are not, it is computed from index summaries. (The computation from index summaries was already in place (6) as a fail-over in case the estimation computation failed / was not present.)

>>>>>>>>> Does this all make sense to drive further to completion and eventually merge this work to trunk?

>>>>>>>>> Worth adding that Apache Datasketches means just two dependencies for us, and it has zero external dependencies of its own.

>>>>>>>>> (1) https://github.com/addthis/stream-lib
>>>>>>>>> (2) https://datasketches.apache.org/
>>>>>>>>> (3) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
>>>>>>>>> (4) https://issues.apache.org/jira/browse/CASSANDRA-13338
>>>>>>>>> (5) https://github.com/apache/cassandra/pull/3767
>>>>>>>>> (6) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L284-L338

>>>>>>>>> Regards

>>>>>> --
>>>>>> Dmitry Konstantinov
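(Aside: the mixed-version fallback described in the last message can be sketched as follows. This is illustrative only, with invented names; the real logic lives around SSTableReader.getApproximateKeyCount(), and a real same-version merge would union the sketches rather than sum per-table estimates, since a union accounts for keys shared between SSTables.)

```java
import java.util.List;

// Illustrative-only sketch (invented names) of the version-gating
// fallback described above: combine cardinality estimates only when
// every SSTable carries the same format version ("oa" vs "ob");
// with mixed versions, fall back to index-summary counts, because
// sketches from two different HLL implementations cannot be merged.
public class ApproximateKeyCount {
    record SSTable(String version, long estimatedKeys, long indexSummaryKeys) {}

    static long approximateKeyCount(List<SSTable> sstables) {
        boolean sameVersion = sstables.stream()
                .map(SSTable::version)
                .distinct()
                .count() <= 1;
        if (sameVersion) {
            // all "oa" or all "ob": safe to combine estimators of one kind
            // (summing stands in for a real sketch union here and ignores
            // any key overlap between sstables)
            return sstables.stream().mapToLong(SSTable::estimatedKeys).sum();
        }
        // mixed versions: fall back to the index-summary counts
        return sstables.stream().mapToLong(SSTable::indexSummaryKeys).sum();
    }
}
```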