I have summarized information from this mail thread to https://cwiki.apache.org/confluence/display/COC/SSTable%27s+partition+cardinality+implementation Probably later it can be transformed to a CEP.. Regarding experience of DataSketches library's authors and publications here there is a good summary in Background section: https://cwiki.apache.org/confluence/display/INCUBATOR/DataSketchesProposal . It looks good..
On Fri, 3 Jan 2025 at 13:06, Štefan Miklošovič <smikloso...@apache.org> wrote: > Right ... that sounds reasonable. Let's "sleep on it" for a while. It is > not something which is urgent to deal with right now but I find myself > quite often to identify the functionality where we go to the disk more > often than necessary and this was next on the list to take a look at > reading CASSANDRA-13338. So I took a look ... and here we are. > > If you guys go to bump SSTable version in 5.1 / 6.0, this change might be > just shipped with that too. > > On Fri, Jan 3, 2025 at 1:47 PM Benedict <bened...@apache.org> wrote: > >> I’ve had a quick skim of the data sketches library, and it does seem to >> have made some more efficient decisions in its design than clearspring, >> appears to maybe support off-heap representations, and has reasonably good >> documentation about the theoretical properties of the sketches. The chair >> of the project is a published author on the topic, and the library has >> newer algorithms for cardinality estimation than HLL. >> >> So, honestly, it might not be a bad idea to (carefully) consider a >> migration, even if the current library isn’t broken for our needs. >> >> It would not be high up my priority list for the project, but I would >> support it if it scratches someone’s itch. >> >> On 3 Jan 2025, at 12:16, Štefan Miklošovič <smikloso...@apache.org> >> wrote: >> >> >> Okay ... first problems. >> >> These 2000 bytes I have mentioned in my response to Chris were indeed >> correct, but that was with Datasketches and the main parameter for Hall >> Sketch (DEFAULT_LG_K) was 12. When I changed that to 13 to match what we >> currently have in Cassandra with Clearspring, that doubled the size to >> ~4000 bytes. >> >> When we do not use Datasketches, what Clearspring generates is about >> ~5000 bytes for the array itself but that array is wrapped into an >> ICardinality object of Clearspring and we need that object in order to >> merge another ICardinality into that. So, we would need to cache this >> ICardinality object instead of just an array itself. If we don't cache >> whole ICardinality, we would then need to do basically what >> CompactionMetadata.CompactionMetadataSerializer.deserialize is doing which >> would allocate a lot / often (ICardinality cardinality = >> HyperLogLogPlus.Builder.build(that_cached_array)). >> >> To avoid the allocations every time we compute, we would just cache that >> whole ICardinality of Clearspring, but that whole object measures like >> 11/12 KB. So even 10k tables would occupy like 100MB. 50k tables 500MB. >> That is becoming quite a problem. >> >> On the other hand, HllSketch of Datasketches, array included, adds >> minimal overhead. Like an array has 5000 bytes and the whole object like >> 5500. You got the idea ... >> >> If we are still OK with these sizes, sure ... I am just being transparent >> about the consequences here. >> >> A user would just opt-in into this (by default it would be turned off). >> >> On the other hand, if we have 10k SSTables, reading that 10+KB from disk >> takes around 2-3ms so we would read the disk 20/30 seconds every time we >> would hit that metric (and we haven't even started to merge the logs). >> >> If this is still not something which would sell Datasketches as a viable >> alternative then I guess we need to stick to these numbers and cache it all >> with Clearspring, occupying way more memory. >> >> On Thu, Jan 2, 2025 at 10:15 PM Benedict <bened...@apache.org> wrote: >> >>> I would like to see somebody who has some experience writing data >>> structures, preferably someone we trust as a community to be competent at >>> this (ie having some experience within the project contributing at this >>> level), look at the code like they were at least lightly reviewing the >>> feature as a contribution to this project. >>> >>> This should be the bar for any new library really, but triply so for >>> replacing a library that works fine. >>> >>> On 2 Jan 2025, at 21:02, Štefan Miklošovič <smikloso...@apache.org> >>> wrote: >>> >>> >>> Point 2) is pretty hard to fulfil, I can not imagine what would be >>> "enough" for you to be persuaded. What should concretely happen? Because >>> whoever comes and says "yeah this is a good lib, it works" is probably not >>> going to be enough given the vague requirements you put under 2) You would >>> like to see exactly what? >>> >>> The way it looks to me is to just shut it down because of perceived >>> churn caused by that and there will always be some argument against that. >>> >>> Based on (1) I don't think what we have is bug free. >>> >>> Jeff: >>> >>> Thank you for that answer, I think we are on the same page that caching >>> it is just fine, that's what I got from your last two paragraphs. >>> >>> So the path from here is >>> >>> 1) add datasketches and cache >>> 2) don't add datasketches and cache it anyway >>> >>> The introduction of datasketches lib is not the absolute must in order >>> to achieve that, we can cache / compute it parallel with Clearspring as >>> well, it is just a bitter-sweet solution which just doesn't feel right. >>> >>> (1) https://github.com/addthis/stream-lib/issues >>> >>> On Thu, Jan 2, 2025 at 9:26 PM Benedict <bened...@apache.org> wrote: >>> >>>> Your message seemed to be all about the caching proposal, which I have >>>> proposed we separate, hence my confusion. >>>> >>>> To restate my answer to your question, I think that unless the new >>>> library actually offers us concrete benefits we can point to that we >>>> actually care about then yes it’s a bad idea to incur the churn of >>>> migration. >>>> >>>> I’m not inherently opposed to a migration but simply “new is better” is >>>> just plain wrong. Nothing you’ve presented yet convinces me this library is >>>> worth the effort of vetting given our current solution works fine. >>>> >>>> My position is that for any new library we should: >>>> >>>> 1) Point to something it solves that we actually want and is worth the >>>> time investment >>>> 2) Solicit folk in the community competent in the relevant data >>>> structures to vet the library for the proposed functionality >>>> >>>> The existing solution never went through (2) because it dates from the >>>> dark ages where we just threw dependencies in willynilly. But it has the >>>> benefit of having been used for a very long time without incident. >>>> >>>> >>>> >>>> On 2 Jan 2025, at 20:12, Štefan Miklošovič <smikloso...@apache.org> >>>> wrote: >>>> >>>> >>>> Hi Benedict, >>>> >>>> you wrote: >>>> >>>> I am strongly opposed to updating libraries simply for the sake of it. >>>> Something like HLL does not need much ongoing maintenance if it works. >>>> We’re simply asking for extra work and bugs by switching, and some risk >>>> without understanding the quality control for the new library project’s >>>> releases. >>>> >>>> I understand this. But really, do you think that it is a bad idea to >>>> switch to a well maintained library which is already used quite widely (the >>>> website mentions extensions for sketches in Apache Druid, Hive, Pig, Pinot >>>> and PostgreSQL) and using the library which was abandoned for 6 years? >>>> >>>> As I mentioned there is also extensive comparison with Clearspring (1) >>>> where all performance benefits / speedups etc present in detail with charts >>>> attached. >>>> >>>> I think this is a mature project, under Apache, so when we think that a >>>> 6 years old and abandoned library is better than what Apache Datasketches >>>> provides, then the question is what are we doing here? Are we not believing >>>> what Apache itself offers and we need to rely on a 6 years old and dead >>>> library instead of that? Huh? That lib has 3k commits, releases often, it's >>>> a pretty active project ... >>>> >>>> I don't say that we should not test more deeply how it behaves, we >>>> might even re-consider the parameters of hyperloglog as we do that. But I >>>> don't think that having this library introduced would cause some kind of a >>>> widespread / systemic risk. >>>> >>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html >>>> >>>> On Thu, Jan 2, 2025 at 5:03 PM Benedict <bened...@apache.org> wrote: >>>> >>>>> I am strongly opposed to updating libraries simply for the sake of it. >>>>> Something like HLL does not need much ongoing maintenance if it works. >>>>> We’re simply asking for extra work and bugs by switching, and some risk >>>>> without understanding the quality control for the new library project’s >>>>> releases. >>>>> >>>>> That said, I was not very impressed with the clear spring library when >>>>> I looked at it, so I would be open to a stronger argument about data >>>>> sketches being superior otherwise in a way that matters to us. >>>>> >>>>> If we are to replace the library, we should at the very least do >>>>> proper due diligence by reviewing the new library’s implementation(s) >>>>> ourselves. We cannot simply assume the new library behaves well for our >>>>> use >>>>> cases, or is well maintained. >>>>> >>>>> We should also not use the fallback intersection method, as this would >>>>> represent a regression to compaction on upgrade. We should really convert >>>>> from one HLL to another. >>>>> >>>>> The proposal to reduce allocations appears to be orthogonal to this >>>>> library, so let’s separate out that discussion? If there’s evidence this >>>>> library alone improves the memory profile let’s discuss that. >>>>> >>>>> >>>>> On 2 Jan 2025, at 15:26, Chris Lohfink <clohfin...@gmail.com> wrote: >>>>> >>>>> >>>>> I think switching to datasketches is a good idea first off simply >>>>> because of the lack of maintenance and improvements from clearspring. I am >>>>> however, am not sold that it will actually improve anything significantly. >>>>> Caches might help on small cases, but those small cases probably are not >>>>> actually impacted. In the large cases the caches cost more in complexity, >>>>> memory, and ultimately wont matter when theres 50k sstables and the cache >>>>> holds 1k so everythings hitting disk anyway. >>>>> >>>>> The 5% is missing some relevant information like what the allocation >>>>> rate was, how many tables there are etc. On an idle system thats >>>>> meaningless, if there were 5gb/s allocations of reads/writes happening at >>>>> the time thats huge. >>>>> >>>>> On Thu, Jan 2, 2025 at 8:42 AM Štefan Miklošovič < >>>>> smikloso...@apache.org> wrote: >>>>> >>>>>> Interesting, thanks for this. Well ... 5% here, 5% there ... it >>>>>> compounds. I think it is worth trying to do something with this. Would be >>>>>> great if you were part of this effort! >>>>>> >>>>>> On Thu, Jan 2, 2025 at 3:38 PM Dmitry Konstantinov < >>>>>> netud...@gmail.com> wrote: >>>>>> >>>>>>> I have seen this place in async profiler memory allocation profile >>>>>>> on one of production environments some time ago, it was visible but not >>>>>>> dramatic, about 5% of allocations: >>>>>>> <image.png> >>>>>>> >>>>>>> The amount of overhead also depends on a metric collection frequency >>>>>>> (in my case it was once per 60 seconds) >>>>>>> >>>>>>> Regards, >>>>>>> Dmitry >>>>>>> >>>>>>> On Thu, 2 Jan 2025 at 14:21, Štefan Miklošovič < >>>>>>> smikloso...@apache.org> wrote: >>>>>>> >>>>>>>> Indeed, I plan to measure it and compare, maybe some bench test >>>>>>>> would be cool to add .. >>>>>>>> >>>>>>>> I strongly suspect that the primary reason for the slowness (if it >>>>>>>> is verified to be true) is us going to the disk every time and reading >>>>>>>> stats for every SSTable all over again. >>>>>>>> >>>>>>>> While datasketches say that it is way faster to update (1), we are >>>>>>>> living in a realm of nanoseconds here and I don't think that itself >>>>>>>> would >>>>>>>> make any meaningful difference when merging one hyperloglog with >>>>>>>> another as >>>>>>>> part of partition rows estimation computation. The only place we are >>>>>>>> updating is SortableTableWriter#endParition which calls >>>>>>>> metadatacollector.addKey(key.getKey()) which eventually updates the >>>>>>>> estimator via cardinality#offeredHashed. >>>>>>>> >>>>>>>> In other words, I think that going to the disk and reading it >>>>>>>> repeatedly is disproportionally more IO / time intensive than >>>>>>>> switching the >>>>>>>> hyperloglog implementation. >>>>>>>> >>>>>>>> However, I consider the replacement of the library still important. >>>>>>>> I feel uneasy about staying with an abandoned library where there is >>>>>>>> clearly a well-maintained replacement. >>>>>>>> >>>>>>>> What we could do is to cache all cardinality estimators and just >>>>>>>> merge it all when asked upon metric resolution. That is different from >>>>>>>> going to disk to deserialize StatsComponent's all over again. >>>>>>>> >>>>>>>> Then on SSTable removal, we would remove that from cache too. I >>>>>>>> think there is some kind of an observer when SSTable is removed ... >>>>>>>> >>>>>>>> However, I am not sure I can just hold it all in the memory, it >>>>>>>> works for laptop testing but if we have thousands of SSTables with >>>>>>>> non-trivial number of rows things start to get interesting. >>>>>>>> >>>>>>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html >>>>>>>> - section HllSketch vs. HyperLogLogPlus Update Speed Behavior >>>>>>>> >>>>>>>> On Thu, Jan 2, 2025 at 2:46 PM Jon Haddad <j...@rustyrazorblade.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Sounds interesting. I took a look at the issue but I'm not seeing >>>>>>>>> any data to back up "expensive". Can this be quantified a bit more? >>>>>>>>> >>>>>>>>> Anytime we have a performance related issue, there should be some >>>>>>>>> data to back it up, even if it seems obvious. >>>>>>>>> >>>>>>>>> Jon >>>>>>>>> >>>>>>>>> On Thu, Jan 2, 2025 at 8:20 AM Štefan Miklošovič < >>>>>>>>> smikloso...@apache.org> wrote: >>>>>>>>> >>>>>>>>>> Hello, >>>>>>>>>> >>>>>>>>>> I just stumbled upon this library we are using for getting >>>>>>>>>> estimations of the number of partitions in a SSTable which are used >>>>>>>>>> e.g. in >>>>>>>>>> EstimatedPartitionCount metric. (1) >>>>>>>>>> >>>>>>>>>> A user reported in (1) that it is an expensive operation. When >>>>>>>>>> one looks into what it is doing, it calls >>>>>>>>>> SSTableReader.getApproximateKeyCount() (6) which basically goes to >>>>>>>>>> disk >>>>>>>>>> every single time, it loads all Stats components and it looks into >>>>>>>>>> CompactionMetadata where the cardinality estimator is located. >>>>>>>>>> >>>>>>>>>> We are serializing the hyperloglog to disk as part of a SSTable >>>>>>>>>> and we deserialize it back in runtime for every SSTable in a CF and >>>>>>>>>> we >>>>>>>>>> merge them all to one cardinality again. >>>>>>>>>> >>>>>>>>>> I do not think there is a way around this because of the nature >>>>>>>>>> of how a cardinality estimator works (hyperloglog). We can not >>>>>>>>>> "cache it", >>>>>>>>>> it would work only in case we are adding SSTables only - hence we >>>>>>>>>> would >>>>>>>>>> just merge again - but if we remove an SSTable as part of the >>>>>>>>>> compaction, >>>>>>>>>> we can not "unmerge" it. >>>>>>>>>> >>>>>>>>>> That being said, we are currently using this library for >>>>>>>>>> hyperloglog (1) which was archived in summer 2020 and nothing was >>>>>>>>>> contributed to that for 6 years. That lib is dead. >>>>>>>>>> >>>>>>>>>> There is very nice replacement of that (2) directly from Apache >>>>>>>>>> (!!!) and they are even giving the detailed and in-depth comparison >>>>>>>>>> of >>>>>>>>>> hyperloglog implementation found in stream-lib we happen to use (3) >>>>>>>>>> (stream-lib = Clearspring) where they are saying that updating is way >>>>>>>>>> faster and it is also giving better estimations in general. >>>>>>>>>> >>>>>>>>>> I have implemented the usage of both cardinality estimators (4), >>>>>>>>>> (5). The reason we need to keep the old one around is that we may >>>>>>>>>> have old >>>>>>>>>> SSTables around and we need to work with them too. That translates >>>>>>>>>> to a new >>>>>>>>>> SSTable version (ob) which uses new implementation and for versions >>>>>>>>>> < ob, >>>>>>>>>> it uses the old one. When SSTables are upgraded from oa to ob, the >>>>>>>>>> old >>>>>>>>>> estimator will not be used anymore. >>>>>>>>>> >>>>>>>>>> There is also a case of a user not upgrading his oa SSTables, >>>>>>>>>> turning a node on and creating new SSTables with ob version. When >>>>>>>>>> this >>>>>>>>>> happens and we ask what is the cardinality (e.g via nodetool >>>>>>>>>> tablestats), I >>>>>>>>>> am checking if all SSTables are on the same version or not. If they >>>>>>>>>> are, >>>>>>>>>> they will use either an old or new estimator. (we can not merge >>>>>>>>>> estimations >>>>>>>>>> from two different hyperloglog implementations). If they are not, it >>>>>>>>>> will >>>>>>>>>> compute that from index summaries. (The computation for index >>>>>>>>>> summaries was >>>>>>>>>> already in place (6) as a fail-over in case the estimation >>>>>>>>>> computation >>>>>>>>>> failed / was not present). >>>>>>>>>> >>>>>>>>>> Does this all make sense to drive further to the completion and >>>>>>>>>> eventually merge this work to trunk? >>>>>>>>>> >>>>>>>>>> Worth to add that Apache Datasketches are just two dependencies >>>>>>>>>> for us, it has zero external dependencies. >>>>>>>>>> >>>>>>>>>> (1) https://github.com/addthis/stream-lib >>>>>>>>>> (2) https://datasketches.apache.org/ >>>>>>>>>> (3) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html >>>>>>>>>> (4) https://issues.apache.org/jira/browse/CASSANDRA-13338 >>>>>>>>>> (5) https://github.com/apache/cassandra/pull/3767 >>>>>>>>>> (6) >>>>>>>>>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L284-L338 >>>>>>>>>> >>>>>>>>>> Regards >>>>>>>>>> >>>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Dmitry Konstantinov >>>>>>> >>>>>> -- Dmitry Konstantinov