Right ... that sounds reasonable. Let's "sleep on it" for a while. It is not urgent to deal with right now, but I quite often find myself identifying functionality where we go to the disk more often than necessary, and this was next on the list to look at after reading CASSANDRA-13338. So I took a look ... and here we are.
If you guys are going to bump the SSTable version in 5.1 / 6.0, this change might just ship with that too.

On Fri, Jan 3, 2025 at 1:47 PM Benedict <bened...@apache.org> wrote:

> I’ve had a quick skim of the data sketches library, and it does seem to have made some more efficient decisions in its design than clearspring, appears to maybe support off-heap representations, and has reasonably good documentation about the theoretical properties of the sketches. The chair of the project is a published author on the topic, and the library has newer algorithms for cardinality estimation than HLL.

> So, honestly, it might not be a bad idea to (carefully) consider a migration, even if the current library isn’t broken for our needs.

> It would not be high up my priority list for the project, but I would support it if it scratches someone’s itch.

> On 3 Jan 2025, at 12:16, Štefan Miklošovič <smikloso...@apache.org> wrote:

>> Okay ... first problems.

>> These 2000 bytes I have mentioned in my response to Chris were indeed correct, but that was with Datasketches and the main parameter for HllSketch (DEFAULT_LG_K) was 12. When I changed that to 13 to match what we currently have in Cassandra with Clearspring, that doubled the size to ~4000 bytes.

>> When we do not use Datasketches, what Clearspring generates is about ~5000 bytes for the array itself, but that array is wrapped into an ICardinality object of Clearspring, and we need that object in order to merge another ICardinality into it. So we would need to cache this ICardinality object instead of just the array itself. If we don't cache the whole ICardinality, we would then need to do basically what CompactionMetadata.CompactionMetadataSerializer.deserialize is doing, which would allocate a lot / often (ICardinality cardinality = HyperLogLogPlus.Builder.build(that_cached_array)).
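(Aside: the doubling described above is just register arithmetic. Below is a back-of-envelope sketch, assuming DataSketches' dense HLL_4 layout of 4 bits per register and stream-lib's RegisterSet layout of six 5-bit registers per 32-bit word, and ignoring the small serialization headers, so real on-disk sizes are slightly larger.)

```java
// Back-of-envelope register-array sizes for the two libraries discussed
// in this thread. Assumptions (not taken from the thread itself): dense
// DataSketches HLL_4 packs 4 bits per register; Clearspring's RegisterSet
// packs six 5-bit registers into each 32-bit word.
public class HllSizeMath {
    static long hll4RegisterBytes(int lgK) {
        long registers = 1L << lgK;  // 2^lgK registers
        return registers * 4 / 8;    // 4 bits per register
    }

    static long clearspringRegisterSetBytes(int p) {
        long registers = 1L << p;          // 2^p registers
        long words = (registers + 5) / 6;  // six 5-bit registers per int word
        return words * 4;                  // 4 bytes per word
    }

    public static void main(String[] args) {
        System.out.println(hll4RegisterBytes(12));          // 2048 -> the "~2000 bytes"
        System.out.println(hll4RegisterBytes(13));          // 4096 -> the "~4000 bytes"
        System.out.println(clearspringRegisterSetBytes(13)); // 5464 -> near the "~5000 bytes"
    }
}
```

The numbers line up with the sizes quoted above for lgK / p = 12 and 13.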
>> To avoid the allocations every time we compute, we would just cache the whole ICardinality of Clearspring, but that whole object measures like 11/12 KB. So even 10k tables would occupy like 100MB; 50k tables, 500MB. That is becoming quite a problem.

>> On the other hand, HllSketch of Datasketches, array included, adds minimal overhead. Like, an array of 5000 bytes becomes a whole object of about 5500. You get the idea ...

>> If we are still OK with these sizes, sure ... I am just being transparent about the consequences here.

>> A user would just opt in to this (by default it would be turned off).

>> On the other hand, if we have 10k SSTables, reading that 10+KB from disk takes around 2-3ms per SSTable, so we would spend 20/30 seconds reading the disk every time we hit that metric (and we haven't even started to merge the logs).

>> If this is still not something which would sell Datasketches as a viable alternative, then I guess we need to stick to these numbers and cache it all with Clearspring, occupying way more memory.

>> On Thu, Jan 2, 2025 at 10:15 PM Benedict <bened...@apache.org> wrote:

>> I would like to see somebody who has some experience writing data structures, preferably someone we trust as a community to be competent at this (i.e. having some experience contributing to the project at this level), look at the code as if they were at least lightly reviewing the feature as a contribution to this project.

>> This should be the bar for any new library really, but triply so for replacing a library that works fine.

>> On 2 Jan 2025, at 21:02, Štefan Miklošovič <smikloso...@apache.org> wrote:

>> Point 2) is pretty hard to fulfil; I can not imagine what would be "enough" to persuade you. What should concretely happen? Whoever comes and says "yeah, this is a good lib, it works" is probably not going to be enough given the vague requirements you put under 2). What exactly would you like to see?
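(Aside: the memory and IO trade-off quoted earlier in the thread, caching a ~11 KB Clearspring ICardinality per table versus re-reading ~10 KB of stats from disk at ~2-3 ms per SSTable, is simple arithmetic. A small illustrative sketch using those quoted figures as inputs:)

```java
// Rough arithmetic behind the numbers quoted in this thread. The inputs
// (~11 KB per cached estimator, ~2-3 ms per on-disk stats read) are the
// figures mentioned above, not measurements of my own.
public class CacheVsDiskMath {
    static long cacheFootprintMb(int sstables, int bytesPerEstimator) {
        return (long) sstables * bytesPerEstimator / (1024 * 1024);
    }

    static double diskSecondsPerMetricRead(int sstables, double msPerRead) {
        return sstables * msPerRead / 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(cacheFootprintMb(10_000, 11 * 1024));   // 107 -> "like 100MB"
        System.out.println(cacheFootprintMb(50_000, 11 * 1024));   // 537 -> "500MB"
        System.out.println(diskSecondsPerMetricRead(10_000, 2.5)); // 25.0 -> "20/30 seconds"
    }
}
```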
>> The way it looks to me, this is just being shut down because of the perceived churn it would cause, and there will always be some argument against it.

>> Based on (1) I don't think what we have is bug free.

>> Jeff:

>> Thank you for that answer. I think we are on the same page that caching it is just fine; that's what I got from your last two paragraphs.

>> So the path from here is

>> 1) add datasketches and cache
>> 2) don't add datasketches and cache it anyway

>> The introduction of the datasketches lib is not an absolute must in order to achieve that; we can cache / compute it in parallel with Clearspring as well. It is just a bittersweet solution which doesn't feel right.

>> (1) https://github.com/addthis/stream-lib/issues

>> On Thu, Jan 2, 2025 at 9:26 PM Benedict <bened...@apache.org> wrote:

>>> Your message seemed to be all about the caching proposal, which I have proposed we separate, hence my confusion.

>>> To restate my answer to your question: unless the new library actually offers concrete benefits that we can point to and actually care about, then yes, it’s a bad idea to incur the churn of migration.

>>> I’m not inherently opposed to a migration, but simply “new is better” is just plain wrong. Nothing you’ve presented yet convinces me this library is worth the effort of vetting given our current solution works fine.

>>> My position is that for any new library we should:

>>> 1) Point to something it solves that we actually want and is worth the time investment
>>> 2) Solicit folk in the community competent in the relevant data structures to vet the library for the proposed functionality

>>> The existing solution never went through (2) because it dates from the dark ages when we just threw dependencies in willy-nilly. But it has the benefit of having been used for a very long time without incident.
>>> On 2 Jan 2025, at 20:12, Štefan Miklošovič <smikloso...@apache.org> wrote:

>>> Hi Benedict,

>>> you wrote:

>>> I am strongly opposed to updating libraries simply for the sake of it. Something like HLL does not need much ongoing maintenance if it works. We’re simply asking for extra work and bugs by switching, and some risk without understanding the quality control for the new library project’s releases.

>>> I understand this. But really, do you think it is a bad idea to switch to a well-maintained library which is already used quite widely (the website mentions extensions for sketches in Apache Druid, Hive, Pig, Pinot and PostgreSQL), rather than keep using a library which has been abandoned for 6 years?

>>> As I mentioned, there is also an extensive comparison with Clearspring (1) where all the performance benefits / speedups etc. are presented in detail, with charts attached.

>>> I think this is a mature project, under Apache, so if we think that a 6-year-old, abandoned library is better than what Apache Datasketches provides, then the question is what we are doing here. Do we not trust what Apache itself offers, so that we need to rely on a 6-year-old, dead library instead? That lib has 3k commits and releases often; it's a pretty active project ...

>>> I don't say that we should not test more deeply how it behaves; we might even re-consider the parameters of hyperloglog as we do that. But I don't think that introducing this library would cause some kind of widespread / systemic risk.

>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html

>>> On Thu, Jan 2, 2025 at 5:03 PM Benedict <bened...@apache.org> wrote:

>>>> I am strongly opposed to updating libraries simply for the sake of it. Something like HLL does not need much ongoing maintenance if it works.
>>>> We’re simply asking for extra work and bugs by switching, and some risk without understanding the quality control for the new library project’s releases.

>>>> That said, I was not very impressed with the Clearspring library when I looked at it, so I would be open to a stronger argument about Datasketches being otherwise superior in a way that matters to us.

>>>> If we are to replace the library, we should at the very least do proper due diligence by reviewing the new library’s implementation(s) ourselves. We cannot simply assume the new library behaves well for our use cases, or is well maintained.

>>>> We should also not use the fallback intersection method, as this would represent a regression for compaction on upgrade. We should really convert from one HLL to another.

>>>> The proposal to reduce allocations appears to be orthogonal to this library, so let’s separate out that discussion. If there’s evidence this library alone improves the memory profile, let’s discuss that.

>>>> On 2 Jan 2025, at 15:26, Chris Lohfink <clohfin...@gmail.com> wrote:

>>>> I think switching to datasketches is a good idea, first off simply because of the lack of maintenance and improvements from clearspring. I am, however, not sold that it will actually improve anything significantly. Caches might help in small cases, but those small cases are probably not actually impacted. In the large cases the caches cost more in complexity and memory, and ultimately won't matter when there are 50k sstables and the cache holds 1k, so everything is hitting disk anyway.

>>>> The 5% is missing some relevant information, like what the allocation rate was, how many tables there are, etc. On an idle system that's meaningless; if there were 5GB/s of allocations from reads/writes happening at the time, that's huge.
>>>> On Thu, Jan 2, 2025 at 8:42 AM Štefan Miklošovič <smikloso...@apache.org> wrote:

>>>>> Interesting, thanks for this. Well ... 5% here, 5% there ... it compounds. I think it is worth trying to do something with this. It would be great if you were part of this effort!

>>>>> On Thu, Jan 2, 2025 at 3:38 PM Dmitry Konstantinov <netud...@gmail.com> wrote:

>>>>>> I have seen this place in an async-profiler memory allocation profile in one of our production environments some time ago; it was visible but not dramatic, about 5% of allocations:
>>>>>> <image.png>

>>>>>> The amount of overhead also depends on the metric collection frequency (in my case it was once per 60 seconds).

>>>>>> Regards,
>>>>>> Dmitry

>>>>>> On Thu, 2 Jan 2025 at 14:21, Štefan Miklošovič <smikloso...@apache.org> wrote:

>>>>>>> Indeed, I plan to measure it and compare; maybe some bench test would be cool to add ...

>>>>>>> I strongly suspect that the primary reason for the slowness (if it is verified to be true) is us going to the disk every time and reading stats for every SSTable all over again.

>>>>>>> While the Datasketches docs say that it is way faster to update (1), we are living in a realm of nanoseconds here, and I don't think that by itself would make any meaningful difference when merging one hyperloglog with another as part of the partition rows estimation computation. The only place we are updating is SortableTableWriter#endPartition, which calls metadatacollector.addKey(key.getKey()), which eventually updates the estimator via cardinality#offerHashed.

>>>>>>> In other words, I think that going to the disk and reading it repeatedly is disproportionately more IO / time intensive than switching the hyperloglog implementation.
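(Aside: the cache-and-merge idea circling in this thread can be sketched in a few lines. The class and method names below are invented for illustration, not Cassandra code, and a real implementation would merge via the libraries' own union operations. But per-register max is the operation underneath, and the sketch also shows why eviction on SSTable removal suffices: nothing is ever "unmerged", removed tables simply stop contributing to the next merge.)

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (invented names, not Cassandra code) of caching
// one register array per SSTable in memory and merging on demand when
// the metric is read, instead of re-reading stats components from disk.
public class CardinalityCache {
    // one HLL register array per SSTable, keyed by a hypothetical id
    private final Map<String, int[]> registersBySstable = new ConcurrentHashMap<>();

    void onSstableAdded(String id, int[] registers) {
        registersBySstable.put(id, registers);
    }

    // SSTables removed by compaction just drop out of the cache;
    // there is no need to "unmerge" anything.
    void onSstableRemoved(String id) {
        registersBySstable.remove(id);
    }

    // Merge all cached sketches: HLL merge is a per-register max,
    // which is cheap relative to a disk read per SSTable.
    int[] mergedRegisters(int numRegisters) {
        int[] merged = new int[numRegisters];
        for (int[] regs : registersBySstable.values())
            for (int i = 0; i < numRegisters; i++)
                merged[i] = Math.max(merged[i], regs[i]);
        return merged;
    }
}
```

The per-register max also makes the "can not unmerge" point later in the thread concrete: once two sketches are combined with max, the contribution of either one can no longer be subtracted back out.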
>>>>>>> However, I still consider the replacement of the library important. I feel uneasy about staying with an abandoned library when there is clearly a well-maintained replacement.

>>>>>>> What we could do is cache all the cardinality estimators and just merge them when the metric is resolved. That is different from going to disk and deserializing the StatsComponents all over again.

>>>>>>> Then on SSTable removal, we would remove it from the cache too. I think there is some kind of an observer for when an SSTable is removed ...

>>>>>>> However, I am not sure I can just hold it all in memory; it works for laptop testing, but if we have thousands of SSTables with a non-trivial number of rows, things start to get interesting.

>>>>>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html - section HllSketch vs. HyperLogLogPlus Update Speed Behavior

>>>>>>> On Thu, Jan 2, 2025 at 2:46 PM Jon Haddad <j...@rustyrazorblade.com> wrote:

>>>>>>>> Sounds interesting. I took a look at the issue but I'm not seeing any data to back up "expensive". Can this be quantified a bit more?

>>>>>>>> Anytime we have a performance-related issue, there should be some data to back it up, even if it seems obvious.

>>>>>>>> Jon

>>>>>>>> On Thu, Jan 2, 2025 at 8:20 AM Štefan Miklošovič <smikloso...@apache.org> wrote:

>>>>>>>>> Hello,

>>>>>>>>> I just stumbled upon this library we are using for getting estimations of the number of partitions in an SSTable, which are used e.g. in the EstimatedPartitionCount metric. (1)

>>>>>>>>> A user reported in (1) that it is an expensive operation.
>>>>>>>>> When one looks into what it is doing, it calls SSTableReader.getApproximateKeyCount() (6), which basically goes to disk every single time: it loads all Stats components and looks into CompactionMetadata, where the cardinality estimator is located.

>>>>>>>>> We serialize the hyperloglog to disk as part of an SSTable, we deserialize it back at runtime for every SSTable in a CF, and we merge them all into one cardinality again.

>>>>>>>>> I do not think there is a way around this because of the nature of how a cardinality estimator (hyperloglog) works. We can not "cache it"; that would work only if we were just adding SSTables (we would simply merge again), but if we remove an SSTable as part of compaction, we can not "unmerge" it.

>>>>>>>>> That being said, we are currently using this library for hyperloglog (1), which was archived in summer 2020 and has seen no contributions for 6 years. That lib is dead.

>>>>>>>>> There is a very nice replacement for it (2), directly from Apache (!!!), and they even give a detailed and in-depth comparison (3) with the hyperloglog implementation found in stream-lib, which we happen to use (stream-lib = Clearspring), where they say that updating is way faster and that it also gives better estimations in general.

>>>>>>>>> I have implemented the usage of both cardinality estimators (4), (5). The reason we need to keep the old one around is that we may have old SSTables and we need to work with them too. That translates to a new SSTable version (ob) which uses the new implementation; for versions < ob, the old one is used.
>>>>>>>>> When SSTables are upgraded from oa to ob, the old estimator will not be used anymore.

>>>>>>>>> There is also the case of a user not upgrading his oa SSTables, turning a node on, and creating new SSTables with the ob version. When this happens and we ask for the cardinality (e.g. via nodetool tablestats), I check whether all SSTables are on the same version or not. If they are, either the old or the new estimator is used (we can not merge estimations from two different hyperloglog implementations). If they are not, it is computed from index summaries. (The computation from index summaries was already in place (6) as a fail-over in case the estimation computation failed / was not present.)

>>>>>>>>> Does this all make sense to drive further to completion and eventually merge this work to trunk?

>>>>>>>>> Worth adding that Apache Datasketches means just two dependencies for us, and it has zero external dependencies of its own.

>>>>>>>>> (1) https://github.com/addthis/stream-lib
>>>>>>>>> (2) https://datasketches.apache.org/
>>>>>>>>> (3) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
>>>>>>>>> (4) https://issues.apache.org/jira/browse/CASSANDRA-13338
>>>>>>>>> (5) https://github.com/apache/cassandra/pull/3767
>>>>>>>>> (6) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L284-L338

>>>>>>>>> Regards

>>>>>> --
>>>>>> Dmitry Konstantinov
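(Aside: the mixed-version fallback described in the last message can be sketched as follows. This is illustrative only, with invented names; the real logic lives around SSTableReader.getApproximateKeyCount(), and a real same-version merge would union the sketches rather than sum per-table estimates, since a union accounts for keys shared between SSTables.)

```java
import java.util.List;

// Illustrative-only sketch (invented names) of the version-gating
// fallback described above: combine cardinality estimates only when
// every SSTable carries the same format version ("oa" vs "ob");
// with mixed versions, fall back to index-summary counts, because
// sketches from two different HLL implementations cannot be merged.
public class ApproximateKeyCount {
    record SSTable(String version, long estimatedKeys, long indexSummaryKeys) {}

    static long approximateKeyCount(List<SSTable> sstables) {
        boolean sameVersion = sstables.stream()
                .map(SSTable::version)
                .distinct()
                .count() <= 1;
        if (sameVersion) {
            // all "oa" or all "ob": safe to combine estimators of one kind
            // (summing stands in for a real sketch union here and ignores
            // any key overlap between sstables)
            return sstables.stream().mapToLong(SSTable::estimatedKeys).sum();
        }
        // mixed versions: fall back to the index-summary counts
        return sstables.stream().mapToLong(SSTable::indexSummaryKeys).sum();
    }
}
```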