I think it makes sense to make the options more clear. I would suggest a Google sheet or a table within a JIRA ticket with the options and a comparison (it looks like the majority of the confusion in this topic is caused by different ways of interpreting the suggestion :-) ). I see a table like this:

+----------------------------------------------------------------------+-----------------------------------------+------------------------------------------+
| Aspect                                                               | cache estimator in memory + Clearspring | cache estimator in memory + DataSketches |
+----------------------------------------------------------------------+-----------------------------------------+------------------------------------------+
| - license                                                            | Apache 2.0                              | Apache 2.0                               |
| - is the library supported                                           | no                                      | yes                                      |
| - popularity (stars / projects which use it)                         | ?                                       | ?                                        |
| - dependencies                                                       | 1                                       | 0                                        |
| - regression risks                                                   | low                                     | high                                     |
| - estimation accuracy                                                | ?                                       | ?                                        |
| - efficiency (time/cpu): to build the estimator                      | ?                                       | ?                                        |
| - efficiency (time/cpu): to merge estimators to get per-table stats  | ?                                       | ?                                        |
| - format migration complexity                                        | no                                      | yes                                      |
| - memory allocation overheads                                        | ?                                       | ?                                        |
+----------------------------------------------------------------------+-----------------------------------------+------------------------------------------+
Regards,
Dmitry

On Thu, 2 Jan 2025 at 20:12, Štefan Miklošovič <smikloso...@apache.org> wrote:

> Hi Benedict,
>
> you wrote:
>
> I am strongly opposed to updating libraries simply for the sake of it. Something like HLL does not need much ongoing maintenance if it works. We’re simply asking for extra work and bugs by switching, and some risk without understanding the quality control for the new library project’s releases.
>
> I understand this. But really, do you think it is a bad idea to switch to a well-maintained library which is already used quite widely (the website mentions extensions for sketches in Apache Druid, Hive, Pig, Pinot and PostgreSQL), rather than keep using a library which was abandoned 6 years ago?
>
> As I mentioned, there is also an extensive comparison with Clearspring (1) where all the performance benefits / speedups etc. are presented in detail, with charts attached.
>
> I think this is a mature project, under Apache, so if we think that a 6-year-old, abandoned library is better than what Apache DataSketches provides, then the question is: what are we doing here? Do we not trust what Apache itself offers, so that we need to rely on a 6-year-old, dead library instead? That lib has 3k commits, releases often, it's a pretty active project ...
>
> I am not saying that we should not test more deeply how it behaves; we might even reconsider the parameters of HyperLogLog as we do that. But I don't think that introducing this library would cause some kind of widespread / systemic risk.
>
> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
>
> On Thu, Jan 2, 2025 at 5:03 PM Benedict <bened...@apache.org> wrote:
>
>> I am strongly opposed to updating libraries simply for the sake of it. Something like HLL does not need much ongoing maintenance if it works.
>> We’re simply asking for extra work and bugs by switching, and some risk without understanding the quality control for the new library project’s releases.
>>
>> That said, I was not very impressed with the Clearspring library when I looked at it, so I would be open to a stronger argument about DataSketches being superior in a way that matters to us.
>>
>> If we are to replace the library, we should at the very least do proper due diligence by reviewing the new library’s implementation(s) ourselves. We cannot simply assume the new library behaves well for our use cases, or is well maintained.
>>
>> We should also not use the fallback intersection method, as this would represent a regression to compaction on upgrade. We should really convert from one HLL to the other.
>>
>> The proposal to reduce allocations appears to be orthogonal to this library, so let’s separate out that discussion? If there’s evidence this library alone improves the memory profile, let’s discuss that.
>>
>> On 2 Jan 2025, at 15:26, Chris Lohfink <clohfin...@gmail.com> wrote:
>>
>> I think switching to DataSketches is a good idea, first off simply because of the lack of maintenance and improvements from Clearspring. I am, however, not sold that it will actually improve anything significantly. Caches might help in small cases, but those small cases are probably not actually impacted. In the large cases the caches cost more in complexity and memory, and ultimately won't matter when there are 50k SSTables and the cache holds 1k, so everything is hitting disk anyway.
>>
>> The 5% figure is missing some relevant information, like what the allocation rate was, how many tables there are, etc. On an idle system that's meaningless; if there were 5 GB/s of allocations from reads/writes happening at the time, that's huge.
>>
>> On Thu, Jan 2, 2025 at 8:42 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>
>>> Interesting, thanks for this. Well ...
5% here, 5% there ... it compounds. I think it is worth trying to do something about this. It would be great if you were part of this effort!
>>>
>>> On Thu, Jan 2, 2025 at 3:38 PM Dmitry Konstantinov <netud...@gmail.com> wrote:
>>>
>>>> I have seen this place in an async-profiler memory allocation profile on one of our production environments some time ago; it was visible but not dramatic, about 5% of allocations:
>>>> <image.png>
>>>>
>>>> The amount of overhead also depends on the metric collection frequency (in my case it was once per 60 seconds).
>>>>
>>>> Regards,
>>>> Dmitry
>>>>
>>>> On Thu, 2 Jan 2025 at 14:21, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>
>>>>> Indeed, I plan to measure it and compare; maybe some bench test would be cool to add ...
>>>>>
>>>>> I strongly suspect that the primary reason for the slowness (if it is verified to be true) is us going to the disk every time and reading the stats for every SSTable all over again.
>>>>>
>>>>> While DataSketches says that it is way faster to update (1), we are living in a realm of nanoseconds here, and I don't think that by itself would make any meaningful difference when merging one HyperLogLog with another as part of the partition rows estimation computation. The only place we are updating is SortedTableWriter#endPartition, which calls metadataCollector.addKey(key.getKey()), which eventually updates the estimator via cardinality#offerHashed.
>>>>>
>>>>> In other words, I think that going to the disk and reading it repeatedly is disproportionately more IO / time intensive than switching the HyperLogLog implementation.
>>>>>
>>>>> However, I still consider the replacement of the library important. I feel uneasy about staying with an abandoned library when there is clearly a well-maintained replacement.
>>>>> What we could do is to cache all cardinality estimators and just merge them all when asked, upon metric resolution. That is different from going to disk to deserialize the StatsComponents all over again.
>>>>>
>>>>> Then on SSTable removal, we would remove that from the cache too. I think there is some kind of an observer for when an SSTable is removed ...
>>>>>
>>>>> However, I am not sure we can just hold it all in memory; it works for laptop testing, but if we have thousands of SSTables with a non-trivial number of rows, things start to get interesting.
>>>>>
>>>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html - section "HllSketch vs. HyperLogLogPlus Update Speed Behavior"
>>>>>
>>>>> On Thu, Jan 2, 2025 at 2:46 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>
>>>>>> Sounds interesting. I took a look at the issue, but I'm not seeing any data to back up "expensive". Can this be quantified a bit more?
>>>>>>
>>>>>> Anytime we have a performance-related issue, there should be some data to back it up, even if it seems obvious.
>>>>>>
>>>>>> Jon
>>>>>>
>>>>>> On Thu, Jan 2, 2025 at 8:20 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I just stumbled upon this library we are using for getting estimations of the number of partitions in an SSTable, which are used e.g. in the EstimatedPartitionCount metric. (1)
>>>>>>>
>>>>>>> A user reported in (1) that it is an expensive operation. When one looks into what it is doing, it calls SSTableReader.getApproximateKeyCount() (6), which basically goes to disk every single time; it loads all Stats components and looks into CompactionMetadata, where the cardinality estimator is located.
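[Editor's note] The caching idea discussed above (keep one estimator per live SSTable in memory, union them only when the metric is read, evict on SSTable removal) can be sketched as follows. This is an illustrative toy in Python, not Cassandra's actual Java API; `EstimatorCache`, `offer`, and the register layout are hypothetical stand-ins for Clearspring's `HyperLogLogPlus` / DataSketches' `HllSketch`:

```python
import hashlib

# Toy HyperLogLog with 2^P registers; purely illustrative.
P = 10

def empty_hll():
    return [0] * (1 << P)

def offer(registers, key: bytes):
    """Update the sketch with one partition key (hashed to 64 bits)."""
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    idx = h >> (64 - P)                       # first P bits pick the register
    rest = h & ((1 << (64 - P)) - 1)
    rank = (64 - P) - rest.bit_length() + 1   # leading zeros in remainder + 1
    registers[idx] = max(registers[idx], rank)

def merge(a, b):
    # HLL merge is an element-wise max of registers: cheap and lossless.
    return [max(x, y) for x, y in zip(a, b)]

class EstimatorCache:
    """Keep one estimator per live SSTable; union on demand at metric read."""
    def __init__(self):
        self.per_sstable = {}

    def add(self, sstable_id, registers):
        self.per_sstable[sstable_id] = registers

    def remove(self, sstable_id):
        # On compaction we drop whole per-SSTable entries instead of trying
        # to "unmerge" a combined sketch (which HLL cannot do).
        self.per_sstable.pop(sstable_id, None)

    def merged(self):
        out = empty_hll()
        for regs in self.per_sstable.values():
            out = merge(out, regs)
        return out
```

The design point is that the cache never stores a pre-merged, table-wide sketch: removal is handled by dropping a per-SSTable entry, and the table-level estimate is rebuilt by a fast in-memory union at read time instead of re-reading Stats components from disk.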
>>>>>>> We are serializing the HyperLogLog to disk as part of an SSTable, and at runtime we deserialize it back for every SSTable in a CF and merge them all into one cardinality again.
>>>>>>>
>>>>>>> I do not think there is a way around this because of the nature of how a cardinality estimator (HyperLogLog) works. We cannot simply "cache it"; that would work only if we were just adding SSTables - then we would just merge again - but if we remove an SSTable as part of compaction, we cannot "unmerge" it.
>>>>>>>
>>>>>>> That being said, we are currently using this library for HyperLogLog (1), which was archived in summer 2020, with nothing contributed to it for 6 years. That lib is dead.
>>>>>>>
>>>>>>> There is a very nice replacement for it (2), directly from Apache (!!!), and they even give a detailed and in-depth comparison with the HyperLogLog implementation found in stream-lib, which we happen to use (3) (stream-lib = Clearspring), showing that updating is way faster and that it also gives better estimations in general.
>>>>>>>
>>>>>>> I have implemented the usage of both cardinality estimators (4), (5). The reason we need to keep the old one around is that we may have old SSTables around and we need to work with them too. That translates to a new SSTable version (ob) which uses the new implementation; for versions < ob, it uses the old one. When SSTables are upgraded from oa to ob, the old estimator will not be used anymore.
>>>>>>>
>>>>>>> There is also the case of a user not upgrading his oa SSTables, turning a node on and creating new SSTables with the ob version. When this happens and we ask what the cardinality is (e.g. via nodetool tablestats), I am checking whether all SSTables are on the same version or not.
>>>>>>> If they are, they will use either the old or the new estimator (we cannot merge estimations from two different HyperLogLog implementations). If they are not, it will compute that from index summaries. (The computation from index summaries was already in place (6) as a fail-over in case the estimation computation failed / was not present.)
>>>>>>>
>>>>>>> Does this all make sense to drive further to completion and eventually merge this work to trunk?
>>>>>>>
>>>>>>> Worth adding that Apache DataSketches is just two dependencies for us; it has zero external dependencies.
>>>>>>>
>>>>>>> (1) https://github.com/addthis/stream-lib
>>>>>>> (2) https://datasketches.apache.org/
>>>>>>> (3) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
>>>>>>> (4) https://issues.apache.org/jira/browse/CASSANDRA-13338
>>>>>>> (5) https://github.com/apache/cassandra/pull/3767
>>>>>>> (6) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L284-L338
>>>>>>>
>>>>>>> Regards
>>>>>>
>>>>
>>>> --
>>>> Dmitry Konstantinov
>>>
-- Dmitry Konstantinov
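[Editor's note] A small numeric demonstration of the "we cannot unmerge" point raised in the thread. An HLL union takes the element-wise maximum of the registers, which destroys information: after compaction removes an SSTable, there is no unique "table minus SSTable" state to recover, which is why the per-SSTable sketches must be kept (or re-read) and re-merged. The three-register sketches below are made-up toy values:

```python
# HLL merge = element-wise max of registers (the real implementations,
# Clearspring's and DataSketches', do the same at their core).
def merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]

table = [3, 5, 2]      # merged registers for the whole table
removed = [1, 5, 2]    # registers of the SSTable compacted away

# Two different remainders both merge with `removed` into the same
# table-wide state, so "table - removed" has no unique answer:
assert merge([3, 5, 2], removed) == table
assert merge([3, 4, 1], removed) == table
```

Since max() is idempotent but not invertible, the only correct options after an SSTable removal are the ones discussed in the thread: re-merge the surviving per-SSTable estimators, or fall back to index summaries when the estimators are incompatible.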