Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

Brian Proffitt Fri, 03 Jan 2025 06:12:51 -0800

Dmitry:

You are using a section of the Confluence wiki that is dedicated to Community 
Over Code, the Apache Conference. Please move that page to a more appropriate 
part of the Apache wiki as soon as you can.


Thanks!
BKP

On 2025/01/03 13:55:49 Dmitry Konstantinov wrote:
> I have summarized information from this mail thread to
> https://cwiki.apache.org/confluence/display/COC/SSTable%27s+partition+cardinality+implementation
> Probably later it can be transformed to a CEP..
> Regarding experience of DataSketches library's authors and publications
> here there is a good summary in Background section:
> https://cwiki.apache.org/confluence/display/INCUBATOR/DataSketchesProposal
> . It looks good..
> 
> On Fri, 3 Jan 2025 at 13:06, Štefan Miklošovič <[email protected]>
> wrote:
> 
> > Right ... that sounds reasonable. Let's "sleep on it" for a while. It is
> > not something which is urgent to deal with right now but I find myself
> > quite often to identify the functionality where we go to the disk more
> > often than necessary and this was next on the list to take a look at
> > reading CASSANDRA-13338. So I took a look ... and here we are.
> >
> > If you guys go to bump SSTable version in 5.1 / 6.0, this change might be
> > just shipped with that too.
> >
> > On Fri, Jan 3, 2025 at 1:47 PM Benedict <[email protected]> wrote:
> >
> >> I’ve had a quick skim of the data sketches library, and it does seem to
> >> have made some more efficient decisions in its design than clearspring,
> >> appears to maybe support off-heap representations, and has reasonably good
> >> documentation about the theoretical properties of the sketches. The chair
> >> of the project is a published author on the topic, and the library has
> >> newer algorithms for cardinality estimation than HLL.
> >>
> >> So, honestly, it might not be a bad idea to (carefully) consider a
> >> migration, even if the current library isn’t broken for our needs.
> >>
> >> It would not be high up my priority list for the project, but I would
> >> support it if it scratches someone’s itch.
> >>
> >> On 3 Jan 2025, at 12:16, Štefan Miklošovič <[email protected]>
> >> wrote:
> >>
> >> 
> >> Okay ... first problems.
> >>
> >> These 2000 bytes I have mentioned in my response to Chris were indeed
> >> correct, but that was with Datasketches and the main parameter for Hall
> >> Sketch (DEFAULT_LG_K) was 12. When I changed that to 13 to match what we
> >> currently have in Cassandra with Clearspring, that doubled the size to
> >> ~4000 bytes.
> >>
> >> When we do not use Datasketches, what Clearspring generates is about
> >> ~5000 bytes for the array itself but that array is wrapped into an
> >> ICardinality object of Clearspring and we need that object in order to
> >> merge another ICardinality into that. So, we would need to cache this
> >> ICardinality object instead of just an array itself. If we don't cache
> >> whole ICardinality, we would then need to do basically what
> >> CompactionMetadata.CompactionMetadataSerializer.deserialize is doing which
> >> would allocate a lot / often (ICardinality cardinality =
> >> HyperLogLogPlus.Builder.build(that_cached_array)).
> >>
> >> To avoid the allocations every time we compute, we would just cache that
> >> whole ICardinality of Clearspring, but that whole object measures like
> >> 11/12 KB. So even 10k tables would occupy like 100MB. 50k tables 500MB.
> >> That is becoming quite a problem.
> >>
> >> On the other hand, HllSketch of Datasketches, array included, adds
> >> minimal overhead. Like an array has 5000 bytes and the whole object like
> >> 5500. You got the idea ...
> >>
> >> If we are still OK with these sizes, sure ... I am just being transparent
> >> about the consequences here.
> >>
> >> A user would just opt-in into this (by default it would be turned off).
> >>
> >> On the other hand, if we have 10k SSTables, reading that 10+KB from disk
> >> takes around 2-3ms so we would read the disk 20/30 seconds every time we
> >> would hit that metric (and we haven't even started to merge the logs).
> >>
> >> If this is still not something which would sell Datasketches as a viable
> >> alternative then I guess we need to stick to these numbers and cache it all
> >> with Clearspring, occupying way more memory.
> >>
> >> On Thu, Jan 2, 2025 at 10:15 PM Benedict <[email protected]> wrote:
> >>
> >>> I would like to see somebody who has some experience writing data
> >>> structures, preferably someone we trust as a community to be competent at
> >>> this (ie having some experience within the project contributing at this
> >>> level), look at the code like they were at least lightly reviewing the
> >>> feature as a contribution to this project.
> >>>
> >>> This should be the bar for any new library really, but triply so for
> >>> replacing a library that works fine.
> >>>
> >>> On 2 Jan 2025, at 21:02, Štefan Miklošovič <[email protected]>
> >>> wrote:
> >>>
> >>> 
> >>> Point 2) is pretty hard to fulfil, I can not imagine what would be
> >>> "enough" for you to be persuaded. What should concretely happen? Because
> >>> whoever comes and says "yeah this is a good lib, it works" is probably not
> >>> going to be enough given the vague requirements you put under 2) You would
> >>> like to see exactly what?
> >>>
> >>> The way it looks to me is to just shut it down because of perceived
> >>> churn caused by that and there will always be some argument against that.
> >>>
> >>> Based on (1) I don't think what we have is bug free.
> >>>
> >>> Jeff:
> >>>
> >>> Thank you for that answer, I think we are on the same page that caching
> >>> it is just fine, that's what I got from your last two paragraphs.
> >>>
> >>> So the path from here is
> >>>
> >>> 1) add datasketches and cache
> >>> 2) don't add datasketches and cache it anyway
> >>>
> >>> The introduction of datasketches lib is not the absolute must in order
> >>> to achieve that, we can cache / compute it parallel with Clearspring as
> >>> well, it is just a bitter-sweet solution which just doesn't feel right.
> >>>
> >>> (1) https://github.com/addthis/stream-lib/issues
> >>>
> >>> On Thu, Jan 2, 2025 at 9:26 PM Benedict <[email protected]> wrote:
> >>>
> >>>> Your message seemed to be all about the caching proposal, which I have
> >>>> proposed we separate, hence my confusion.
> >>>>
> >>>> To restate my answer to your question, I think that unless the new
> >>>> library actually offers us concrete benefits we can point to that we
> >>>> actually care about then yes it’s a bad idea to incur the churn of
> >>>> migration.
> >>>>
> >>>> I’m not inherently opposed to a migration but simply “new is better” is
> >>>> just plain wrong. Nothing you’ve presented yet convinces me this library 
> >>>> is
> >>>> worth the effort of vetting given our current solution works fine.
> >>>>
> >>>> My position is that for any new library we should:
> >>>>
> >>>> 1) Point to something it solves that we actually want and is worth the
> >>>> time investment
> >>>> 2) Solicit folk in the community competent in the relevant data
> >>>> structures to vet the library for the proposed functionality
> >>>>
> >>>> The existing solution never went through (2) because it dates from the
> >>>> dark ages where we just threw dependencies in willynilly. But it has the
> >>>> benefit of having been used for a very long time without incident.
> >>>>
> >>>>
> >>>>
> >>>> On 2 Jan 2025, at 20:12, Štefan Miklošovič <[email protected]>
> >>>> wrote:
> >>>>
> >>>> 
> >>>> Hi Benedict,
> >>>>
> >>>> you wrote:
> >>>>
> >>>> I am strongly opposed to updating libraries simply for the sake of it.
> >>>> Something like HLL does not need much ongoing maintenance if it works.
> >>>> We’re simply asking for extra work and bugs by switching, and some risk
> >>>> without understanding the quality control for the new library project’s
> >>>> releases.
> >>>>
> >>>> I understand this. But really, do you think that it is a bad idea to
> >>>> switch to a well maintained library which is already used quite widely 
> >>>> (the
> >>>> website mentions extensions for sketches in Apache Druid, Hive, Pig, 
> >>>> Pinot
> >>>> and PostgreSQL) and using the library which was abandoned for 6 years?
> >>>>
> >>>> As I mentioned there is also extensive comparison with Clearspring (1)
> >>>> where all performance benefits / speedups etc present in detail with 
> >>>> charts
> >>>> attached.
> >>>>
> >>>> I think this is a mature project, under Apache, so when we think that a
> >>>> 6 years old and abandoned library is better than what Apache Datasketches
> >>>> provides, then the question is what are we doing here? Are we not 
> >>>> believing
> >>>> what Apache itself offers and we need to rely on a 6 years old and dead
> >>>> library instead of that? Huh? That lib has 3k commits, releases often, 
> >>>> it's
> >>>> a pretty active project ...
> >>>>
> >>>> I don't say that we should not test more deeply how it behaves, we
> >>>> might even re-consider the parameters of hyperloglog as we do that. But I
> >>>> don't think that having this library introduced would cause some kind of 
> >>>> a
> >>>> widespread / systemic risk.
> >>>>
> >>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
> >>>>
> >>>> On Thu, Jan 2, 2025 at 5:03 PM Benedict <[email protected]> wrote:
> >>>>
> >>>>> I am strongly opposed to updating libraries simply for the sake of it.
> >>>>> Something like HLL does not need much ongoing maintenance if it works.
> >>>>> We’re simply asking for extra work and bugs by switching, and some risk
> >>>>> without understanding the quality control for the new library project’s
> >>>>> releases.
> >>>>>
> >>>>> That said, I was not very impressed with the clear spring library when
> >>>>> I looked at it, so I would be open to a stronger argument about data
> >>>>> sketches being superior otherwise in a way that matters to us.
> >>>>>
> >>>>> If we are to replace the library, we should at the very least do
> >>>>> proper due diligence by reviewing the new library’s implementation(s)
> >>>>> ourselves. We cannot simply assume the new library behaves well for our 
> >>>>> use
> >>>>> cases, or is well maintained.
> >>>>>
> >>>>> We should also not use the fallback intersection method, as this would
> >>>>> represent a regression to compaction on upgrade. We should really 
> >>>>> convert
> >>>>> from one HLL to another.
> >>>>>
> >>>>> The proposal to reduce allocations appears to be orthogonal to this
> >>>>> library, so let’s separate out that discussion? If there’s evidence this
> >>>>> library alone improves the memory profile let’s discuss that.
> >>>>>
> >>>>>
> >>>>> On 2 Jan 2025, at 15:26, Chris Lohfink <[email protected]> wrote:
> >>>>>
> >>>>> 
> >>>>> I think switching to datasketches is a good idea first off simply
> >>>>> because of the lack of maintenance and improvements from clearspring. I 
> >>>>> am
> >>>>> however, am not sold that it will actually improve anything 
> >>>>> significantly.
> >>>>> Caches might help on small cases, but those small cases probably are not
> >>>>> actually impacted. In the large cases the caches cost more in 
> >>>>> complexity,
> >>>>> memory, and ultimately wont matter when theres 50k sstables and the 
> >>>>> cache
> >>>>> holds 1k so everythings hitting disk anyway.
> >>>>>
> >>>>> The 5% is missing some relevant information like what the allocation
> >>>>> rate was, how many tables there are etc. On an idle system thats
> >>>>> meaningless, if there were 5gb/s allocations of reads/writes happening 
> >>>>> at
> >>>>> the time thats huge.
> >>>>>
> >>>>> On Thu, Jan 2, 2025 at 8:42 AM Štefan Miklošovič <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> Interesting, thanks for this. Well ... 5% here, 5% there ... it
> >>>>>> compounds. I think it is worth trying to do something with this. Would 
> >>>>>> be
> >>>>>> great if you were part of this effort!
> >>>>>>
> >>>>>> On Thu, Jan 2, 2025 at 3:38 PM Dmitry Konstantinov <
> >>>>>> [email protected]> wrote:
> >>>>>>
> >>>>>>> I have seen this place in async profiler memory allocation profile
> >>>>>>> on one of production environments some time ago, it was visible but 
> >>>>>>> not
> >>>>>>> dramatic, about 5% of allocations:
> >>>>>>> <image.png>
> >>>>>>>
> >>>>>>> The amount of overhead also depends on a metric collection frequency
> >>>>>>> (in my case it was once per 60 seconds)
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Dmitry
> >>>>>>>
> >>>>>>> On Thu, 2 Jan 2025 at 14:21, Štefan Miklošovič <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> Indeed, I plan to measure it and compare, maybe some bench test
> >>>>>>>> would be cool to add ..
> >>>>>>>>
> >>>>>>>> I strongly suspect that the primary reason for the slowness (if it
> >>>>>>>> is verified to be true) is us going to the disk every time and 
> >>>>>>>> reading
> >>>>>>>> stats for every SSTable all over again.
> >>>>>>>>
> >>>>>>>> While datasketches say that it is way faster to update (1), we are
> >>>>>>>> living in a realm of nanoseconds here and I don't think that itself 
> >>>>>>>> would
> >>>>>>>> make any meaningful difference when merging one hyperloglog with 
> >>>>>>>> another as
> >>>>>>>> part of partition rows estimation computation. The only place we are
> >>>>>>>> updating is SortableTableWriter#endParition which calls
> >>>>>>>> metadatacollector.addKey(key.getKey()) which eventually updates the
> >>>>>>>> estimator via cardinality#offeredHashed.
> >>>>>>>>
> >>>>>>>> In other words, I think that going to the disk and reading it
> >>>>>>>> repeatedly is disproportionally more IO / time intensive than 
> >>>>>>>> switching the
> >>>>>>>> hyperloglog implementation.
> >>>>>>>>
> >>>>>>>> However, I consider the replacement of the library still important.
> >>>>>>>> I feel uneasy about staying with an abandoned library where there is
> >>>>>>>> clearly a well-maintained replacement.
> >>>>>>>>
> >>>>>>>> What we could do is to cache all cardinality estimators and just
> >>>>>>>> merge it all when asked upon metric resolution. That is different 
> >>>>>>>> from
> >>>>>>>> going to disk to deserialize StatsComponent's all over again.
> >>>>>>>>
> >>>>>>>> Then on SSTable removal, we would remove that from cache too. I
> >>>>>>>> think there is some kind of an observer when SSTable is removed ...
> >>>>>>>>
> >>>>>>>> However, I am not sure I can just hold it all in the memory, it
> >>>>>>>> works for laptop testing but if we have thousands of SSTables with
> >>>>>>>> non-trivial number of rows things start to get interesting.
> >>>>>>>>
> >>>>>>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
> >>>>>>>> - section HllSketch vs. HyperLogLogPlus Update Speed Behavior
> >>>>>>>>
> >>>>>>>> On Thu, Jan 2, 2025 at 2:46 PM Jon Haddad <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Sounds interesting.  I took a look at the issue but I'm not seeing
> >>>>>>>>> any data to back up "expensive".  Can this be quantified a bit more?
> >>>>>>>>>
> >>>>>>>>> Anytime we have a performance related issue, there should be some
> >>>>>>>>> data to back it up, even if it seems obvious.
> >>>>>>>>>
> >>>>>>>>> Jon
> >>>>>>>>>
> >>>>>>>>> On Thu, Jan 2, 2025 at 8:20 AM Štefan Miklošovič <
> >>>>>>>>> [email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hello,
> >>>>>>>>>>
> >>>>>>>>>> I just stumbled upon this library we are using for getting
> >>>>>>>>>> estimations of the number of partitions in a SSTable which are 
> >>>>>>>>>> used e.g. in
> >>>>>>>>>> EstimatedPartitionCount metric. (1)
> >>>>>>>>>>
> >>>>>>>>>> A user reported in (1) that it is an expensive operation. When
> >>>>>>>>>> one looks into what it is doing, it calls
> >>>>>>>>>> SSTableReader.getApproximateKeyCount() (6) which basically goes to 
> >>>>>>>>>> disk
> >>>>>>>>>> every single time, it loads all Stats components and it looks into
> >>>>>>>>>> CompactionMetadata where the cardinality estimator is located.
> >>>>>>>>>>
> >>>>>>>>>> We are serializing the hyperloglog to disk as part of a SSTable
> >>>>>>>>>> and we deserialize it back in runtime for every SSTable in a CF 
> >>>>>>>>>> and we
> >>>>>>>>>> merge them all to one cardinality again.
> >>>>>>>>>>
> >>>>>>>>>> I do not think there is a way around this because of the nature
> >>>>>>>>>> of how a cardinality estimator works (hyperloglog). We can not 
> >>>>>>>>>> "cache it",
> >>>>>>>>>> it would work only in case we are adding SSTables only - hence we 
> >>>>>>>>>> would
> >>>>>>>>>> just merge again - but if we remove an SSTable as part of the 
> >>>>>>>>>> compaction,
> >>>>>>>>>> we can not "unmerge" it.
> >>>>>>>>>>
> >>>>>>>>>> That being said, we are currently using this library for
> >>>>>>>>>> hyperloglog (1) which was archived in summer 2020 and nothing was
> >>>>>>>>>> contributed to that for 6 years. That lib is dead.
> >>>>>>>>>>
> >>>>>>>>>> There is very nice replacement of that (2) directly from Apache
> >>>>>>>>>> (!!!) and they are even giving the detailed and in-depth 
> >>>>>>>>>> comparison of
> >>>>>>>>>> hyperloglog implementation found in stream-lib we happen to use (3)
> >>>>>>>>>> (stream-lib = Clearspring) where they are saying that updating is 
> >>>>>>>>>> way
> >>>>>>>>>> faster and it is also giving better estimations in general.
> >>>>>>>>>>
> >>>>>>>>>> I have implemented the usage of both cardinality estimators (4),
> >>>>>>>>>> (5). The reason we need to keep the old one around is that we may 
> >>>>>>>>>> have old
> >>>>>>>>>> SSTables around and we need to work with them too. That translates 
> >>>>>>>>>> to a new
> >>>>>>>>>> SSTable version (ob) which uses new implementation and for 
> >>>>>>>>>> versions < ob,
> >>>>>>>>>> it uses the old one. When SSTables are upgraded from oa to ob, the 
> >>>>>>>>>> old
> >>>>>>>>>> estimator will not be used anymore.
> >>>>>>>>>>
> >>>>>>>>>> There is also a case of a user not upgrading his oa SSTables,
> >>>>>>>>>> turning a node on and creating new SSTables with ob version. When 
> >>>>>>>>>> this
> >>>>>>>>>> happens and we ask what is the cardinality (e.g via nodetool 
> >>>>>>>>>> tablestats), I
> >>>>>>>>>> am checking if all SSTables are on the same version or not. If 
> >>>>>>>>>> they are,
> >>>>>>>>>> they will use either an old or new estimator. (we can not merge 
> >>>>>>>>>> estimations
> >>>>>>>>>> from two different hyperloglog implementations). If they are not, 
> >>>>>>>>>> it will
> >>>>>>>>>> compute that from index summaries. (The computation for index 
> >>>>>>>>>> summaries was
> >>>>>>>>>> already in place (6) as a fail-over in case the estimation 
> >>>>>>>>>> computation
> >>>>>>>>>> failed / was not present).
> >>>>>>>>>>
> >>>>>>>>>> Does this all make sense to drive further to the completion and
> >>>>>>>>>> eventually merge this work to trunk?
> >>>>>>>>>>
> >>>>>>>>>> Worth to add that Apache Datasketches are just two dependencies
> >>>>>>>>>> for us, it has zero external dependencies.
> >>>>>>>>>>
> >>>>>>>>>> (1) https://github.com/addthis/stream-lib
> >>>>>>>>>> (2) https://datasketches.apache.org/
> >>>>>>>>>> (3) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
> >>>>>>>>>> (4) https://issues.apache.org/jira/browse/CASSANDRA-13338
> >>>>>>>>>> (5) https://github.com/apache/cassandra/pull/3767
> >>>>>>>>>> (6)
> >>>>>>>>>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L284-L338
> >>>>>>>>>>
> >>>>>>>>>> Regards
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Dmitry Konstantinov
> >>>>>>>
> >>>>>>
> 
> -- 
> Dmitry Konstantinov
>

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

Reply via email to