I think it makes sense to make the options more clear. I would suggest a Google sheet or a table within a JIRA ticket with the options and a comparison (it looks like the majority of the confusion in this topic is caused by different ways of interpreting the suggestion :-) ). I see a table like this:

+----------------------------------------------------------------------+-----------------------------------------+------------------------------------------+
| Aspect                                                               | cache estimator in memory + Clearspring | cache estimator in memory + DataSketches |
+----------------------------------------------------------------------+-----------------------------------------+------------------------------------------+
| - license                                                            | Apache 2.0                              | Apache 2.0                               |
| - is the library supported                                           | no                                      | yes                                      |
| - popularity (stars / projects which use it)                         | ?                                       | ?                                        |
| - dependencies                                                       | 1                                       | 0                                        |
| - regression risks                                                   | low                                     | high                                     |
| - estimation accuracy                                                | ?                                       | ?                                        |
| - efficiency (time/cpu): to build the estimator                      | ?                                       | ?                                        |
| - efficiency (time/cpu): to merge estimators to get per-table stats  | ?                                       | ?                                        |
| - format migration complexity                                        | no                                      | yes                                      |
| - memory allocation overheads                                        | ?                                       | ?                                        |
+----------------------------------------------------------------------+-----------------------------------------+------------------------------------------+
Regards,
Dmitry

On Thu, 2 Jan 2025 at 20:12, Štefan Miklošovič <smikloso...@apache.org> wrote:

> Hi Benedict,
>
> you wrote:
>
> I am strongly opposed to updating libraries simply for the sake of it. Something like HLL does not need much ongoing maintenance if it works. We’re simply asking for extra work and bugs by switching, and some risk without understanding the quality control for the new library project’s releases.
>
> I understand this. But really, do you think it is a bad idea to switch to a well-maintained library which is already used quite widely (the website mentions extensions for sketches in Apache Druid, Hive, Pig, Pinot and PostgreSQL), rather than keep using a library which was abandoned 6 years ago?
>
> As I mentioned, there is also an extensive comparison with Clearspring (1) where all the performance benefits / speedups etc. are presented in detail, with charts attached.
>
> I think this is a mature project, under Apache, so if we think that a 6-year-old, abandoned library is better than what Apache DataSketches provides, then the question is: what are we doing here? Do we not trust what Apache itself offers, so that we need to rely on a 6-year-old, dead library instead? That lib has 3k commits, releases often, it's a pretty active project ...
>
> I am not saying that we should not test more deeply how it behaves; we might even reconsider the parameters of HyperLogLog as we do that. But I don't think that introducing this library would cause some kind of widespread / systemic risk.
>
> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
>
> On Thu, Jan 2, 2025 at 5:03 PM Benedict <bened...@apache.org> wrote:
>
>> I am strongly opposed to updating libraries simply for the sake of it. Something like HLL does not need much ongoing maintenance if it works.
>> We’re simply asking for extra work and bugs by switching, and some risk without understanding the quality control for the new library project’s releases.
>>
>> That said, I was not very impressed with the Clearspring library when I looked at it, so I would be open to a stronger argument about DataSketches being superior in a way that matters to us.
>>
>> If we are to replace the library, we should at the very least do proper due diligence by reviewing the new library’s implementation(s) ourselves. We cannot simply assume the new library behaves well for our use cases, or is well maintained.
>>
>> We should also not use the fallback intersection method, as this would represent a regression to compaction on upgrade. We should really convert from one HLL to the other.
>>
>> The proposal to reduce allocations appears to be orthogonal to this library, so let’s separate out that discussion? If there’s evidence this library alone improves the memory profile, let’s discuss that.
>>
>> On 2 Jan 2025, at 15:26, Chris Lohfink <clohfin...@gmail.com> wrote:
>>
>> I think switching to DataSketches is a good idea, first off simply because of the lack of maintenance and improvements from Clearspring. I am, however, not sold that it will actually improve anything significantly. Caches might help in small cases, but those small cases are probably not actually impacted. In the large cases the caches cost more in complexity and memory, and ultimately won't matter when there are 50k SSTables and the cache holds 1k, so everything is hitting disk anyway.
>>
>> The 5% figure is missing some relevant information, like what the allocation rate was, how many tables there are, etc. On an idle system that's meaningless; if there were 5 GB/s of allocations from reads/writes happening at the time, that's huge.
>>
>> On Thu, Jan 2, 2025 at 8:42 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>
>>> Interesting, thanks for this. Well ...
5% here, 5% there ... it compounds. I think it is worth trying to do something about this. It would be great if you were part of this effort!
>>>
>>> On Thu, Jan 2, 2025 at 3:38 PM Dmitry Konstantinov <netud...@gmail.com> wrote:
>>>
>>>> I have seen this place in an async-profiler memory allocation profile on one of our production environments some time ago; it was visible but not dramatic, about 5% of allocations:
>>>> <image.png>
>>>>
>>>> The amount of overhead also depends on the metric collection frequency (in my case it was once per 60 seconds).
>>>>
>>>> Regards,
>>>> Dmitry
>>>>
>>>> On Thu, 2 Jan 2025 at 14:21, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>
>>>>> Indeed, I plan to measure it and compare; maybe some bench test would be cool to add ...
>>>>>
>>>>> I strongly suspect that the primary reason for the slowness (if it is verified to be true) is us going to the disk every time and reading the stats for every SSTable all over again.
>>>>>
>>>>> While DataSketches says that it is way faster to update (1), we are living in a realm of nanoseconds here, and I don't think that by itself would make any meaningful difference when merging one HyperLogLog with another as part of the partition rows estimation computation. The only place we are updating is SortedTableWriter#endPartition, which calls metadataCollector.addKey(key.getKey()), which eventually updates the estimator via cardinality#offerHashed.
>>>>>
>>>>> In other words, I think that going to the disk and reading it repeatedly is disproportionately more IO / time intensive than switching the HyperLogLog implementation.
>>>>>
>>>>> However, I still consider the replacement of the library important. I feel uneasy about staying with an abandoned library when there is clearly a well-maintained replacement.
>>>>> What we could do is to cache all cardinality estimators and just merge them all when asked, upon metric resolution. That is different from going to disk to deserialize the StatsComponents all over again.
>>>>>
>>>>> Then on SSTable removal, we would remove that from the cache too. I think there is some kind of an observer for when an SSTable is removed ...
>>>>>
>>>>> However, I am not sure we can just hold it all in memory; it works for laptop testing, but if we have thousands of SSTables with a non-trivial number of rows, things start to get interesting.
>>>>>
>>>>> (1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html - section "HllSketch vs. HyperLogLogPlus Update Speed Behavior"
>>>>>
>>>>> On Thu, Jan 2, 2025 at 2:46 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>
>>>>>> Sounds interesting. I took a look at the issue, but I'm not seeing any data to back up "expensive". Can this be quantified a bit more?
>>>>>>
>>>>>> Anytime we have a performance-related issue, there should be some data to back it up, even if it seems obvious.
>>>>>>
>>>>>> Jon
>>>>>>
>>>>>> On Thu, Jan 2, 2025 at 8:20 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I just stumbled upon this library we are using for getting estimations of the number of partitions in an SSTable, which are used e.g. in the EstimatedPartitionCount metric. (1)
>>>>>>>
>>>>>>> A user reported in (1) that it is an expensive operation. When one looks into what it is doing, it calls SSTableReader.getApproximateKeyCount() (6), which basically goes to disk every single time; it loads all Stats components and looks into CompactionMetadata, where the cardinality estimator is located.
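[Editor's note] The caching idea discussed above (keep one estimator per live SSTable in memory, union them only when the metric is read, evict on SSTable removal) can be sketched as follows. This is an illustrative toy in Python, not Cassandra's actual Java API; `EstimatorCache`, `offer`, and the register layout are hypothetical stand-ins for Clearspring's `HyperLogLogPlus` / DataSketches' `HllSketch`:

```python
import hashlib

# Toy HyperLogLog with 2^P registers; purely illustrative.
P = 10

def empty_hll():
    return [0] * (1 << P)

def offer(registers, key: bytes):
    """Update the sketch with one partition key (hashed to 64 bits)."""
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    idx = h >> (64 - P)                       # first P bits pick the register
    rest = h & ((1 << (64 - P)) - 1)
    rank = (64 - P) - rest.bit_length() + 1   # leading zeros in remainder + 1
    registers[idx] = max(registers[idx], rank)

def merge(a, b):
    # HLL merge is an element-wise max of registers: cheap and lossless.
    return [max(x, y) for x, y in zip(a, b)]

class EstimatorCache:
    """Keep one estimator per live SSTable; union on demand at metric read."""
    def __init__(self):
        self.per_sstable = {}

    def add(self, sstable_id, registers):
        self.per_sstable[sstable_id] = registers

    def remove(self, sstable_id):
        # On compaction we drop whole per-SSTable entries instead of trying
        # to "unmerge" a combined sketch (which HLL cannot do).
        self.per_sstable.pop(sstable_id, None)

    def merged(self):
        out = empty_hll()
        for regs in self.per_sstable.values():
            out = merge(out, regs)
        return out
```

The design point is that the cache never stores a pre-merged, table-wide sketch: removal is handled by dropping a per-SSTable entry, and the table-level estimate is rebuilt by a fast in-memory union at read time instead of re-reading Stats components from disk.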
>>>>>>> We are serializing the HyperLogLog to disk as part of an SSTable, and at runtime we deserialize it back for every SSTable in a CF and merge them all into one cardinality again.
>>>>>>>
>>>>>>> I do not think there is a way around this because of the nature of how a cardinality estimator (HyperLogLog) works. We cannot simply "cache it"; that would work only if we were just adding SSTables - then we would just merge again - but if we remove an SSTable as part of compaction, we cannot "unmerge" it.
>>>>>>>
>>>>>>> That being said, we are currently using this library for HyperLogLog (1), which was archived in summer 2020, with nothing contributed to it for 6 years. That lib is dead.
>>>>>>>
>>>>>>> There is a very nice replacement for it (2), directly from Apache (!!!), and they even give a detailed and in-depth comparison with the HyperLogLog implementation found in stream-lib, which we happen to use (3) (stream-lib = Clearspring), showing that updating is way faster and that it also gives better estimations in general.
>>>>>>>
>>>>>>> I have implemented the usage of both cardinality estimators (4), (5). The reason we need to keep the old one around is that we may have old SSTables around and we need to work with them too. That translates to a new SSTable version (ob) which uses the new implementation; for versions < ob, it uses the old one. When SSTables are upgraded from oa to ob, the old estimator will not be used anymore.
>>>>>>>
>>>>>>> There is also the case of a user not upgrading his oa SSTables, turning a node on and creating new SSTables with the ob version. When this happens and we ask what the cardinality is (e.g. via nodetool tablestats), I am checking whether all SSTables are on the same version or not.
>>>>>>> If they are, they will use either the old or the new estimator (we cannot merge estimations from two different HyperLogLog implementations). If they are not, it will compute that from index summaries. (The computation from index summaries was already in place (6) as a fail-over in case the estimation computation failed / was not present.)
>>>>>>>
>>>>>>> Does this all make sense to drive further to completion and eventually merge this work to trunk?
>>>>>>>
>>>>>>> Worth adding that Apache DataSketches is just two dependencies for us; it has zero external dependencies.
>>>>>>>
>>>>>>> (1) https://github.com/addthis/stream-lib
>>>>>>> (2) https://datasketches.apache.org/
>>>>>>> (3) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
>>>>>>> (4) https://issues.apache.org/jira/browse/CASSANDRA-13338
>>>>>>> (5) https://github.com/apache/cassandra/pull/3767
>>>>>>> (6) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L284-L338
>>>>>>>
>>>>>>> Regards
>>>>>>
>>>>
>>>> --
>>>> Dmitry Konstantinov
>>>
-- Dmitry Konstantinov
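[Editor's note] A small numeric demonstration of the "we cannot unmerge" point raised in the thread. An HLL union takes the element-wise maximum of the registers, which destroys information: after compaction removes an SSTable, there is no unique "table minus SSTable" state to recover, which is why the per-SSTable sketches must be kept (or re-read) and re-merged. The three-register sketches below are made-up toy values:

```python
# HLL merge = element-wise max of registers (the real implementations,
# Clearspring's and DataSketches', do the same at their core).
def merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]

table = [3, 5, 2]      # merged registers for the whole table
removed = [1, 5, 2]    # registers of the SSTable compacted away

# Two different remainders both merge with `removed` into the same
# table-wide state, so "table - removed" has no unique answer:
assert merge([3, 5, 2], removed) == table
assert merge([3, 4, 1], removed) == table
```

Since max() is idempotent but not invertible, the only correct options after an SSTable removal are the ones discussed in the thread: re-merge the surviving per-SSTable estimators, or fall back to index summaries when the estimators are incompatible.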