I’m confused, Stefan: in what way do you protest? How is your proposal to cache these collections tied to the topic you started here? This should be a separate proposal, discussed on its own merits, should it not?

I am not opposed to it happening, only to conflating the two concerns.

On 2 Jan 2025, at 19:28, Štefan Miklošovič <smikloso...@apache.org> wrote:


Hi Chris,

we should put your reasoning under some basic scrutiny. The original whitepaper for HyperLogLog (1) says in the introduction that

- "The new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes."

10^9 is 1 billion entries. I generated an SSTable with 1 million UUIDs; the resulting -Data.db file was 24 MB, and the whole SSTable (-Index.db etc.) was 46 MB, but let's just say that 1 SSTable takes 24 MB for the sake of this example.

The Statistics.db file, where our serialized cardinality estimator byte array is stored, is just 6.7 KB in total, and when I read it programmatically and measured the deep size of that array (2), I got 2112 bytes for 1M entries and 2014 bytes for 1k entries.

You can see that the number of entries is basically irrelevant; the estimator will always be around 2000 bytes.

Your example of 50k SSTables is extreme, but anyway: 50k SSTables of 24 MB each would occupy around 1.2 TB of disk on a node.

50k SSTables times 2000 bytes of HyperLogLog is about 95 MB. That is an _extreme example_.

If you had 10k SSTables, you would have ~20MB of hyperloglogs.

So what I meant by caching was misunderstood. We would not have "1k of hyperloglogs in cache and then the rest of it would not fit". No. We would hold everything in memory. Everything.

What would you rather do? Hold 20 MB of HyperLogLogs in memory, or go to disk EVERY SINGLE TIME (for Dmitry's case, every minute) and deserialize the stats of 10k SSTables? Why would you do that? I have measured how long it takes to go to disk and read the stats: around 1 ms per SSTable, so 10,000 stats = 10 seconds.

So Dmitry, with a CFS holding 10k SSTables, would spend 10 seconds JUST READING FROM DISK, before he has even started to merge the HyperLogLogs. And he has to do this every minute.

What I propose is to keep those 20 MB in memory and never go to disk again when reading it every minute. What is also nice about HyperLogLogs is that merging them is parallelizable: we can have e.g. 4 threads, each merging 2500 HyperLogLogs, and the resulting 4 HyperLogLogs are merged again at the end. That cuts the computation to roughly 1/4 of the time.
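
To make the idea concrete, here is a rough sketch of the parallel merge (illustrative only, not the actual patch; it assumes the Apache DataSketches HLL API with HllSketch / Union, and the class name, thread count and lgK value are made up for the example):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

public final class ParallelHllMerge
{
    // merge per-SSTable sketches with a fixed number of worker threads,
    // then merge the few per-thread partial results into one final sketch
    static HllSketch mergeInParallel(List<HllSketch> sketches, int threads) throws Exception
    {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try
        {
            int chunk = (sketches.size() + threads - 1) / threads;
            List<Future<HllSketch>> partials = new ArrayList<>();
            for (int start = 0; start < sketches.size(); start += chunk)
            {
                List<HllSketch> slice = sketches.subList(start, Math.min(start + chunk, sketches.size()));
                partials.add(pool.submit(() -> {
                    Union partial = new Union(13);          // lgK = 13 is an assumed precision
                    for (HllSketch s : slice)
                        partial.update(s);
                    return partial.getResult();
                }));
            }

            Union finalUnion = new Union(13);               // merging the few partial results is cheap
            for (Future<HllSketch> f : partials)
                finalUnion.update(f.get());
            return finalUnion.getResult();
        }
        finally
        {
            pool.shutdown();
        }
    }
}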

How are you going to make this happen with a linear, read-from-disk-every-time approach?

I understand Benedict's concern but this is clearly not applicable here, hence I protest.

Regards

(1) https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
(2) https://gist.github.com/smiklosovic/7066952df0a2bda65476efacc4d3e3dc#file-atest-java-L39

On Thu, Jan 2, 2025 at 4:25 PM Chris Lohfink <clohfin...@gmail.com> wrote:
I think switching to datasketches is a good idea, first off simply because of the lack of maintenance and improvements from clearspring. I am, however, not sold that it will actually improve anything significantly. Caches might help in small cases, but those small cases are probably not actually impacted. In the large cases the caches cost more in complexity and memory, and ultimately won't matter when there are 50k SSTables and the cache holds 1k, so everything is hitting disk anyway.

The 5% is missing some relevant information, like what the allocation rate was, how many tables there are, etc. On an idle system that's meaningless; if there were 5 GB/s of allocations from reads/writes happening at the time, that's huge.

On Thu, Jan 2, 2025 at 8:42 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
Interesting, thanks for this. Well ... 5% here, 5% there ... it compounds. I think it is worth trying to do something about this. It would be great if you were part of this effort!

On Thu, Jan 2, 2025 at 3:38 PM Dmitry Konstantinov <netud...@gmail.com> wrote:
I saw this place in an async-profiler memory allocation profile on one of our production environments some time ago; it was visible but not dramatic, about 5% of allocations:
<image.png>

The amount of overhead also depends on the metric collection frequency (in my case it was once per 60 seconds).

Regards,
Dmitry

On Thu, 2 Jan 2025 at 14:21, Štefan Miklošovič <smikloso...@apache.org> wrote:
Indeed, I plan to measure it and compare; maybe some benchmark would be good to add.

I strongly suspect that the primary reason for the slowness (if it is verified to be true) is that we go to disk every time and read the stats for every SSTable all over again.

While datasketches says that it is way faster to update (1), we are living in a realm of nanoseconds here, and I don't think that by itself would make any meaningful difference when merging one HyperLogLog with another as part of the partition count estimation. The only place we update it is SortableTableWriter#endPartition, which calls metadatacollector.addKey(key.getKey()), which eventually updates the estimator via cardinality#offerHashed.

In other words, I think that going to disk and reading it repeatedly is disproportionately more IO / time intensive than switching the HyperLogLog implementation.

However, I still consider the replacement of the library important. I feel uneasy about staying on an abandoned library when there is clearly a well-maintained replacement.

What we could do is cache all the cardinality estimators and just merge them when asked, upon metric resolution. That is different from going to disk and deserializing the StatsComponents all over again.

Then, on SSTable removal, we would remove its estimator from the cache too. I think there is some kind of observer that is notified when an SSTable is removed ...
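
Very roughly, the cache could look something like this (just a sketch, assuming we are already storing the new DataSketches format; the String key and the add/remove hooks are placeholders for whatever Cassandra actually exposes):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

public final class CardinalityCache
{
    // key: some stable SSTable identity (a String here just for the example)
    private final ConcurrentMap<String, HllSketch> perSSTable = new ConcurrentHashMap<>();

    // called when an SSTable is added; the serialized estimator is read from
    // the Stats component exactly once and kept on heap from then on
    void onSSTableAdded(String sstableId, byte[] serializedSketch)
    {
        perSSTable.put(sstableId, HllSketch.heapify(serializedSketch));
    }

    // called from the "SSTable removed" notification, so sketches of
    // compacted-away SSTables simply drop out of the cache
    void onSSTableRemoved(String sstableId)
    {
        perSSTable.remove(sstableId);
    }

    // metric resolution: merge whatever is currently cached, no disk I/O at all
    long estimatedPartitionCount()
    {
        Union union = new Union(13);                 // lgK = 13 is an assumed precision
        for (HllSketch sketch : perSSTable.values())
            union.update(sketch);
        return Math.round(union.getEstimate());
    }
}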

However, I am not sure we can just hold it all in memory; it works for laptop testing, but if we have thousands of SSTables with a non-trivial number of rows, things start to get interesting.

(1) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html - section HllSketch vs. HyperLogLogPlus Update Speed Behavior

On Thu, Jan 2, 2025 at 2:46 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
Sounds interesting.  I took a look at the issue but I'm not seeing any data to back up "expensive".  Can this be quantified a bit more? 

Anytime we have a performance related issue, there should be some data to back it up, even if it seems obvious.

Jon

On Thu, Jan 2, 2025 at 8:20 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
Hello,

I just stumbled upon the library we are using for estimating the number of partitions in an SSTable, which is used e.g. in the EstimatedPartitionCount metric. (1)

A user reported in (1) that it is an expensive operation. When one looks into what it is doing, it calls SSTableReader.getApproximateKeyCount() (6), which basically goes to disk every single time: it loads all Stats components and looks into CompactionMetadata, where the cardinality estimator is located.

We serialize the hyperloglog to disk as part of an SSTable, and at runtime we deserialize it back for every SSTable in a CF and merge them all into one cardinality again.

I do not think there is a way around this, because of the nature of how a cardinality estimator (hyperloglog) works. We cannot "cache" the merged result; that would only work if SSTables were only ever added - then we could just merge the new ones in - but if we remove an SSTable as part of compaction, we cannot "unmerge" it.

That being said, we are currently using this library for hyperloglog (1), which was archived in summer 2020, and nothing has been contributed to it for 6 years. That lib is dead.

There is a very nice replacement for it (2), directly from Apache (!!!), and they even provide a detailed and in-depth comparison with the hyperloglog implementation found in stream-lib, which we happen to use (3) (stream-lib = Clearspring), where they say that updating is way faster and that it also gives better estimations in general.
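
For a feel of the API, basic usage of the DataSketches sketch looks roughly like this (illustrative only; the lgK value and the plain long inputs stand in for the hashed partition keys we would actually offer):

import org.apache.datasketches.hll.HllSketch;

public class HllSketchExample
{
    public static void main(String[] args)
    {
        HllSketch sketch = new HllSketch(13);        // lgK = 13, illustrative precision

        // in Cassandra this would be the hashed partition key, offered once per partition
        for (long i = 0; i < 1_000_000; i++)
            sketch.update(i);

        System.out.println("estimate: " + sketch.getEstimate());          // ~1,000,000

        // the sketch serializes to a small byte[] that could go into the Stats component
        byte[] bytes = sketch.toCompactByteArray();
        System.out.println("serialized size: " + bytes.length + " bytes");
    }
}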

I have implemented the usage of both cardinality estimators (4), (5). The reason we need to keep the old one around is that we may still have old SSTables and we need to work with them too. That translates to a new SSTable version (ob) which uses the new implementation; for versions < ob, the old one is used. Once SSTables are upgraded from oa to ob, the old estimator is not used anymore.

There is also the case of a user not upgrading their oa SSTables, turning a node on, and creating new SSTables with the ob version. When this happens and we ask for the cardinality (e.g. via nodetool tablestats), I check whether all SSTables are on the same version. If they are, they use either the old or the new estimator (we cannot merge estimations from two different hyperloglog implementations). If they are not, the cardinality is computed from index summaries. (The computation from index summaries was already in place (6) as a fallback in case the estimation computation failed / was not present.)

Does this all make sense to drive further to the completion and eventually merge this work to trunk?



--
Dmitry Konstantinov
