Hello Datasketches community,

I am from the Apache Cassandra project, where we use Clearspring's
stream-lib (1) to estimate row cardinalities in Cassandra's SSTables. We
serialize the whole HyperLogLog from (1) (more or less) to disk; later we
deserialize it and merge the sketches from all SSTables to get the final
estimate across the whole data set.
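
To make that flow concrete, here is a minimal, self-contained sketch of
the serialize / deserialize / merge pattern. Please note that ToyHll,
SketchRoundTrip, sketchOf and the hash function are hypothetical stand-ins
made up for illustration; this is not Clearspring's HyperLogLogPlus and
not its on-disk format, only the general shape of what we do per SSTable.

```java
// Hypothetical toy HyperLogLog, NOT Clearspring's implementation or format.
final class ToyHll {
    static final int P = 13;        // 2^13 registers (assumed precision)
    static final int M = 1 << P;
    final byte[] registers;

    ToyHll() { this(new byte[M]); }
    private ToyHll(byte[] registers) { this.registers = registers; }

    // Takes an already hashed 64-bit value.
    void offerHashed(long hash) {
        int idx = (int) (hash >>> (64 - P));   // top P bits select a register
        long rest = hash << P;                 // remaining bits give the rank
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - P + 1);
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    // "Serialization" here is just a copy of the register array.
    byte[] getBytes() { return registers.clone(); }

    static ToyHll fromBytes(byte[] bytes) {
        if (bytes.length != M) throw new IllegalArgumentException("unexpected sketch size");
        return new ToyHll(bytes.clone());
    }

    // Merging is a register-wise maximum; it only makes sense when both
    // sketches use the same hash function and the same register layout.
    ToyHll merge(ToyHll other) {
        ToyHll out = new ToyHll();
        for (int i = 0; i < M; i++)
            out.registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        return out;
    }

    // Raw HLL estimate with the usual linear-counting correction for small counts.
    long cardinality() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) zeros++;
        }
        double raw = (0.7213 / (1 + 1.079 / M)) * M * M / sum;
        if (raw <= 2.5 * M && zeros > 0) return Math.round(M * Math.log((double) M / zeros));
        return Math.round(raw);
    }
}

final class SketchRoundTrip {
    // SplitMix64 finalizer as a stand-in hash function (an assumption).
    static long hash(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    // Builds a sketch over the keys in [from, to), as if written for one SSTable.
    static ToyHll sketchOf(long from, long to) {
        ToyHll hll = new ToyHll();
        for (long key = from; key < to; key++) hll.offerHashed(hash(key));
        return hll;
    }

    public static void main(String[] args) {
        byte[] sstable1 = sketchOf(0, 6000).getBytes();      // written with SSTable 1
        byte[] sstable2 = sketchOf(4000, 10000).getBytes();  // written with SSTable 2
        ToyHll merged = ToyHll.fromBytes(sstable1).merge(ToyHll.fromBytes(sstable2));
        System.out.println("estimated distinct keys: " + merged.cardinality());
    }
}
```

The two key ranges overlap on purpose: the merged sketch estimates the
size of the union, which is what we need across SSTables.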

(1) is, as you probably know, archived and no longer actively maintained.
Hence, we are looking for a replacement.

Datasketches is quite an obvious choice, but I would like to get answers
to a few questions before we make the transition.

We need to work with old data as well. If there is an SSTable on disk
with an HLL from Clearspring, then we cannot merge it with a Datasketches
sketch, right? In other words, this is not possible:

    @Test
    public void testMerging() throws Throwable
    {
        // wrapper around Clearspring
        LegacyCardinality clearspringCardinality = new LegacyCardinality(new HyperLogLogPlus(13, 25));
        clearspringCardinality.offerHashed(12345);

        // wrapper around Datasketches HLL
        DefaultCardinality datasketchesCardinality = new DefaultCardinality();
        datasketchesCardinality.offerHashed(23456);

        // this fails, as well as similar variations of that
        clearspringCardinality.merge(
            new LegacyCardinality(HyperLogLogPlus.Builder.build(datasketchesCardinality.getBytes())).getCardinality());
    }

It would be great if you could confirm (or deny) that there is no way to
merge these two together. How would you work around this problem in
general? If they are not mergeable, we will need to find another way to
deal with this, but that is another story.

I see that there is (2), which is a great in-depth description of the
differences between the two, but to my knowledge there is no information
on whether one is convertible to the other.

Thank you and regards

Stefan Miklosovic

(1) https://github.com/addthis/stream-lib/tree/master
(2) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
