There are a few different ways to apply approximation algorithms and
probabilistic data structures to your Spark data, including Spark's
countApproxDistinct() methods, as you pointed out.
There's also Twitter Algebird and Redis HyperLogLog (PFCOUNT, PFADD).
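For instance, here is a minimal sketch of the built-in RDD methods
(assuming sc is an existing SparkContext; the sample data and the 0.01
relative error are just for illustration):

    import org.apache.spark.rdd.RDD

    val userIds: RDD[String] =
      sc.parallelize(Seq("user_1", "user_2", "user_2"))

    // Approximate distinct count over the whole RDD
    // (HyperLogLog++ under the hood).
    val approxDistinctUsers: Long =
      userIds.countApproxDistinct(relativeSD = 0.01)

    // Per-key variant: approximate distinct users for each property tag.
    val tagged: RDD[(String, String)] = sc.parallelize(Seq(
      ("prop1=X", "user_1"), ("prop1=Y", "user_2"), ("prop2=A", "user_2")))
    val perProperty: RDD[(String, Long)] =
      tagged.countApproxDistinctByKey(0.01)

A smaller relativeSD gives tighter estimates at the cost of more memory
per sketch.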
Here are some examples from my *pipeline Gith
Thanks for the response, Jörn. To elaborate: I have a large dataset of
userIds, each tagged with one or more properties, e.g.:
user_1  prop1=X
user_2  prop1=Y  prop2=A
user_3  prop2=B
I would like to be able to get the number of distinct users that have a
particular property (or combination of properties).
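Concretely, this rough sketch is the kind of query I am after (assuming
Spark 2.1+, where the DataFrame approx_count_distinct function is
available; the schema names and the 0.01 error bound are made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.approx_count_distinct

    val spark = SparkSession.builder.appName("distinct-users").getOrCreate()
    import spark.implicits._

    // One row per (userId, property) tag; a user with two
    // properties appears in two rows.
    val tags = Seq(
      ("user_1", "prop1=X"),
      ("user_2", "prop1=Y"),
      ("user_2", "prop2=A"),
      ("user_3", "prop2=B")
    ).toDF("userId", "property")

    // Approximate number of distinct users per property.
    tags.groupBy("property")
      .agg(approx_count_distinct("userId", 0.01).alias("approxUsers"))
      .show()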
Can you elaborate a little bit more on the use case? It looks a little like
an abuse of Spark in general. Interactive queries that are not suitable for
in-memory batch processing might be better supported by Ignite, which has
in-memory indexes, a concept of hot, warm, and cold data, etc., or by Hive on