There are a few different ways to apply approximation algorithms and
probabilistic data structures to your Spark data, including Spark's
countApproxDistinct() methods, as you pointed out.
There's also Twitter's Algebird, and Redis's HyperLogLog commands (PFADD, PFCOUNT).
Here are some examples from my *pipeline Gith
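For context, the options above (countApproxDistinct(), Algebird's HLL monoid, Redis PFADD/PFCOUNT) are all built on the HyperLogLog sketch. Here's a minimal pure-Python sketch of the core idea, just to show why a fixed-size structure can estimate distinct counts; the class and parameter names are my own, not any library's API:

```python
import hashlib

class TinyHLL:
    """Toy HyperLogLog: estimates distinct count from max bit-run ranks."""

    def __init__(self, p=10):
        self.p = p                 # 2**p registers; more registers = less error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Deterministic 128-bit hash of the item
        h = int(hashlib.md5(str(item).encode("utf-8")).hexdigest(), 16)
        idx = h & (self.m - 1)     # low p bits choose a register
        w = h >> self.p            # remaining bits feed the rank
        rank = 1                   # rank = position of the lowest set bit
        while w & 1 == 0 and rank < 128 - self.p:
            w >>= 1
            rank += 1
        # Each register only remembers the max rank it has ever seen,
        # so re-adding a duplicate never changes the state.
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Raw HyperLogLog estimator (no small/large-range corrections)
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        return int(alpha * self.m * self.m / z)
```

With p=10 the sketch uses 1024 registers and the estimate is typically within a few percent. Spark's countApproxDistinct() and Redis's PFCOUNT implement refined versions of this (HyperLogLog++ with bias corrections), so prefer those in practice; the point is that memory stays constant no matter how many items you add.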
Thanks for the response, Jörn. To elaborate: I have a large dataset of
userIds, each tagged with one or more properties, e.g.:
user_1: prop1=X
user_2: prop1=Y, prop2=A
user_3: prop2=B
I would like to be able to get the number of distinct users that have a
particular property (or combination of properties).
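One way to frame this: build one distinct-userId structure per (property, value) pair, then combine them for queries. The sketch below uses exact Python sets and a made-up record layout mirroring the example above; at Spark scale you would swap each set for a HyperLogLog (e.g. Algebird's HLL, which merges the same way sets do):

```python
from collections import defaultdict

# Hypothetical records mirroring the example in the post
records = [
    ("user_1", {"prop1": "X"}),
    ("user_2", {"prop1": "Y", "prop2": "A"}),
    ("user_3", {"prop2": "B"}),
]

# One set of userIds per (property, value) pair; at scale, replace each
# set with an HLL sketch (unions of HLLs behave like unions of sets).
index = defaultdict(set)
for user, props in records:
    for prop, value in props.items():
        index[(prop, value)].add(user)

def distinct_users(*conditions):
    """Distinct users matching ALL of the given (prop, value) conditions."""
    sets = [index[c] for c in conditions]
    if not sets:
        return 0
    return len(set.intersection(*sets))

print(distinct_users(("prop1", "Y")))                  # → 1 (user_2)
print(distinct_users(("prop1", "Y"), ("prop2", "A")))  # → 1 (user_2)
```

One caveat if you do substitute HLLs: they support union natively but not intersection, so "combination of properties" queries need either inclusion-exclusion over unions or a different sketch (e.g. MinHash) for the intersections.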
Can you elaborate a little more on the use case? It looks a little like
an abuse of Spark in general. Interactive queries that are not suitable for
in-memory batch processing might be better supported by Apache Ignite, which
has in-memory indexes, a concept of hot/warm/cold data, etc., or Hive on
Hi all,
What's the best way to run ad-hoc queries against cached RDDs?
For example, say I have an RDD that has been processed and persisted as
memory-only. I want to be able to run a count (actually
countApproxDistinct()) after filtering by a value that is unknown at
compile time (i.e., specified by the query).
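If I'm reading the question right, the predicate only arrives at query time, so the usual pattern is to keep the RDD cached and build the filter per query, roughly `rdd.filter(lambda r: r[field] == value).map(lambda r: r["userId"]).countApproxDistinct()` in PySpark. A minimal pure-Python stand-in for the same shape (the record and field names here are made up):

```python
# Stand-in for a cached RDD: a list of dict records (field names made up)
cached = [
    {"userId": "user_1", "prop1": "X"},
    {"userId": "user_2", "prop1": "Y", "prop2": "A"},
    {"userId": "user_3", "prop2": "B"},
]

def adhoc_count(records, field, value):
    """Distinct userIds where record[field] == value; the (field, value)
    pair arrives at query time, not compile time."""
    matching = (r["userId"] for r in records if r.get(field) == value)
    # Spark analogue: .countApproxDistinct() on the filtered RDD
    return len(set(matching))

print(adhoc_count(cached, "prop2", "A"))  # → 1
```

Because the filter runs per query while the cached data stays resident, each ad-hoc query only pays the scan-plus-count cost, which is the same trade Spark makes when you filter a persisted RDD.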