Hi,
I saw this a little bit late. I implemented a custom count distinct for our
streaming use case. If you are looking for something close enough but not
exact, you can use my UDF. It uses the HyperLogLogPlus algorithm, which is
an efficient and scalable way to estimate cardinality with a controlled
Unfortunately, Beam SQL doesn’t support COUNT(DISTINCT) aggregation.
More details about “why” are in this discussion [1], and the related open
issue is here [2].
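For anyone curious how an approximate distinct count of this kind works, here
is a minimal sketch in Python. It is a plain HyperLogLog, not the
HyperLogLogPlus variant mentioned above, and it is not Beam code. The class
name `HyperLogLog`, the precision parameter `p`, and the use of SHA-1 as the
hash are assumptions made purely for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Plain HyperLogLog sketch (illustrative; not the HyperLogLogPlus variant)."""

    def __init__(self, p=12):
        # p index bits give m = 2**p registers; the alpha formula below
        # is only valid for m >= 128, hence the lower bound on p.
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Take a 64-bit hash of the value (first 8 bytes of SHA-1).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The top p bits select a register.
        idx = h >> (64 - self.p)
        # The remaining bits give the rank: position of the leftmost 1-bit.
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Merging two sketches is a register-wise max, which is what makes
        # the structure usable as a combiner over partial results.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # For small cardinalities, fall back to linear counting.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog(p=12)
    for i in range(50000):
        hll.add(f"user-{i}")
        hll.add(f"user-{i}")  # duplicates do not change the estimate
    print(round(hll.estimate()))
```

With p = 12 the sketch keeps 4096 small registers regardless of input size,
and the typical relative error is about 1.04 / sqrt(4096), roughly 1.6%.
Raising p trades memory for accuracy.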
—
Alexey
[1] https://lists.apache.org/thread/hvmy6d5dls3m8xcnf74hfmy1xxfgj2xh
[2] https://github.com/apache/beam/issues/19398