In [7]: rdd._jrdd.rdd().countApproxDistinct(4, 0)
Out[7]: 29L

In [8]: rdd._jrdd.rdd().countApproxDistinct(8, 0)
Out[8]: 26L

Clearly, I am doing something wrong here :)
What is also weird is that when
I set p to 8, I should get a more accurate number, but it's actually
smaller. Any tips or pointers are much appreciated!
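
For reference, here is a rough sketch of how p relates to the expected accuracy. It assumes that countApproxDistinct(p, sp) behaves like a standard HyperLogLog sketch with 2**p registers, whose relative standard error is commonly quoted as about 1.04 / sqrt(2**p). The helper name hll_relative_error is made up for illustration and is not part of Spark.

```python
import math


def hll_relative_error(p):
    """Approximate relative standard error of a HyperLogLog sketch.

    Assumes the textbook HyperLogLog figure of 1.04 / sqrt(m) with
    m = 2**p registers; the constant used by a given library may differ.
    """
    return 1.04 / math.sqrt(2 ** p)


# Compare the two precisions used in the session above.
for p in (4, 8):
    print("p=%d: about %.1f%% relative standard error"
          % (p, 100 * hll_relative_error(p)))
```

Under that assumption the error is roughly 26% for p=4 and 6.5% for p=8, so an estimate of 29 at p=4 and 26 at p=8 are not necessarily inconsistent with each other: a higher p narrows the error band but does not have to move the estimate upward.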
Best,
Diederik
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Using-countApproxDistinct-in-pyspark-tp1087