Dear Davies,

Thanks so much for your instructions! It worked like a charm.

Best,
Diederik
On Wed, Jul 30, 2014 at 1:27 AM, Davies Liu [via Apache Spark User List] wrote:

> Hey Diederik,
>
> The data in rdd._jrdd.rdd() is serialized by pickle in batch mode by
> default, so the number of rows in it is much smaller than in the rdd.
> For example:
>
> >>> size = 100
> >>> d = [i % size for i in range(1, 100000)]
> >>> rdd = sc.parallelize(d)
> >>> rdd.count()
> 99999
> >>> rdd._jrdd.rdd().count()
> 98L
> >>> rdd._jrdd.rdd().countApproxDistinct(4, 0)
> 29L
> >>> rdd._jrdd.rdd().countApproxDistinct(8, 0)
> 24L
>
> In order to call countApproxDistinct() in Scala, you need to disable
> batch-mode serialization:
>
> >>> from pyspark.serializers import PickleSerializer
> >>> sc.serializer = PickleSerializer()
> >>> rdd = rdd.map(lambda x: x)  # change serializer
> >>> rdd._jrdd.rdd().count()
> 99999L
> >>> rdd._jrdd.rdd().countApproxDistinct(4, 0)
> 98L
> >>> rdd._jrdd.rdd().countApproxDistinct(8, 0)
> 103L
>
> Davies
>
>
> On Tue, Jul 29, 2014 at 11:45 AM, Diederik wrote:
> >
> > Heya,
> >
> > I would like to use countApproxDistinct in pyspark; I know that it's an
> > experimental method and that it is not yet available in pyspark. I
> > started by porting the countApproxDistinct unit test to Python, see
> > https://gist.github.com/drdee/d68eaf0208184d72cbff. Surprisingly, the
> > results are way off.
> >
> > Using Scala, I get the following two counts (using
> > https://github.com/apache/spark/blob/4c7243e109c713bdfb87891748800109ffbaae07/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala#L78-87):
> >
> > scala> simpleRdd.countApproxDistinct(4, 0)
> > res2: Long = 73
> >
> > scala> simpleRdd.countApproxDistinct(8, 0)
> > res3: Long = 99
> >
> > In Python, with the same RDD as you can see in the gist, I get the
> > following results:
> >
> > In [7]: rdd._jrdd.rdd().countApproxDistinct(4, 0)
> > Out[7]: 29L
> >
> > In [8]: rdd._jrdd.rdd().countApproxDistinct(8, 0)
> > Out[8]: 26L
> >
> > Clearly, I am doing something wrong here :) What is also weird is that
> > when I set p to 8, I should get a more accurate number, but it is
> > actually smaller. Any tips or pointers are much appreciated!
> >
> > Best,
> > Diederik
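A quick sanity check of Davies' explanation: with batch-mode pickling, each batch of Python rows becomes a single row in the underlying Java RDD, so `rdd._jrdd.rdd().count()` returns the number of batches rather than the number of elements. A minimal arithmetic sketch, assuming pyspark's default batch size of 1024 (an assumption about the version in the thread, not something stated in it):

```python
import math

def num_batches(n_elements, batch_size=1024):
    """Rows seen on the Java side under batched pickling:
    each batch of `batch_size` Python elements is pickled into
    one Java-side row, so the count is ceil(n / batch_size)."""
    return math.ceil(n_elements / batch_size)

# Diederik's RDD has 99999 elements; with an assumed batch size
# of 1024, the Java-side RDD would contain only:
print(num_batches(99999))  # -> 98, matching the 98L from rdd._jrdd.rdd().count()
```

This also explains why countApproxDistinct on the unmodified `_jrdd` estimates the distinct count of the ~98 pickled batch blobs instead of the 100 distinct Python values, and why switching to an unbatched PickleSerializer (as in Davies' session above) fixes it.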