Dear Davies,

Thanks so much for your instructions! It worked like a charm.

Best,
Diederik
On Wed, Jul 30, 2014 at 1:27 AM, Davies Liu [via Apache Spark User List] wrote:

> Hey Diederik,
>
> The data in rdd._jrdd.rdd() is serialized by pickle in batch mode by
> default, so the number of rows in it is much smaller than in the rdd.
> For example:
>
> >>> size = 100
> >>> d = [i % size for i in range(1, 100000)]
> >>> rdd = sc.parallelize(d)
> >>> rdd.count()
> 99999
> >>> rdd._jrdd.rdd().count()
> 98L
> >>> rdd._jrdd.rdd().countApproxDistinct(4, 0)
> 29L
> >>> rdd._jrdd.rdd().countApproxDistinct(8, 0)
> 24L
>
> In order to call countApproxDistinct() in Scala, you need to disable
> batch-mode serialization:
>
> >>> from pyspark.serializers import PickleSerializer
> >>> sc.serializer = PickleSerializer()
> >>> rdd = rdd.map(lambda x: x)  # change serializer
> >>> rdd._jrdd.rdd().count()
> 99999L
> >>> rdd._jrdd.rdd().countApproxDistinct(4, 0)
> 98L
> >>> rdd._jrdd.rdd().countApproxDistinct(8, 0)
> 103L
>
> Davies
>
>
> On Tue, Jul 29, 2014 at 11:45 AM, Diederik wrote:
> >
> > Heya,
> >
> > I would like to use countApproxDistinct in pyspark; I know that it's an
> > experimental method and that it is not yet available in pyspark. I
> > started by porting the countApproxDistinct unit test to Python, see
> > https://gist.github.com/drdee/d68eaf0208184d72cbff. Surprisingly, the
> > results are way off.
> >
> > Using Scala, I get the following two counts (using
> > https://github.com/apache/spark/blob/4c7243e109c713bdfb87891748800109ffbaae07/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala#L78-87):
> >
> > scala> simpleRdd.countApproxDistinct(4, 0)
> > res2: Long = 73
> >
> > scala> simpleRdd.countApproxDistinct(8, 0)
> > res3: Long = 99
> >
> > In Python, with the same RDD as you can see in the gist, I get the
> > following results:
> >
> > In [7]: rdd._jrdd.rdd().countApproxDistinct(4, 0)
> > Out[7]: 29L
> >
> > In [8]: rdd._jrdd.rdd().countApproxDistinct(8, 0)
> > Out[8]: 26L
> >
> > Clearly, I am doing something wrong here :) What is also weird is that
> > when I set p to 8, I should get a more accurate number, but it is
> > actually smaller. Any tips or pointers are much appreciated!
> >
> > Best,
> > Diederik
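A quick sanity check of Davies' explanation: with batch-mode pickling, each batch of Python rows becomes a single row in the underlying Java RDD, so `rdd._jrdd.rdd().count()` returns the number of batches rather than the number of elements. A minimal arithmetic sketch, assuming pyspark's default batch size of 1024 (an assumption about the version in the thread, not something stated in it):

```python
import math

def num_batches(n_elements, batch_size=1024):
    """Rows seen on the Java side under batched pickling:
    each batch of `batch_size` Python elements is pickled into
    one Java-side row, so the count is ceil(n / batch_size)."""
    return math.ceil(n_elements / batch_size)

# Diederik's RDD has 99999 elements; with an assumed batch size
# of 1024, the Java-side RDD would contain only:
print(num_batches(99999))  # -> 98, matching the 98L from rdd._jrdd.rdd().count()
```

This also explains why countApproxDistinct on the unmodified `_jrdd` estimates the distinct count of the ~98 pickled batch blobs instead of the 100 distinct Python values, and why switching to an unbatched PickleSerializer (as in Davies' session above) fixes it.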