On Mon, Aug 18, 2014 at 7:41 PM, fil <f...@pobox.com> wrote:
> fil wrote
>> - Python functions like groupCount; these get reflected from their Python
>> AST and converted into a Spark DAG? Presumably if I try and do something
>> non-convertible this transformation process will throw an error? In other
>> words this runs in the JVM.
>
> Further to this - it seems that Python does run on each node in the cluster,
> meaning it runs outside the JVM. Presumably this means that writing this in
> Scala would be far more performant.
>
> Could I write groupCount() in Scala, and then use it from PySpark? Care to
> supply an example? I'm finding them hard to find :)

It's doable, but not so convenient. If you really care about the performance
difference, you should write your program in Scala.
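
If you want to try it anyway, the usual route is to compile the Scala code
into a jar, ship it with --jars, and reach it through the py4j gateway. A
minimal sketch, assuming a hypothetical Scala object com.example.GroupCount
with a count(JavaRDD) method (both names are made up here):

    from pyspark import SparkContext

    sc = SparkContext(appName="scala-from-pyspark")
    rdd = sc.parallelize([("a", 1), ("a", 1), ("b", 1)])

    # sc._jvm is py4j's gateway into the driver JVM; rdd._jrdd is the
    # JavaRDD backing the Python RDD. Both are private APIs, and the
    # elements of _jrdd are pickled byte arrays, so the Scala side has
    # to deserialize them itself; that is the inconvenient part.
    result = sc._jvm.com.example.GroupCount.count(rdd._jrdd)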

>
> fil wrote
>> - I had considered that "partitions" were batches of distributable work,
>> and generally large. Presumably the above is OK with small groups (eg.
>> average size < 10) - this won't kill performance?
>
> I'm still a bit confused about the dual meaning of partition: work
> segmentation, and key groups. Care to clarify, anyone? When are partitions
> used to describe chunks of data for different nodes in the cluster (i.e.
> large), and when are they used to describe groups of items in the data
> (i.e. small)?

A partition is a chunk of data in an RDD; the computation on one partition
is a task, which is sent to a node in the cluster. The key groups produced
by groupBy() are not partitions: a single partition normally holds many
groups, so small groups do not mean small tasks.
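
To make the two meanings concrete, a minimal sketch (assuming the 1.x
Python API, where getNumPartitions() reports the partition count):

    from pyspark import SparkContext

    sc = SparkContext(appName="partitions-demo")

    # 8 elements split into 4 partitions: each partition is a chunk of
    # the RDD, and computing each chunk is one task on some node.
    rdd = sc.parallelize(range(8), 4)
    print(rdd.getNumPartitions())  # 4

    # Key groups are a different thing: after groupByKey() all values
    # for a key sit together, but many such groups share one partition.
    pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
    print(pairs.groupByKey().mapValues(list).collect())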


