On Mon, Aug 18, 2014 at 7:41 PM, fil <f...@pobox.com> wrote:
> fil wrote
>> - Python functions like groupCount; these get reflected from their Python
>> AST and converted into a Spark DAG? Presumably if I try to do something
>> non-convertible, this transformation process will throw an error? In other
>> words, this runs in the JVM.
>
> Further to this - it seems that Python does run on each node in the
> cluster, meaning it runs outside the JVM. Presumably this means that
> writing this in Scala would be far more performant.
>
> Could I write groupCount() in Scala, and then use it from PySpark? Care to
> supply an example? I'm finding them hard to find :)

It's doable, but not so convenient. If you really care about the performance
difference, you should write your program in Scala.
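For the record, here is a minimal sketch of what calling Scala from PySpark
can look like, going through the Py4J gateway. Everything named here is
hypothetical: it assumes a Scala object com.example.GroupCount compiled into
a JAR and shipped with --jars, and it leans on private attributes (sc._jvm,
sc._jsc) that may change between Spark releases.

    # Hypothetical Scala side, compiled into groupcount.jar (--jars):
    #
    #   package com.example
    #   object GroupCount {
    #     def fromTextFile(jsc: JavaSparkContext,
    #                      path: String): java.util.Map[String, Long] = ...
    #   }
    #
    from pyspark import SparkContext

    sc = SparkContext(appName="scala-from-pyspark")

    # sc._jvm reaches classes on the JVM classpath through the Py4J
    # gateway; sc._jsc is the underlying JavaSparkContext. Both are
    # private APIs.
    counts = sc._jvm.com.example.GroupCount.fromTextFile(
        sc._jsc, "hdfs:///data.txt")

    # Py4J exposes java.util.Map as a dict-like object.
    print(dict(counts))

Note that the JVM-side method here takes a path rather than an RDD on
purpose: handing over a Python RDD's _jrdd is awkward, because on the JVM
side its elements are pickled bytes. That is part of why this route is "not
so convenient".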
> fil wrote
>> - I had considered that "partitions" were batches of distributable work,
>> and generally large. Presumably the above is OK with small groups (e.g.
>> average size < 10) - this won't kill performance?
>
> I'm still a bit confused about the dual meaning of partition: work
> segmentation, and key groups. Care to clarify, anyone - when are partitions
> used to describe chunks of data for different nodes in the cluster (i.e.
> large), and when are they used to describe groups of items in data (i.e.
> small)?

A partition is a chunk of data in an RDD; the computation on a partition is
a task, which is sent to a node in the cluster. Key groups (such as the
groups produced by groupByKey) are logical groupings within the data and are
independent of how the RDD happens to be partitioned.
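To make the distinction concrete, a small sketch (the data and partition
count are arbitrary, and the printed ordering may vary): glom() exposes the
physical partitions, while groupByKey() produces per-key groups that cut
across them.

    from pyspark import SparkContext

    sc = SparkContext(appName="partitions-vs-groups")

    # Two physical partitions, chosen explicitly for the example.
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)], 2)

    # glom() collects each partition into a list, showing the chunks that
    # become tasks on the cluster.
    print(pairs.glom().collect())
    # e.g. [[('a', 1), ('b', 2)], [('a', 3), ('c', 4)]]

    # groupByKey() gathers all values for a key, regardless of which
    # partition they started in - these are the small "key groups".
    print(pairs.groupByKey().mapValues(list).collect())
    # e.g. [('a', [1, 3]), ('b', [2]), ('c', [4])]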