>
> > Could I write groupCount() in Scala, and then use it from PySpark? Care
> > to supply an example, I'm finding them hard to find :)
>
> It's doable, but not so convenient. If you really care about the
> performance difference, you should write your program in Scala.
>

Is it possible to write my groupCount() and then use it interactively from
the Scala shell? And can you easily build up libraries of additional
functions like this?
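
For concreteness, here is a minimal sketch of what I mean, pasted into
spark-shell with :paste (assuming "groupCount" counts runs of consecutive
identical values, as in the subject line; the names are mine, and runs
crossing partition boundaries are not merged):

    import org.apache.spark.rdd.RDD
    import scala.collection.mutable.ArrayBuffer

    // Count runs of consecutive, identical values within each partition.
    // Sketch only: a run that spans a partition boundary is counted as
    // two runs, so the result is exact only for a single partition.
    def groupCount[T](rdd: RDD[T]): RDD[(T, Int)] =
      rdd.mapPartitions { iter =>
        val runs = ArrayBuffer.empty[(T, Int)]
        for (x <- iter) {
          if (runs.nonEmpty && runs.last._1 == x)
            runs(runs.size - 1) = (x, runs.last._2 + 1)
          else
            runs += ((x, 1))
        }
        runs.iterator
      }

Used interactively:

    scala> val data = sc.parallelize(Seq("a", "a", "b", "b", "b", "a"), 1)
    scala> groupCount(data).collect()
    res0: Array[(String, Int)] = Array((a,2), (b,3), (a,1))

Presumably the same definition compiled into a jar and passed to
spark-shell via --jars would give a reusable library of such functions?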


> > I'm still a bit confused about the dual meaning of partition: work
> > segmentation, and key groups. Care to clarify, anyone - when are
> > partitions used to describe chunks of data for different nodes in the
> > cluster (i.e. large), and when are they used to describe groups of
> > items in data (i.e. small)?
>
> A partition is a chunk of data in an RDD; the computation on a partition
> is a task, which is sent to a node in the cluster.
>
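
So, to check my understanding with a concrete (made-up) example in the
shell - the partition count fixes the number of tasks:

    scala> val rdd = sc.parallelize(1 to 100, 4)  // 4 partitions
    scala> rdd.partitions.size
    res0: Int = 4
    scala> rdd.map(_ * 2).count()  // runs as 4 tasks, one per partition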

So is a partition sometimes a chunk of data that relates to a single key -
or is this only ever by coincidence?
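
My (possibly wrong) reading is that a partitioner only guarantees that all
records with the same key land in the same partition, and that a partition
normally holds many keys, e.g.:

    scala> import org.apache.spark.HashPartitioner
    scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))
    scala> val parted = pairs.partitionBy(new HashPartitioner(2))
    scala> parted.glom().collect()  // one array per partition
    // every ("a", _) record lands in the same partition, but shares it
    // with other keys; a one-key partition would be a coincidence (or
    // need a custom partitioner)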



