> > Could I write groupCount() in Scala, and then use it from PySpark?
> > Care to supply an example, I'm finding them hard to find :)
>
> It's doable, but not so convenient. If you really care about the
> performance difference, you should write your program in Scala.

Is it possible to write my groupCount() and use it interactively with
Scala? Can you easily build up libraries of extra functions like this?
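To make the question concrete, here is roughly the shape I have in
mind - just a sketch against the Spark 1.x RDD API, where the object
name and the per-key-count body are placeholders of mine, not anything
from earlier in the thread:

// GroupCount.scala - a minimal sketch; substitute your real
// groupCount() logic for the per-key count used here.
package example

import scala.reflect.ClassTag

import org.apache.spark.SparkContext._  // pair-RDD implicits on Spark 1.x
import org.apache.spark.rdd.RDD

object GroupCount {
  // Count how many times each distinct value occurs in the RDD.
  def groupCount[T: ClassTag](rdd: RDD[T]): RDD[(T, Long)] =
    rdd.map(x => (x, 1L)).reduceByKey(_ + _)
}

// Interactive use, after `spark-shell --jars groupcount.jar`
// (or :paste the object straight into the shell):
//
//   scala> example.GroupCount.groupCount(sc.parallelize(Seq("a", "b", "a"))).collect()
//   res0: Array[(String, Long)] = Array((a,2), (b,1))   // order may vary

My understanding is that PySpark can reach such an object through the
py4j gateway (sc._jvm.example.GroupCount), but a Python RDD shows up in
the JVM as an RDD of pickled byte arrays that the Scala side would have
to deserialize itself - which I take to be the inconvenience mentioned
above.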
> > I'm still a bit confused about the dual meaning of "partition": work
> > segmentation and key groups. Care to clarify, anyone - when are
> > partitions used to describe chunks of data for different nodes in the
> > cluster (i.e. large), and when are they used to describe groups of
> > items in the data (i.e. small)?
>
> A partition is a chunk of data in an RDD; the computation on a
> partition is a task, which is sent to a node in the cluster.

So is a partition sometimes a chunk of data that relates to a single
key, or is this only ever by coincidence?
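To show where my confusion comes from, here is a small spark-shell
experiment (the data and partition count are made up) that lists which
keys land in which partition:

import org.apache.spark.HashPartitioner

// 5 records, 4 distinct keys, hashed into 2 partitions.
val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 1, "c" -> 1, "a" -> 1, "d" -> 1))
  .partitionBy(new HashPartitioner(2))

// List the keys that ended up in each partition.
pairs.mapPartitionsWithIndex((i, it) => Iterator(i -> it.map(_._1).toList))
  .collect().foreach(println)
// e.g. (0,List(b, d))
//      (1,List(a, c, a))   // exact placement depends on the hash

My tentative reading is that a partition is purely a unit of work and
placement, and normally holds a mix of keys; it would correspond to a
single key only under a custom Partitioner deliberately built that way,
or by accident. Is that right?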