>
> > Could I write groupCount() in Scala, and then use it from PySpark? Care
> > to supply an example? I'm finding them hard to find :)
>
> It's doable, but not so convenient. If you really care about the
> performance difference, you should write your program in Scala.
>
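Since examples are indeed hard to find, here is a rough sketch of the usual
Py4J route, under these assumptions: a hypothetical Scala object
com.example.GroupCount exposing groupCount(df: DataFrame): DataFrame has been
compiled into a jar and shipped with spark-submit --jars. None of this is a
public API for the purpose; _jvm and _jdf are PySpark internals, which is
much of why it is "not so convenient":

    from pyspark.sql import SparkSession, DataFrame

    spark = SparkSession.builder.appName("scala-groupcount-from-pyspark").getOrCreate()

    df = spark.createDataFrame([(1,), (1,), (1,), (2,), (2,)], ["id"])

    # Reach the JVM through the Py4J gateway and hand over the underlying
    # Java DataFrame. _jvm and _jdf are internal PySpark attributes.
    jdf = spark.sparkContext._jvm.com.example.GroupCount.groupCount(df._jdf)

    # Wrap the returned Java DataFrame back into a PySpark DataFrame.
    # Depending on the Spark version the second argument is the SQLContext
    # (df.sql_ctx) or the SparkSession itself.
    result = DataFrame(jdf, df.sql_ctx)
    result.show()

The wrap/unwrap through _jdf is the awkward part; everything else is ordinary
PySpark.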
Is it possible to wr
fil wrote:
> - Python functions like groupCount; these get reflected from their Python
> AST and converted into a Spark DAG? Presumably if I try and do something
> non-convertible this transformation process will throw an error? In other
> words this runs in the JVM.
Further to this:

will happen frequently. So by this you mean scanning the
resulting mapPartitions() results? Presumably I could eliminate adjacent
duplicates - or specifically look for duplicates at the end/start of the
"batches" (what is the Spark term for this?) coming from different nodes in
the cluster. What
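For what it's worth, here is a minimal sketch of that idea, assuming the
values sit in a plain RDD; the helpers runs_in_partition and
merge_partition_runs are made-up names, not anything in Spark. Each partition
(that is the Spark term for the per-node "batches") is run-length counted
with mapPartitionsWithIndex(), and the small per-partition summaries are then
stitched together on the driver, merging any run that was split across a
partition boundary, i.e. exactly the end/start duplicates mentioned above.
It uses the example input from the question below:

    from itertools import groupby

    from pyspark import SparkContext

    def runs_in_partition(index, iterator):
        # Run-length count one partition's values, tagged with the partition index.
        yield (index, [(value, sum(1 for _ in group))
                       for value, group in groupby(iterator)])

    def merge_partition_runs(per_partition):
        # Concatenate the per-partition run lists in partition order, joining
        # the last run of one partition with the first run of the next when
        # the value is the same (a run split across a partition boundary).
        merged = []
        for _, runs in sorted(per_partition):
            for value, count in runs:
                if merged and merged[-1][0] == value:
                    merged[-1] = (value, merged[-1][1] + count)
                else:
                    merged.append((value, count))
        return merged

    if __name__ == "__main__":
        sc = SparkContext(appName="segmented-fold-count-sketch")
        rdd = sc.parallelize([1, 1, 1, 2, 2, 3, 4, 4, 5, 1], 3)
        per_partition = rdd.mapPartitionsWithIndex(runs_in_partition).collect()
        print(merge_partition_runs(per_partition))
        # [(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]
        sc.stop()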
Can anyone assist with a scan of the following kind (Python preferred, but
whatever...)? I'm looking for a kind of segmented fold count.

Input:  [1, 1, 1, 2, 2, 3, 4, 4, 5, 1]
Output: [(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]
or preferably two output columns:
id: [1,2,3,4,5,1]
count: [3,2,1,2,1,1]
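On a single machine (or within one partition) itertools.groupby gives this
segmented fold count directly; a minimal sketch:

    from itertools import groupby

    data = [1, 1, 1, 2, 2, 3, 4, 4, 5, 1]

    # One (id, count) pair per run of equal adjacent values.
    pairs = [(value, sum(1 for _ in group)) for value, group in groupby(data)]
    print(pairs)          # [(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]

    # Or as the two preferred output columns.
    ids, counts = zip(*pairs)
    print(list(ids))      # [1, 2, 3, 4, 5, 1]
    print(list(counts))   # [3, 2, 1, 2, 1, 1]

The mapPartitions() version sketched earlier is just this applied per
partition, plus a merge of the runs that straddle partition boundaries.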
I