I ended up with the following:
def firstValue(items: Iterable[String]) = for { i <- items
} yield (i, items.head)
data.groupByKey().map{case(a, b)=>(a, firstValue(b))}.collect
More details:
http://dmtolpeko.com/2015/02/17/first_value-last_value-lead-and-lag-in-spark/
I would appreciate any feedback.
Dmitry
On Fri, Feb 13, 2015 at 11:54 AM, Dmitry Tolpeko <[email protected]>
wrote:
> Hello,
>
> To convert existing Map Reduce jobs to Spark, I need to implement window
> functions such as FIRST_VALUE, LEAD, LAG and so on. For example,
> FIRST_VALUE function:
>
> Source (1st column is key):
>
> A, A1
> A, A2
> A, A3
> B, B1
> B, B2
> C, C1
>
> and the result should be
>
> A, A1, A1
> A, A2, A1
> A, A3, A1
> B, B1, B1
> B, B2, B1
> C, C1, C1
>
> You can see that the first value in a group is repeated in each row.
>
> My current Spark/Scala code:
>
> def firstValue(b: Iterable[String]) : List[(String, String)] = {
> val c = scala.collection.mutable.MutableList[(String, String)]()
> var f = ""
> b.foreach(d => { if(f.isEmpty()) f = d; c += d -> f})
> c.toList
> }
>
> val data=sc.parallelize(List(
> ("A", "A1"),
> ("A", "A2"),
> ("A", "A3"),
> ("B", "B1"),
> ("B", "B2"),
> ("C", "C1")))
>
> data.groupByKey().map{case(a, b)=>(a, firstValue(b))}.collect
>
> So I create a new list after groupByKey. Is it right approach to do this
> in Spark? Are there any other options? Please point me to any drawbacks.
>
> Thanks,
>
> Dmitry
>
>