Re: word count (group by users) in spark

2015-09-21 Thread Huy Banh
Hi Sri, The yield is a Scala keyword. In a for comprehension, it builds a result collection, and yield adds elements to that collection. To output strings instead of tuples, you could use aggregateByKey instead of groupByKey: val output = wordCounts. map({case ((user, word), count)
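The snippet above is cut off, but the idea seems to be: re-key the counts by user, then fold the per-word counts into one string per user. A minimal sketch of that shape on plain Scala collections (no SparkContext; groupBy plus mkString stands in for aggregateByKey, and the sample data is an assumption):

```scala
// wordCounts as it would look after the reduceByKey step: ((user, word), count)
val wordCounts = List((("u1", "one"), 2), (("u1", "two"), 1), (("u2", "three"), 2))

// Re-key by user and render each count as a "word: count" string
val byUser = wordCounts.map { case ((user, word), count) => (user, word + ": " + count) }

// Local stand-in for the aggregateByKey step: concatenate strings per user
val output = byUser.groupBy(_._1).map { case (user, pairs) =>
  (user, pairs.map(_._2).mkString(", "))
}

output.toList.sortBy(_._1).foreach(println)
```

On a real RDD the groupBy/mkString pair would be a single aggregateByKey("") call, which merges strings map-side instead of collecting all values first.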

Re: word count (group by users) in spark

2015-09-21 Thread Zhang, Jingyu
Spark spills data to disk when there is more data shuffled onto a single executor machine than can fit in memory. However, it flushes out the data to disk one key at a time - so if a single key has more key-value pairs than can fit in memory, an out of memory exception occurs. Cheers, Jingyu On
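To make the distinction concrete, here is a local-collections analogue (not the Spark API itself) of the two strategies: a groupByKey-style step must materialize every value for a key at once, while a reduceByKey-style fold only ever holds one running count per key, which is why a single hot key hurts the former but not the latter:

```scala
val pairs = List(("u1", 1), ("u1", 1), ("u2", 1))

// groupByKey-style: all values for a key are held in memory at once
val grouped = pairs.groupBy(_._1).map { case (user, vs) => (user, vs.map(_._2)) }

// reduceByKey-style: values are folded as they arrive; one Int per key
val reduced = pairs.foldLeft(Map.empty[String, Int]) { case (acc, (user, n)) =>
  acc.updated(user, acc.getOrElse(user, 0) + n)
}
```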

Re: word count (group by users) in spark

2015-09-20 Thread Aniket Bhatnagar
Unless I am mistaken, a group-by operation spills to disk when the values for a key don't fit in memory. Thanks, Aniket On Mon, Sep 21, 2015 at 10:43 AM Huy Banh wrote: > Hi, > > If your input format is user -> comment, then you could: > > val comments = sc.parallelize(List(("u1", "one tw

Re: word count (group by users) in spark

2015-09-20 Thread Huy Banh
Hi, If your input format is user -> comment, then you could:

val comments = sc.parallelize(List(("u1", "one two one"), ("u2", "three four three")))
val wordCounts = comments.
  flatMap({case (user, comment) =>
    for (word <- comment.split(" ")) yield(((user, word), 1))
  }).
  reduceByKey(_
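The message is truncated at reduceByKey. The same pipeline can be sketched end to end on plain Scala collections (sc.parallelize and reduceByKey(_ + _) replaced by their local equivalents; the input data is from the message above):

```scala
val comments = List(("u1", "one two one"), ("u2", "three four three"))

// Emit one ((user, word), 1) pair per word, as in the flatMap above
val pairs = comments.flatMap { case (user, comment) =>
  for (word <- comment.split(" ").toList) yield ((user, word), 1)
}

// Local equivalent of reduceByKey(_ + _): sum the 1s per (user, word) key
val wordCounts = pairs.groupBy(_._1).map { case (key, vs) => (key, vs.map(_._2).sum) }
```

Keying on the (user, word) pair rather than on user alone is what lets reduceByKey combine counts map-side instead of shuffling every occurrence.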

Re: word count (group by users) in spark

2015-09-19 Thread Aniket Bhatnagar
Using the Scala API, you can first group by user and then use combineByKey. Thanks, Aniket On Sat, Sep 19, 2015, 6:41 PM kali.tumm...@gmail.com wrote: > Hi All, > I would like to achieve the below output using Spark. I managed to write > it in Hive and call it from Spark, but not in just Spark (Scala),
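A sketch of the "group by user first, then combine" shape on plain Scala collections (not the Spark combineByKey API; the sample data is assumed from the rest of the thread):

```scala
val comments = List(("u1", "one two one"), ("u2", "three four three"))

// Step 1: group all comments by user
// Step 2: count words within each user's group (the combineByKey step in Spark)
val counts = comments.groupBy(_._1).map { case (user, rows) =>
  val words = rows.flatMap(_._2.split(" ").toList)
  (user, words.groupBy(identity).map { case (w, ws) => (w, ws.size) })
}
```

Note that grouping by user first means all of one user's words pass through a single task, which is the memory concern raised elsewhere in this thread; keying on (user, word) avoids that.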