Hi Sri,
yield is a Scala keyword, not part of Spark. In a for comprehension, the
comprehension builds and returns a collection, and yield adds an element to
that returned collection on each iteration.
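For example, a tiny standalone snippet (illustrative only, outside Spark):

val squares = for (n <- List(1, 2, 3)) yield n * n
// squares == List(1, 4, 9)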
To output strings instead of tuple data, you could use aggregateByKey
instead of groupByKey:
val output = wordCounts.
  map({case ((user, word), count) => (user, word + "," + count)}).
  aggregateByKey("")(_ + " " + _, _ + " " + _)
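With the sample input from Huy's snippet below, output would hold one
"word,count" string per user, e.g. ("u1", "one,2 two,1") (element order
isn't guaranteed, and the empty zero value leaves a leading space you may
want to trim).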
Spark spills data to disk when more data is shuffled onto a single executor
machine than can fit in memory. However, it flushes the data to disk one
key at a time, so if a single key has more key-value pairs than fit in
memory, an out-of-memory error occurs.
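That is why a map-side combining operator is usually safer than groupByKey
for skewed keys. A minimal sketch, assuming pairs is an RDD of (key, count)
tuples:

// Risky: groupByKey ships every value for a key to one executor.
val grouped = pairs.groupByKey()
// Safer: reduceByKey combines values map-side before the shuffle, so no
// single key's full value list must fit in memory at once.
val reduced = pairs.reduceByKey(_ + _)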
Cheers,
Jingyu
Unless I am mistaken, in a group-by operation, Spark spills to disk if the
values for a key don't fit in memory.
Thanks,
Aniket
On Mon, Sep 21, 2015 at 10:43 AM Huy Banh wrote:
> Hi,
>
> If your input format is user -> comment, then you could:
>
> val comments = sc.parallelize(List(("u1", "one two one"), ("u2", "three
> four three")))
Hi,
If your input format is user -> comment, then you could:
val comments = sc.parallelize(List(("u1", "one two one"), ("u2", "three
four three")))
val wordCounts = comments.
  flatMap({case (user, comment) =>
    for (word <- comment.split(" ")) yield(((user, word), 1)) }).
  reduceByKey(_ + _)
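With that input, wordCounts contains ((u1,one),2), ((u1,two),1),
((u2,three),2) and ((u2,four),1).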
Using the Scala API, you can first key the data by user and then use
combineByKey.
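A rough sketch of that idea, reusing the sample comments data shown above
(the Map accumulator is just one illustrative choice):

val comments = sc.parallelize(List(("u1", "one two one"), ("u2", "three four three")))
// Key each word by its user, then let combineByKey build a per-user
// Map of word -> count.
val pairs = comments.flatMap { case (user, comment) =>
  comment.split(" ").map(word => (user, word)) }
val perUser = pairs.combineByKey(
  (w: String) => Map(w -> 1),
  (m: Map[String, Int], w: String) => m + (w -> (m.getOrElse(w, 0) + 1)),
  (a: Map[String, Int], b: Map[String, Int]) =>
    b.foldLeft(a) { case (m, (w, c)) => m + (w -> (m.getOrElse(w, 0) + c)) })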
Thanks,
Aniket
On Sat, Sep 19, 2015, 6:41 PM kali.tumm...@gmail.com wrote:
> Hi All,
> I would like to achieve the output below using Spark. I managed to write
> it in Hive and call it from Spark, but not in plain Spark (Scala),