Thanks for your reply. I used a simple (a + b) function for reduceByKey(), nothing special, and I'm running Spark 1.3.1 RC3 (I don't know how much it differs from the final release).
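For reference, the call was essentially the following (the RDD name 'pairs' is just a placeholder, not my actual variable):

    counts = pairs.reduceByKey(lambda a, b: a + b)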
Date: Sat, 18 Apr 2015 08:28:50 -0700
Subject: Re: Does reduceByKey only work properly for numeric keys?
From: yuzhih...@gmail.com
To: lovejay-lovemu...@outlook.com
CC: user@spark.apache.org

Can you show us the function you passed to reduceByKey()? What release of Spark are you using?

Cheers

On Sat, Apr 18, 2015 at 8:17 AM, SecondDatke <lovejay-lovemu...@outlook.com> wrote:

I'm trying to solve a Word-Count-like problem; the difference is that I need the count of a specific word within a specific timespan in a social message stream.

My data is in the format (time, message), and I transformed it (flatMap etc.) into a series of (time, word_id) pairs, where time is represented with the Python datetime.datetime class. I then transformed these into ((time, word_id), 1) and used reduceByKey to get the result.

But the dataset returned is a little weird, like the following:

format: ((timespan as datetime.datetime, word_id), freq)

((datetime.datetime(2009, 10, 6, 2, 0), 0), 8)
((datetime.datetime(2009, 10, 6, 3, 0), 0), 3)
((datetime.datetime(2009, 10, 6, 3, 0), 0), 14)

As you can see, there are DUPLICATED keys, but as the result of reduceByKey all keys SHOULD BE UNIQUE.

I tried converting the key to a string (like '2006-12-02 21:00:00-000') and running reduceByKey again; the problem stayed. The only option left seems to be converting the date to a numeric timestamp, and that does work.

Is this the expected behavior of reduceByKey (and all other transformations that work with keys)? I'm still working on it.
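For completeness, here is a minimal, self-contained sketch of the pipeline I described, together with the timestamp workaround. The vocabulary, tokenizer, and sample data below are illustrative stand-ins, not my real job:

    import time
    from datetime import datetime
    from pyspark import SparkContext

    sc = SparkContext(appName="timespan-word-count")

    # Toy vocabulary and tokenizer, stand-ins for the real flatMap logic.
    vocab = {"spark": 0, "hadoop": 1}

    def to_pairs(record):
        t, message = record  # t: datetime.datetime truncated to the hour
        return [((t, vocab[w]), 1)
                for w in message.lower().split() if w in vocab]

    messages = sc.parallelize([
        (datetime(2009, 10, 6, 2), "spark spark"),
        (datetime(2009, 10, 6, 3), "spark hadoop"),
    ])

    pairs = messages.flatMap(to_pairs)

    # Pipeline as described: on my Spark 1.3.1 RC3 setup this is where
    # duplicate (datetime, word_id) keys show up in the output.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Workaround: key on a numeric timestamp instead of the datetime
    # object (time.mktime converts via local time).
    counts_ts = (pairs
                 .map(lambda kv: ((time.mktime(kv[0][0].timetuple()),
                                   kv[0][1]), kv[1]))
                 .reduceByKey(lambda a, b: a + b))

    print(counts_ts.collect())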