Can you show us the function you passed to reduceByKey()? What release of Spark are you using?
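For a job shaped like the one you describe below, the reduce function would normally just be addition over the per-pair counts. A minimal sketch of that pipeline (messages and tokenize() are stand-ins I'm assuming for your actual RDD and word-ID lookup):

from operator import add

counts = (messages                                                    # RDD of (time, message)
          .flatMap(lambda tm: [(tm[0], w) for w in tokenize(tm[1])])  # (time, word_id)
          .map(lambda tw: (tw, 1))                                    # ((time, word_id), 1)
          .reduceByKey(add))                                          # sum the 1s per key

If your function differs from plain addition (in particular, if it is not associative and commutative), that could explain odd results.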
Cheers

On Sat, Apr 18, 2015 at 8:17 AM, SecondDatke <lovejay-lovemu...@outlook.com> wrote:

> I'm trying to solve a Word-Count-like problem; the difference is that I
> need the count of a specific word within a specific timespan in a social
> message stream.
>
> My data is in the format of (time, message), and I transformed it (flatMap
> etc.) into a series of (time, word_id) pairs, where the time is represented
> with Python's datetime.datetime class. I then transformed these into
> ((time, word_id), 1) and used reduceByKey to get the result.
>
> But the dataset returned is a little weird, like the following:
>
> format:
> ((timespan as datetime.datetime, word_id), freq)
>
> ((datetime.datetime(2009, 10, 6, 2, 0), 0), 8)
> ((datetime.datetime(2009, 10, 6, 3, 0), 0), 3)
> ((datetime.datetime(2009, 10, 6, 3, 0), 0), 14)
>
> As you can see, there are DUPLICATED keys, but as a result of reduceByKey,
> all keys SHOULD BE UNIQUE.
>
> I tried converting the key to a string (like '2006-12-02 21:00:00-000')
> and running reduceByKey again, but the problem stays. The only option left
> seems to be converting the date to a numeric timestamp, and that does work.
>
> Is this expected behavior of reduceByKey (and all other transformations
> that work with keys)?
>
> Currently I'm still working on it.
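For what it's worth, the timestamp workaround you describe would look something like the following sketch (pairs and to_epoch are names I'm assuming; calendar.timegm treats the naive datetime as UTC, so adjust if your times are local):

import calendar

def to_epoch(dt):
    # Interpret the naive datetime as UTC and return integer epoch seconds.
    return calendar.timegm(dt.timetuple())

result = (pairs                                                   # ((time, word_id), 1)
          .map(lambda kv: ((to_epoch(kv[0][0]), kv[0][1]), kv[1]))
          .reduceByKey(lambda a, b: a + b))

Since a numeric key behaves but a datetime/string key does not, seeing the exact reduce function and Spark version would help narrow this down.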