Can you show us the function you passed to reduceByKey()?

What release of Spark are you using?

Cheers

On Sat, Apr 18, 2015 at 8:17 AM, SecondDatke <lovejay-lovemu...@outlook.com>
wrote:

> I'm trying to solve a word-count-like problem; the difference is that I
> need the count of each word within a specific timespan in a social
> message stream.
>
> My data is in the format (time, message), and I transformed it (with
> flatMap etc.) into a series of (time, word_id) pairs, where the time is
> represented with the Python datetime.datetime class. I then transformed
> those into ((time, word_id), 1) and used reduceByKey to get the counts.
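>
> Roughly, the sketch below shows what my code does (extract_words here is
> a hypothetical stand-in for my real tokenizer, and I bucket times to the
> hour):
>
>     # PySpark sketch, not my exact code
>     def hour_bucket(t):
>         # truncate a datetime to the containing hour
>         return t.replace(minute=0, second=0, microsecond=0)
>
>     # messages: RDD of (datetime, text) pairs
>     pairs = messages.flatMap(
>         lambda tm: [((hour_bucket(tm[0]), w), 1)
>                     for w in extract_words(tm[1])])
>     counts = pairs.reduceByKey(lambda a, b: a + b)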
>
> But the returned dataset is a little weird, as the following shows:
>
> format:
> ((time bucket as datetime.datetime, word_id), freq)
>
> ((datetime.datetime(2009, 10, 6, 2, 0), 0), 8)
> ((datetime.datetime(2009, 10, 6, 3, 0), 0), 3)
> ((datetime.datetime(2009, 10, 6, 3, 0), 0), 14)
>
> As you can see, there are DUPLICATED keys, but in the result of
> reduceByKey all keys SHOULD BE UNIQUE.
>
> I tried converting the key to a string (like '2006-12-02 21:00:00-000')
> and running reduceByKey again, but the problem stays. It seems the only
> way left for me is to convert the date to a numeric timestamp, and that
> does work (sketch below).
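>
> For reference, the timestamp workaround looks roughly like this (pairs
> is the ((datetime, word_id), 1) RDD from the sketch above):
>
>     import calendar
>
>     def to_ts(dt):
>         # integer POSIX timestamp (treating the naive datetime as UTC)
>         return calendar.timegm(dt.timetuple())
>
>     # re-key on (int timestamp, word_id) instead of (datetime, word_id)
>     counts = pairs \
>         .map(lambda kv: ((to_ts(kv[0][0]), kv[0][1]), kv[1])) \
>         .reduceByKey(lambda a, b: a + b)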
>
> Is this expected behavior of reduceByKey (and the other transformations
> that work with keys)?
>
> Currently I'm still working on it.
>
