Do these datetime objects implement the notion of equality you'd expect? (This may be a dumb question; I'm thinking of the equivalent of equals() / hashCode() from the Java world.)
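
One way to check this outside Spark is a minimal sketch in plain Python. It only shows how datetime.datetime behaves as a dictionary key within a single interpreter, so it cannot say whether anything changes once keys are serialized and shuffled between workers. The helper to_epoch_seconds is a hypothetical name, and treating the naive datetimes as UTC is an assumption.

```python
from collections import Counter
from datetime import datetime

# Two separately constructed objects for the same hour.
a = datetime(2009, 10, 6, 3, 0)
b = datetime(2009, 10, 6, 3, 0)

print(a == b)              # True: value equality, roughly equals()
print(hash(a) == hash(b))  # True: equal values hash alike, roughly hashCode()

# (datetime, word_id) tuples as keys, mimicking the ((time, word_id), 1) pairs.
pairs = [
    ((a, 0), 1),
    ((b, 0), 1),
    ((datetime(2009, 10, 6, 2, 0), 0), 1),
]

counts = Counter()
for key, n in pairs:
    counts[key] += n

print(len(counts))  # 2: the two 03:00 entries collapse into one key

# Assumption: the datetimes are naive and can be treated as UTC.
EPOCH = datetime(1970, 1, 1)


def to_epoch_seconds(dt):
    """Hypothetical helper: naive datetime -> integer seconds since 1970-01-01."""
    return int((dt - EPOCH).total_seconds())


# The same aggregation keyed on plain integers instead of datetime objects.
int_counts = Counter()
for (dt, word_id), n in pairs:
    int_counts[(to_epoch_seconds(dt), word_id)] += n

print(len(int_counts))  # 2
```

If the in-process counts look right but the reduceByKey output still shows duplicates, the difference is more likely to come from how the keys travel between processes than from datetime equality itself. Integer keys like the ones above sidestep that question.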

On Sat, Apr 18, 2015 at 4:17 PM, SecondDatke <lovejay-lovemu...@outlook.com> wrote:
> I'm trying to solve a Word-Count-like problem. The difference is that
> I need the count of a specific word within a specific timespan in a social
> message stream.
>
> My data is in the format (time, message), and I transformed it (flatMap
> etc.) into a series of (time, word_id), where the time is represented with
> the Python datetime.datetime class. I then transformed it to ((time,
> word_id), 1) and used reduceByKey to get the result.
>
> But the dataset returned is a little weird, like the following:
>
> format:
> ((timespan with datetime.datetime, wordid), freq)
>
> ((datetime.datetime(2009, 10, 6, 2, 0), 0), 8)
> ((datetime.datetime(2009, 10, 6, 3, 0), 0), 3)
> ((datetime.datetime(2009, 10, 6, 3, 0), 0), 14)
>
> As you can see, there are DUPLICATED keys, but as a result of reduceByKey,
> all keys SHOULD BE UNIQUE.
>
> I tried converting the key to a string (like '2006-12-02 21:00:00-000') and
> running reduceByKey again, but the problem stays. It seems the only way left
> is to convert the date to a timestamp, and that does work.
>
> Is this the expected behavior of reduceByKey (and all other transformations
> that work with keys)?
>
> Currently I'm still working on it.