Python tuples are compared lexicographically (that is, two tuples are equal if and only if they contain the same elements in the same positions). I'm not very clear on what's going on inside Spark, but I have tried comparing the duplicate keys in the result with the == operator and hashing them in Python, and they are the same. And as I said, I also tried using a string as the key, which also failed.
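For reference, the local check described above can be sketched roughly as follows. This is only an illustration in plain Python: the key values are copied from the duplicated output quoted below, and epoch_key is a hypothetical helper (not part of Spark) that assumes the datetimes are naive and can be treated as UTC. Note that it only compares keys within a single Python process; it says nothing about whether hashes agree across Spark worker processes.

```python
import calendar
from datetime import datetime


def epoch_key(dt, word_id):
    """Hypothetical helper: swap the datetime for integer seconds since the
    epoch (treating the naive datetime as UTC), so the key is all integers."""
    return (calendar.timegm(dt.timetuple()), word_id)


# Two separately constructed keys with the same contents, mirroring the
# duplicated (time, word_id) key from the quoted output.
k1 = (datetime(2009, 10, 6, 3, 0), 0)
k2 = (datetime(2009, 10, 6, 3, 0), 0)

# Within one Python process, equal tuples compare equal and hash equally.
same_value = k1 == k2
same_hash = hash(k1) == hash(k2)

# The timestamp-based key, as in the workaround mentioned in the quoted message.
e1 = epoch_key(*k1)
e2 = epoch_key(*k2)

print(same_value, same_hash, e1 == e2)
```

Using an integer timestamp here simply mirrors the conversion that worked for me; it is a sketch of that workaround, not an explanation of why the datetime and string keys failed.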
> From: so...@cloudera.com
> Date: Sat, 18 Apr 2015 16:30:59 +0100
> Subject: Re: Does reduceByKey only work properly for numeric keys?
> To: lovejay-lovemu...@outlook.com
> CC: user@spark.apache.org
>
> Do these datetime objects implement a the notion of equality you'd
> expect? (This may be a dumb question; I'm thinking of the equivalent
> of equals() / hashCode() from the Java world.)
>
> On Sat, Apr 18, 2015 at 4:17 PM, SecondDatke
> <lovejay-lovemu...@outlook.com> wrote:
> > I'm trying to solve a Word-Count like problem, the difference lies in that,
> > I need the count of a specific word among a specific timespan in a social
> > message stream.
> >
> > My data is in the format of (time, message), and I transformed (flatMap
> > etc.) it into a series of (time, word_id), the time is represented with
> > Python datetime.datetime class. And I continued to transform it to ((time,
> > word_id), 1) then use reduceByKey for result.
> >
> > But the dataset returned is a little weird, just like the following:
> >
> > format:
> > ((timespan with datetime.datetime, wordid), freq)
> >
> > ((datetime.datetime(2009, 10, 6, 2, 0), 0), 8)
> > ((datetime.datetime(2009, 10, 6, 3, 0), 0), 3)
> > ((datetime.datetime(2009, 10, 6, 3, 0), 0), 14)
> >
> > As you can see, there are DUPLICATED keys, but as a result of reducedByKey,
> > all keys SHOULD BE UNIQUE.
> >
> > I tried to convert the key to a string (like '2006-12-02 21:00:00-000') and
> > reducedByKey again, the problem stays. It seems the only way left for me is
> > convert the date to a timestamp, but this time it works.
> >
> > Is this expected behavior of reduceByKey(and all other transformations that
> > work with keys)?
> >
> > Currently I'm still working on it.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>