I see, that makes a lot of sense now. It is not unique to Spark, but it would be great if it were mentioned in the Spark documentation.
I have been using Hadoop for a while and I was not aware of it!

Zheng zheng

On Thu, Jun 11, 2015 at 7:21 PM, Will Briggs <wrbri...@gmail.com> wrote:

> To be fair, this is a long-standing issue due to optimizations for object
> reuse in the Hadoop API, and isn't necessarily a failing in Spark - see
> this blog post from 2011 documenting a similar issue:
> https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
>
> On June 11, 2015, at 3:17 PM, Sean Owen <so...@cloudera.com> wrote:
>
> Yep, you need to use a transformation of the raw value; use toString, for
> example.
>
> On Thu, Jun 11, 2015, 8:54 PM Crystal Xing <crystalxin...@gmail.com> wrote:
>
>> That is a little scary. So you mean that, in general, we shouldn't use
>> Hadoop's Writables as keys in an RDD?
>>
>> Zheng zheng
>>
>> On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Guess: it has something to do with the Text object being reused by
>>> Hadoop? You can't in general keep references to them around, since they
>>> change. So you may have a bunch of copies of one object at the end that
>>> become just one in each partition.
>>>
>>> On Thu, Jun 11, 2015, 8:36 PM Crystal Xing <crystalxin...@gmail.com> wrote:
>>>
>>>> I load a list of ids from a text file as NLineInputFormat, and when I
>>>> do distinct(), it returns an incorrect number.
>>>>
>>>>     JavaRDD<Text> idListData = jvc
>>>>         .hadoopFile(idList, NLineInputFormat.class,
>>>>             LongWritable.class, Text.class)
>>>>         .values()
>>>>         .distinct();
>>>>
>>>> I should have 7000K distinct values; however, it only returns 7000
>>>> values, which is the same as the number of tasks. The type I am using
>>>> is org.apache.hadoop.io.Text.
>>>>
>>>> However, if I switch to using String instead of Text, it works correctly.
>>>>
>>>> I thought the Text class should have correct implementations of equals()
>>>> and hashCode(), since it is a Hadoop class.
>>>>
>>>> Does anyone have a clue what is going on?
>>>>
>>>> I am using Spark 1.2.
>>>>
>>>> Zheng zheng
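
For reference, a minimal sketch of the workaround Sean describes above: copy each reused Text into an immutable String before calling distinct(). It reuses the jvc JavaSparkContext and idList path from the original post and assumes the Spark 1.2-era Java API:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    // Hadoop's record reader hands back the *same* Text instance for every
    // record and mutates it in place, so copy the value out before any
    // operation that holds references across records (distinct, groupByKey,
    // collect, ...).
    JavaRDD<String> idListData = jvc
        .hadoopFile(idList, NLineInputFormat.class,
            LongWritable.class, Text.class)
        .values()
        .map(new Function<Text, String>() {
          public String call(Text value) {
            return value.toString();   // materializes an immutable copy
          }
        })
        .distinct();

    System.out.println("distinct ids: " + idListData.count());

An equivalent alternative is to copy the value with new Text(value) and keep working with Text keys; Text does implement equals() and hashCode() correctly, and the problem is purely the in-place reuse of a single instance per record.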