Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
ue.
> On June 11, 2015, at 3:17 PM, Sean Owen wrote:
>
> Yep you need to use a transformation of the raw value; use toString for example.
>
> On Thu, Jun 11, 2015, 8:54 PM Crystal Xing wrote:
>
>> That is a little scary.
>> So you mea

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
nd refs to them since they change. So you may have a bunch of copies of one object at the end that become just one in each partition.
>
> On Thu, Jun 11, 2015, 8:36 PM Crystal Xing wrote:
>
>> I load a list of ids from a text file as NLineInputFormat, and when I

Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
I load a list of ids from a text file as NLineInputFormat, and when I call distinct(), it returns an incorrect count. JavaRDD idListData = jvc.hadoopFile(idList, NLineInputFormat.class, LongWritable.class, Text.class).values().distinct(); I should have 7000K
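The replies in this thread attribute the wrong count to record objects being reused while references to them are kept, and suggest transforming the raw value first (for example with toString) before calling distinct(). The following is a minimal plain-Java sketch of that effect, without Spark or Hadoop. `MutableText` and `ReusedRecordDemo` are hypothetical stand-ins invented for illustration, not Hadoop or Spark classes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Hypothetical stand-in for a mutable record object that a reader reuses
// for every line (the role Text plays in the thread). Not a Hadoop class.
class MutableText {
    private String value = "";

    void set(String v) { value = v; }

    @Override public String toString() { return value; }

    @Override public boolean equals(Object o) {
        return o instanceof MutableText && ((MutableText) o).value.equals(value);
    }

    @Override public int hashCode() { return value.hashCode(); }
}

class ReusedRecordDemo {
    // Keeps references to the one reused object, then counts distinct values.
    static int distinctByReference(String[] lines) {
        MutableText record = new MutableText();
        List<MutableText> kept = new ArrayList<>();
        for (String line : lines) {
            record.set(line);   // overwrites the previous content
            kept.add(record);   // same object added every time
        }
        return new HashSet<>(kept).size();
    }

    // Copies the content out with toString() before keeping it.
    static int distinctByCopy(String[] lines) {
        MutableText record = new MutableText();
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            record.set(line);
            kept.add(record.toString());   // immutable copy of the current content
        }
        return new HashSet<>(kept).size();
    }

    public static void main(String[] args) {
        String[] ids = {"a", "b", "a", "c"};
        System.out.println(distinctByReference(ids)); // prints 1
        System.out.println(distinctByCopy(ids));      // prints 3
    }
}
```

In the Spark job from the question, the analogous change would be to map each Text value through toString before distinct(), as suggested in the reply.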

Re: how to map and filter in one step?

2015-02-26 Thread Crystal Xing
wrote:
> You can flatMap:
>
> rdd.flatMap { in =>
>   if (condition(in)) {
>     Some(transformation(in))
>   } else {
>     None
>   }
> }
>
> On Thu, Feb 26, 2015 at 6:39 PM, Crystal Xing wrote:
> > Hi,
> > I have a text file input and I want

how to map and filter in one step?

2015-02-26 Thread Crystal Xing
Hi, I have a text file input and I want to parse it line by line and map each line to another format. At the same time, I want to filter out some lines I do not need. Is there a way to filter out those lines in the map function, or do I have to do it in two steps, filter and then map? In that way,
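The reply in this thread suggests flatMap, returning either one transformed element or nothing per input. A minimal sketch of the same idea on plain Java streams rather than an RDD is shown here. The `parse` rule (keep `key=value` lines, emit the upper-cased key) and the class name are hypothetical, chosen only to make the example concrete.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class MapAndFilterDemo {
    // Hypothetical parser: returns a one-element stream for lines of the
    // form "key=value" (the upper-cased key) and an empty stream otherwise.
    static Stream<String> parse(String line) {
        int eq = line.indexOf('=');
        if (eq <= 0) {
            return Stream.empty();   // line is dropped
        }
        return Stream.of(line.substring(0, eq).trim().toUpperCase());
    }

    // Maps and filters in a single pass: each line yields zero or one result.
    static List<String> parseAll(List<String> lines) {
        return lines.stream()
                .flatMap(MapAndFilterDemo::parse)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a=1", "# comment", "b = 2", "");
        System.out.println(parseAll(lines)); // prints [A, B]
    }
}
```

The design point is that the decision to drop a line and the transformation live in one function, so the input is parsed only once.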

Re: Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
it's all taken care of by the implementation.
>
> On Thu, Feb 12, 2015 at 11:29 PM, Crystal Xing wrote:
> > Hi Sean,
> >
> > I am reading the paper on implicit training.
> >
> > Collaborative Filtering for Implicit Feedback Datasets
> >

Re: Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
automatically takes care of those no-interaction user-product pairs?
On Thu, Feb 12, 2015 at 3:13 PM, Sean Owen wrote:
> Where there is no user-item interaction, you provide no interaction, not an interaction with strength 0. Otherwise your input is fully dense.
>
> On T

Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
Hi, I have some implicit rating data, such as purchase data. I read the paper about the implicit training algorithm used in Spark, and it mentions that for user-product pairs which do not have implicit rating data, such as no purchase, we need to provide the value as 0. This is different fro
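According to the reply in this thread, pairs without any interaction are simply left out of the input rather than supplied with strength 0, since adding them would make the input fully dense. A minimal plain-Java sketch of preparing such input from a purchase-count matrix might look like this. The `Interaction` type and the dense `counts` matrix are hypothetical and used only for illustration; they are not MLlib classes.

```java
import java.util.ArrayList;
import java.util.List;

class ImplicitInputDemo {
    // Hypothetical record of one observed user-product interaction.
    static final class Interaction {
        final int user;
        final int product;
        final double strength;

        Interaction(int user, int product, double strength) {
            this.user = user;
            this.product = product;
            this.strength = strength;
        }
    }

    // counts[u][p] is the number of purchases of product p by user u.
    // Only pairs with at least one purchase are emitted; pairs with no
    // interaction are omitted instead of being listed with strength 0.
    static List<Interaction> observedOnly(int[][] counts) {
        List<Interaction> result = new ArrayList<>();
        for (int u = 0; u < counts.length; u++) {
            for (int p = 0; p < counts[u].length; p++) {
                if (counts[u][p] > 0) {
                    result.add(new Interaction(u, p, counts[u][p]));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] counts = {{2, 0, 0}, {0, 0, 1}};
        System.out.println(observedOnly(counts).size()); // prints 2
    }
}
```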

Re: Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Crystal Xing
mething to do, if you can avoid it architecturally. For example, consider precomputing recommendations only for users whose probability of needing recommendations soon is not very small. Usually, only a small number of users are active.
>
> On Thu, Feb 12, 2015 at 10:26 P

Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Crystal Xing
Hi, I wonder if there is a way to do fast top N product recommendations for all users in training using MLlib's ALS algorithm. I am currently calling public Rating[] recommendProducts(int user,
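To make the cost of per-user recommendation concrete, here is a minimal plain-Java sketch that scores one user against every product and keeps the best N with a bounded heap, and that follows the reply's suggestion of precomputing only for a chosen set of active users. It assumes that a user's score for a product is the dot product of their factor vectors. All names (`TopNDemo`, `topN`, `topNForUsers`) and the in-memory factor arrays are hypothetical; this is not the MLlib implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopNDemo {
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Returns the indices of the n highest-scoring products, best first.
    static List<Integer> topN(double[] userFactors, double[][] productFactors, int n) {
        // Min-heap ordered by score; each entry is {productIndex, score}.
        PriorityQueue<double[]> heap =
                new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[1]));
        for (int p = 0; p < productFactors.length; p++) {
            heap.add(new double[] {p, dot(userFactors, productFactors[p])});
            if (heap.size() > n) {
                heap.poll();   // discard the current lowest score
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add((int) heap.poll()[0]);   // lowest score first
        }
        Collections.reverse(result);            // best first
        return result;
    }

    // Computes top-N lists only for the given (for example, active) users.
    static List<List<Integer>> topNForUsers(double[][] userFactors, int[] activeUsers,
                                            double[][] productFactors, int n) {
        List<List<Integer>> all = new ArrayList<>();
        for (int u : activeUsers) {
            all.add(topN(userFactors[u], productFactors, n));
        }
        return all;
    }

    public static void main(String[] args) {
        double[][] products = {{0.5, 9.0}, {2.0, 0.0}, {1.0, 1.0}};
        System.out.println(topN(new double[] {1.0, 0.0}, products, 2)); // prints [1, 2]
    }
}
```

Each user costs one pass over all products, so the total work grows with users times products; restricting the loop to active users is where the saving comes from.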