Re: Similar Items

2016-09-21 Thread Nick Pentreath
enario, this resulted in 430K document/token tuples. ((1, 2), ['Hockey']) This then tells us that documents 1 and 2 need to be compared to one another (via cosine similarity) because t

Re: Similar Items

2016-09-21 Thread Nick Pentreath
']) This then tells us that documents 1 and 2 need to be compared to one another (via cosine similarity) because they both contain the token 'hockey'. I will investigate the methods that you recommended to see if
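
For readers following the thread, here is a minimal sketch of the idea described above: pair up documents that share at least one token, then score only those pairs with cosine similarity. It is plain Python rather than Spark, and the names `candidate_pairs`, `cosine`, and the TF-IDF weights in `docs` are hypothetical, invented for illustration; it is not the code from the original application.

```python
import math
from collections import defaultdict
from itertools import combinations


def candidate_pairs(docs):
    """Map each pair of document ids to the sorted list of tokens they share."""
    # Inverted index: token -> set of ids of the documents that contain it.
    index = defaultdict(set)
    for doc_id, weights in docs.items():
        for token in weights:
            index[token].add(doc_id)

    # Every two documents listed under the same token become a candidate pair.
    pairs = defaultdict(list)
    for token in sorted(index):
        for a, b in combinations(sorted(index[token]), 2):
            pairs[(a, b)].append(token)
    return dict(pairs)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as token -> weight dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical TF-IDF weights, made up for this example.
docs = {
    1: {"hockey": 0.8, "stick": 0.6},
    2: {"hockey": 0.6, "puck": 0.8},
    3: {"tennis": 1.0},
}

pairs = candidate_pairs(docs)
scores = {pair: cosine(docs[pair[0]], docs[pair[1]]) for pair in pairs}
```

Documents 1 and 2 end up as the only candidate pair because they share 'hockey'; document 3 is never compared with anything. A very common token still produces a quadratic number of pairs for the documents that contain it, which is one reason the tuple count can grow quickly.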

Re: Similar Items

2016-09-20 Thread Peter Figliozzi
Hi all, I'm trying to write a Spark application that will detect similar items (in this case products) based on their descriptions. I've got an ML pipeline that transforms the product data to TF-IDF representation, using the following components. -

Re: Similar Items

2016-09-20 Thread Kevin Mellott
) because they both contain the token 'hockey'. I will investigate the methods that you recommended to see if they may resolve our problem. Thanks, Kevin On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <

Re: Similar Items

2016-09-20 Thread Kevin Mellott
they may resolve our problem. Thanks, Kevin On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath wrote: How many products do you have? How large are your vectors? It could be that SVD / LSA could b

Re: Similar Items

2016-09-20 Thread Nick Pentreath
air similarity with brute force is not going to be scalable. In this case you may want to investigate hashing (LSH) techniques. On Mon, 19 Sep 2016 at 22:49, Kevin Mellott wrote: Hi all, I'
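
As a rough illustration of the hashing (LSH) suggestion, the sketch below uses random hyperplanes, one common LSH family for cosine similarity: each vector gets one bit per hyperplane, and only items that land in the same bucket are compared. This is an assumption about which LSH variant might suit the problem, not something the thread specifies, and `make_hyperplanes`, `signature`, `bucket`, and the sample vectors are all hypothetical names and data written in plain Python.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, planes):
    """One bit per hyperplane: 1 if the vector is on its non-negative side."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def bucket(vectors, planes):
    """Group item ids by signature; only items sharing a bucket get compared."""
    buckets = defaultdict(list)
    for item_id, vec in vectors.items():
        buckets[signature(vec, planes)].append(item_id)
    return dict(buckets)


# Hypothetical dense vectors, made up for this example.
vectors = {
    "a": [1.0, 2.0, 0.0],
    "b": [2.0, 4.0, 0.0],    # same direction as "a"
    "c": [-1.0, -2.0, 0.0],  # opposite direction to "a"
}

planes = make_hyperplanes(num_bits=8, dim=3)
buckets = bucket(vectors, planes)
```

Vectors pointing in the same direction always share a signature, and the smaller the angle between two vectors, the more bits they tend to agree on. In practice several independent hash tables are usually combined so that near neighbours split by a single unlucky hyperplane are still found.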

Re: Similar Items

2016-09-19 Thread Nick Pentreath
Mon, 19 Sep 2016 at 22:49, Kevin Mellott wrote: Hi all, I'm trying to write a Spark application that will detect similar items (in this case products) based on their descriptions. I've got an ML pipeline that transforms the product data to TF-IDF representati

Similar Items

2016-09-19 Thread Kevin Mellott
Hi all, I'm trying to write a Spark application that will detect similar items (in this case products) based on their descriptions. I've got an ML pipeline that transforms the product data to TF-IDF representation, using the following components. - *RegexTokenizer* - strips ou