…scenario, this resulted in 430K document/token tuples.

((1, 2), ['Hockey'])

This then tells us that documents 1 and 2 need to be compared to one
another (via cosine similarity) because they both contain the token
'hockey'. I will investigate the methods that you recommended to see if
they may resolve our problem.
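The candidate-pair step described above can be sketched in plain Python: build an inverted index (token -> documents), emit each co-occurring document pair once with its shared tokens, and run cosine similarity only on those pairs. The toy TF-IDF weights and names here are hypothetical stand-ins, not the actual pipeline's data.

```python
from collections import defaultdict
from itertools import combinations
import math

docs = {
    1: {"hockey": 0.9, "stick": 0.4},   # toy TF-IDF weights
    2: {"hockey": 0.7, "puck": 0.5},
    3: {"soccer": 0.8, "ball": 0.3},
}

# inverted index: token -> set of documents containing it
index = defaultdict(set)
for doc_id, vec in docs.items():
    for token in vec:
        index[token].add(doc_id)

# ((doc_a, doc_b), [shared tokens]) entries, like ((1, 2), ['hockey'])
pairs = defaultdict(list)
for token, ids in index.items():
    for a, b in combinations(sorted(ids), 2):
        pairs[(a, b)].append(token)

def cosine(u, v):
    # cosine similarity of two sparse vectors stored as dicts
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm

# compare only the candidate pairs, never the full cross product
for (a, b), shared in sorted(pairs.items()):
    print((a, b), shared, round(cosine(docs[a], docs[b]), 3))
```

Only documents sharing at least one token ever reach the cosine step, which is exactly why the 430K document/token tuples matter: they bound how many pairs get compared.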

Thanks,
Kevin

On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath wrote:

> How many products do you have? How large are your vectors?
>
> It could be that SVD / LSA could b… …pair similarity with brute force is not
> going to be scalable. In this case you may want to investigate hashing
> (LSH) techniques.
>
> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott wrote:
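The LSH idea mentioned above can be illustrated with random-hyperplane hashing for cosine similarity, sketched in plain Python (the dimensions, bit count, and vectors are hypothetical, and a real system would use banded signatures, e.g. Spark's LSH estimators, rather than this toy version):

```python
import random

random.seed(0)
DIM, BITS = 10, 8

# one random hyperplane per signature bit
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def signature(vec):
    # one bit per hyperplane: the sign of the projection onto it
    return tuple(int(sum(p * x for p, x in zip(plane, vec)) >= 0)
                 for plane in planes)

def hamming(s, t):
    return sum(x != y for x, y in zip(s, t))

a = [1.0] * DIM
b = [1.0] * 9 + [0.5]        # nearly identical to a
c = [-x for x in a]          # points in the opposite direction

# near-duplicates flip few (often zero) bits;
# opposite vectors flip every bit, so they never collide
print(hamming(signature(a), signature(b)))
print(hamming(signature(a), signature(c)))  # = BITS: every bit differs
```

Vectors whose signatures collide become candidate pairs, so most of the all-pairs comparisons Nick warns about are skipped entirely.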
>
>> Hi all,
>>
>> I'm trying to write a Spark application that will detect similar items (in
>> this case products) based on their descriptions. I've got an ML pipeline
>> that transforms the product data to TF-IDF representation, using the
>> following components.
>>
>> - *RegexTokenizer* - strips ou…
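The (truncated) pipeline above goes description -> tokens -> TF-IDF. A rough Spark-free sketch of what that transformation computes, in plain Python — the regex, the toy descriptions, and the smoothed IDF weighting are assumptions standing in for the actual RegexTokenizer / IDF stages:

```python
import math
import re

descriptions = [
    "Pro hockey stick, carbon fiber",
    "Junior hockey stick",
    "Soccer ball, size 5",
]

# RegexTokenizer stand-in: lowercase, keep word characters only
token_lists = [re.findall(r"\w+", d.lower()) for d in descriptions]

# document frequency per token
df = {}
for tokens in token_lists:
    for t in set(tokens):
        df[t] = df.get(t, 0) + 1

n_docs = len(descriptions)

def tfidf(tokens):
    # term frequency times smoothed inverse document frequency
    return {t: tokens.count(t) * math.log((n_docs + 1) / (df[t] + 1))
            for t in set(tokens)}

vectors = [tfidf(tokens) for tokens in token_lists]
print(vectors[0])
```

Common tokens ('hockey', 'stick') get low weights and distinctive ones ('pro') get high weights, which is what makes the downstream cosine comparisons meaningful.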