Re: How to binarize data in spark

Yanbo Liang Thu, 06 Aug 2015 22:38:01 -0700

I think you want to flatten the 1M products into a vector of 1M elements,
most of which are zero, of course.
It looks like HashingTF
<https://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf>
can help you.
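
To make the idea concrete, here is a minimal sketch in plain Python of the
hashing trick that HashingTF is built on. It is not Spark's implementation:
the names hash_index and binarize, the default num_features, and the use of
zlib.crc32 as the hash function are all assumptions chosen for illustration,
so the indices will not match what Spark produces. It groups the rows by user
and keeps only the indices of the non-zero entries, so the dense vector with
the 0s is never built.

```python
import zlib
from collections import defaultdict


def hash_index(term, num_features):
    # Hypothetical helper: deterministic hash of the term, folded into
    # the range [0, num_features). Spark uses a different hash function.
    return zlib.crc32(term.encode("utf-8")) % num_features


def binarize(rows, num_features=1 << 20):
    """Group (user, label, product) rows into one sparse binary vector per user.

    Returns {user: (label, sorted list of active indices)}. Every listed
    index has value 1; all other positions are implicitly 0.
    """
    labels = {}
    active = defaultdict(set)
    for user, label, product in rows:
        labels[user] = label
        active[user].add(hash_index(product, num_features))
    return {user: (labels[user], sorted(active[user])) for user in labels}


# Sample rows taken from the original question.
rows = [
    ("user1", "class1", "product1"),
    ("user1", "class1", "product2"),
    ("user1", "class1", "product5"),
    ("user2", "class1", "product2"),
    ("user2", "class1", "product5"),
    ("user3", "class2", "product1"),
]

vectors = binarize(rows)
```

Each (label, indices) pair holds what a sparse vector constructor needs: the
vector size is num_features, the indices are already sorted, and every value
is 1.0. Distinct products can collide on the same index; a larger
num_features makes that less likely.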


2015-08-07 11:02 GMT+08:00 praveen S <mylogi...@gmail.com>:

> Use StringIndexer in MLlib 1.4:
>
> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/ml/feature/StringIndexer.html
>
> On Thu, Aug 6, 2015 at 8:49 PM, Adamantios Corais <
> adamantios.cor...@gmail.com> wrote:
>
>> I have a set of data based on which I want to create a classification
>> model. Each row has the following form:
>>
>>> user1,class1,product1
>>> user1,class1,product2
>>> user1,class1,product5
>>> user2,class1,product2
>>> user2,class1,product5
>>> user3,class2,product1
>>> etc
>>
>>
>> There are about 1M users, 2 classes, and 1M products. What I would like
>> to do next is create the sparse vectors (something already supported by
>> MLlib) BUT in order to apply that function I have to create the dense vectors
>> (with the 0s), first. In other words, I have to binarize my data. What's
>> the easiest (or most elegant) way of doing that?
>>
>>
>> *// Adamantios*
>>
>>
>>
>
