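I think you want to flatten the 1M products into a vector of 1M elements, most of which will of course be zero. It looks like HashingTF <https://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf> can help you: it hashes each term (here, each product) to an index in a fixed-size sparse vector, so you should not need to build the dense vectors first.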
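
To make the idea concrete, here is a rough sketch in plain Python of the hashing approach. It is not Spark code and not HashingTF itself, only an illustration of the same trick under a few assumptions of mine: the names NUM_FEATURES, feature_index and binarize are made up for this example, CRC32 is an arbitrary choice of stable hash, and the rows are the sample rows from the original question.

```python
import zlib
from collections import defaultdict

# Assumed size of the hashed feature space (about 1M slots, matching ~1M products).
NUM_FEATURES = 1 << 20


def feature_index(product, num_features=NUM_FEATURES):
    """Map a product name to a slot in [0, num_features) using a stable hash."""
    return zlib.crc32(product.encode("utf-8")) % num_features


def binarize(rows, num_features=NUM_FEATURES):
    """Turn (user, label, product) rows into one sparse binary vector per user.

    Returns {user: (label, sorted_indices)}. Every listed index stands for a
    value of 1.0 and every other position is implicitly 0, so no dense vector
    is ever materialized.
    """
    labels = {}
    indices = defaultdict(set)
    for user, label, product in rows:
        labels[user] = label  # each user is assumed to belong to a single class
        indices[user].add(feature_index(product, num_features))
    return {user: (labels[user], sorted(idx)) for user, idx in indices.items()}


# Sample rows taken from the original question.
rows = [
    ("user1", "class1", "product1"),
    ("user1", "class1", "product2"),
    ("user1", "class1", "product5"),
    ("user2", "class1", "product2"),
    ("user2", "class1", "product5"),
    ("user3", "class2", "product1"),
]

vectors = binarize(rows)
for user, (label, idx) in sorted(vectors.items()):
    print(user, label, idx)
```

Each (label, indices) pair could then be turned into an MLlib sparse vector, for example with Vectors.sparse(NUM_FEATURES, indices, [1.0] * len(indices)) in PySpark. Keep in mind that two different products can hash to the same slot; if that matters for your model, an exact product-to-index mapping such as the one StringIndexer builds avoids collisions at the cost of an extra pass over the data.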

2015-08-07 11:02 GMT+08:00 praveen S <mylogi...@gmail.com>:

> Use StringIndexer in MLib1.4 :
>
> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/ml/feature/StringIndexer.html
>
> On Thu, Aug 6, 2015 at 8:49 PM, Adamantios Corais <
> adamantios.cor...@gmail.com> wrote:
>
>> I have a set of data based on which I want to create a classification
>> model. Each row has the following form:
>>
>> user1,class1,product1
>>> user1,class1,product2
>>> user1,class1,product5
>>> user2,class1,product2
>>> user2,class1,product5
>>> user3,class2,product1
>>> etc
>>
>>
>> There are about 1M users, 2 classes, and 1M products. What I would like
>> to do next is create the sparse vectors (something already supported by
>> MLlib) BUT in order to apply that function I have to create the dense vectors
>> (with the 0s), first. In other words, I have to binarize my data. What's
>> the easiest (or most elegant) way of doing that?
>>
>>
>> *// Adamantios*