Hi Guillaume,
This is always an option. However, I read about HashingTF, which does exactly this quite efficiently and also scales. Hence, I am looking for a solution using this technique.

regards
Bala

On 5 November 2015 at 18:50, tog <guillaume.all...@gmail.com> wrote:

> Hi Bala
>
> Can't you build a simple dictionary and map those values to numbers?
>
> Cheers
> Guillaume
>
> On 5 November 2015 at 09:54, Balachandar R.A. <balachandar...@gmail.com>
> wrote:
>
>> Hi
>>
>> I am new to Spark MLlib and machine learning. I have a CSV file that
>> consists of around 100 thousand rows and 20 columns. Of these 20 columns,
>> 10 contain string values. The values in these columns are not necessarily
>> unique. They are categorical, that is, each value is one among, say, 10
>> possible values. To start with, I could run the examples, especially the
>> random forest algorithm, on my local Spark (1.5.1) platform. However, I
>> have a challenge with my dataset due to these strings, as the APIs take
>> numerical values. Can anyone tell me how I can map these categorical
>> values (strings) to numbers and use them with random forest algorithms?
>> Any example will be greatly appreciated.
>>
>> regards
>>
>> Bala
>
> --
> PGP KeyID: 2048R/EA31CFC9 subkeys.pgp.net
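
For reference, the two approaches discussed in this thread (a plain dictionary versus the hashing idea behind HashingTF) can be sketched in plain Python. This is only an illustration and does not call Spark: the helper names build_index and hash_index, the sample column, and the bucket count of 1024 are all made up for the example, and zlib.crc32 merely stands in for whatever hash function HashingTF uses internally.

```python
import zlib


def build_index(values):
    """Dictionary approach: give each distinct string the next free integer."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index


def hash_index(value, num_buckets):
    """Hashing approach: derive a bucket number from the string itself,
    so no dictionary has to be built or stored. Distinct strings can
    collide in the same bucket. crc32 is an assumption for this sketch."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


# Hypothetical categorical column with three distinct values.
column = ["red", "green", "red", "blue", "green"]

index = build_index(column)
encoded = [index[v] for v in column]            # dense codes, no collisions
hashed = [hash_index(v, 1024) for v in column]  # bucket ids in [0, 1024)

print(encoded)
print(hashed)
```

With only around 10 distinct values per column, the dictionary gives small, collision-free category codes. The hashed variant needs no lookup table, but two different strings may land in the same bucket, which would merge two categories.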