If your corpus is large (NLP), this is indeed the best solution; otherwise (a few words, i.e. categories), I guess you will end up with the same result.
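For the large-corpus case, the idea behind HashingTF (the hashing trick) can be sketched in plain Python, with no Spark installation needed. Note the stand-ins: `num_features=16` and the CRC32 hash are assumptions for illustration — Spark 1.5's HashingTF uses the term's hashCode and a much larger default feature count (2^20):

```python
import zlib

def hash_index(value, num_features=16):
    # Map a categorical string to a fixed bucket via a stable hash.
    # CRC32 stands in for Spark's internal hash function.
    return zlib.crc32(value.encode("utf-8")) % num_features

def hash_vector(values, num_features=16):
    # Term-frequency vector over hash buckets; no dictionary is
    # ever built, which is why this scales to huge vocabularies.
    vec = [0] * num_features
    for v in values:
        vec[hash_index(v, num_features)] += 1
    return vec

row = ["red", "large", "cotton"]
print(hash_vector(row))
```

The trade-off is that distinct values can collide into the same bucket; with only ~10 categories per column, a small explicit dictionary avoids collisions entirely and gives the same result, hence the comment above.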
On Friday, 6 November 2015, Balachandar R.A. <balachandar...@gmail.com> wrote:

> Hi Guillaume,
>
> This is always an option. However, I read about HashingTF, which does
> exactly this quite efficiently and can scale too. Hence, I am looking for
> a solution using this technique.
>
> regards
> Bala
>
> On 5 November 2015 at 18:50, tog <guillaume.all...@gmail.com> wrote:
>
>> Hi Bala
>>
>> Can't you do a simple dictionary and map those values to numbers?
>>
>> Cheers
>> Guillaume
>>
>> On 5 November 2015 at 09:54, Balachandar R.A. <balachandar...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am new to Spark MLlib and machine learning. I have a CSV file that
>>> consists of around 100 thousand rows and 20 columns. Of these 20
>>> columns, 10 contain string values. The values in these columns are not
>>> necessarily unique; they are categorical, that is, each value is one
>>> of, say, 10 possibilities. To start with, I could run the examples, in
>>> particular the random forest algorithm, on my local Spark (1.5.1)
>>> platform. However, I have a challenge with my dataset because of these
>>> strings, as the APIs take numerical values. Can anyone tell me how I
>>> can map these categorical values (strings) into numbers and use them
>>> with random forest algorithms? Any example will be greatly appreciated.
>>>
>>> regards
>>>
>>> Bala
>>>
>>
>> --
>> PGP KeyID: 2048R/EA31CFC9 subkeys.pgp.net
>>

--
PGP KeyID: 2048R/EA31CFC9 subkeys.pgp.net
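For the few-categories case asked about above, Guillaume's dictionary suggestion can be sketched in plain Python. This is only a simplified stand-in: it assigns codes in first-seen order, whereas Spark's `StringIndexer` (in the `ml` package since 1.4) orders labels by descending frequency:

```python
def build_index(values):
    # Assign each distinct category a consecutive integer code,
    # in first-seen order.
    index = {}
    for v in values:
        if v not in index:
            index[v] = len(index)
    return index

column = ["red", "blue", "red", "green", "blue"]
index = build_index(column)
encoded = [index[v] for v in column]
print(index)    # {'red': 0, 'blue': 1, 'green': 2}
print(encoded)  # [0, 1, 0, 2, 1]
```

With roughly 10 distinct values per column, this mapping is tiny, collision-free, and reversible, so the encoded column can be fed straight to the numeric MLlib APIs (random forest additionally accepts a `categoricalFeaturesInfo` map telling it which columns are categorical).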