If your corpus is large (NLP), this is indeed the best solution; otherwise
(few words, i.e. categories) I guess you will end up with the same result.
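
(For the large-vocabulary case, a minimal sketch of the HashingTF route
discussed below, using the spark.mllib API in 1.5 -- the `terms` RDD holding
the string values of each row is just an illustrative name:)

    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // terms: RDD[Seq[String]] -- the string values of each CSV row (illustrative)
    val hashingTF = new HashingTF(1 << 10)   // number of hash buckets, tune as needed

    // each row of strings becomes a sparse numeric vector of hashed term frequencies
    val features: RDD[Vector] = hashingTF.transform(terms)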

On Friday, 6 November 2015, Balachandar R.A. <balachandar...@gmail.com>
wrote:

> Hi Guillaume,
>
>
> This is always an option. However, I read about HashingTF, which does
> exactly this quite efficiently and can scale too. Hence, I am looking for a
> solution using this technique.
>
>
> regards
> Bala
>
>
> On 5 November 2015 at 18:50, tog <guillaume.all...@gmail.com> wrote:
>
>> Hi Bala
>>
>> Can't you do a simple dictionary and map those values to numbers?
>>
>> Cheers
>> Guillaume
>>
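
(A minimal sketch of the dictionary idea Guillaume suggests above, on an RDD --
the `column` RDD name is purely illustrative:)

    import org.apache.spark.rdd.RDD

    // column: RDD[String] -- one categorical column pulled out of the CSV (illustrative)
    val dict: Map[String, Double] =
      column.distinct().collect().zipWithIndex
        .map { case (value, idx) => value -> idx.toDouble }
        .toMap

    // replace each string by its numeric index
    val encoded: RDD[Double] = column.map(dict)

(With DataFrames, org.apache.spark.ml.feature.StringIndexer does essentially
the same mapping.)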
>> On 5 November 2015 at 09:54, Balachandar R.A. <balachandar...@gmail.com> wrote:
>>
>>> Hi
>>>
>>>
>>> I am new to Spark MLlib and machine learning. I have a CSV file that
>>> consists of around 100 thousand rows and 20 columns. Of these 20 columns,
>>> 10 contain string values. The values in these columns are not necessarily
>>> unique; they are categorical, i.e. each column takes one of roughly 10
>>> distinct values. To start with, I could run the examples, in particular the
>>> random forest algorithm, on my local Spark (1.5.1) installation. However, I
>>> have a challenge with my dataset because of these strings, as the APIs take
>>> numerical values. Can anyone tell me how I can map these categorical
>>> values (strings) to numbers and use them with the random forest algorithm?
>>> Any example will be greatly appreciated.
>>>
>>>
>>> regards
>>>
>>> Bala
>>>
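
(Not the only way, but a minimal sketch of the last step Bala asks about,
using the spark.mllib RDD API -- the `indexedRows` RDD, the column layout and
all parameter values are assumptions for illustration:)

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest

    // indexedRows: RDD[Array[Double]] -- rows after the string columns have been
    // mapped to 0-based indices (e.g. with the dictionary sketch above); here the
    // first element is assumed to be the label and the rest the features
    val data = indexedRows.map(row => LabeledPoint(row.head, Vectors.dense(row.tail)))

    // tell the trees which feature indices are categorical and how many distinct
    // values each one has, so they are not treated as ordered numbers
    val categoricalFeaturesInfo = Map(0 -> 10, 1 -> 10)  // e.g. features 0 and 1 have 10 categories

    val model = RandomForest.trainClassifier(
      data,
      2,                        // numClasses (assuming a binary label)
      categoricalFeaturesInfo,
      50,                       // numTrees
      "auto",                   // featureSubsetStrategy
      "gini",                   // impurity
      5,                        // maxDepth
      32)                       // maxBins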
>>
>>
>>
>> --
>> PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
>>
>
>

-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
