Re: Spark random forest - string data

2015-01-16 Thread Andy Twigg
Hi Asaf, featurestream [1] is an internal project I'm playing with that includes support for some of this, in particular:
* 1-pass random forest construction
* schema inference
* native support for text fields
Would this be of interest? It's not open source, but if there's sufficient demand I can

Re: Spark random forest - string data

2015-01-16 Thread Nick Allen
An alternative approach would be to translate your categorical variables into dummy variables. If your strings represent N classes/categories you would generate N-1 dummy variables containing 0/1 values. Auto-magically creating dummy variables from categorical data definitely comes in handy. I a
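
A minimal sketch of the dummy-variable idea described above, in plain Python with no Spark dependency. The function name `dummy_encode` and the choice of the alphabetically first category as the dropped baseline are assumptions made for illustration, not something the thread specifies.

```python
def dummy_encode(values):
    """Turn a list of category strings into N-1 columns of 0/1 values.

    Hypothetical helper: the alphabetically first category is treated as
    the baseline and gets no column of its own (all zeros).
    """
    categories = sorted(set(values))
    kept = categories[1:]  # drop the baseline category, leaving N-1 columns
    return [[1.0 if v == c else 0.0 for c in kept] for v in values]


colors = ["red", "green", "blue", "green"]
for color, row in zip(colors, dummy_encode(colors)):
    print(color, row)
```

With three distinct categories, each row has two columns; the baseline category ("blue" here) is the row where both are zero.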

Re: Spark random forest - string data

2015-01-16 Thread Sean Owen
The implementation accepts an RDD of LabeledPoint only, so you couldn't feed in strings from a text file directly. LabeledPoint is a wrapper around double values rather than strings. How were you trying to create the input then? No, it only accepts numeric values, although you can encode categoric
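
One way to get from string columns to the all-numeric rows that LabeledPoint expects is to replace each category with a numeric index before building the points. The sketch below does this in plain Python; `build_index` and `to_numeric_rows` are hypothetical helper names, and the (label, features) tuples only stand in for LabeledPoint. The returned arity map has the same shape as the `categoricalFeaturesInfo` argument of MLlib's `RandomForest.trainClassifier` (feature index to number of categories), though whether that fits your setup is an assumption to check against the MLlib documentation.

```python
def build_index(values):
    """Assign each distinct string a numeric index (sorted for determinism)."""
    return {cat: float(i) for i, cat in enumerate(sorted(set(values)))}


def to_numeric_rows(records, categorical_cols):
    """Convert (label, features) records into all-numeric rows.

    records: list of (label, [feature, ...]); features may be strings.
    categorical_cols: positions of the features to index-encode.
    Returns (rows, arity), where arity maps column -> number of categories.
    """
    indexes = {
        col: build_index([feats[col] for _, feats in records])
        for col in categorical_cols
    }
    rows = []
    for label, feats in records:
        numeric = [
            indexes[i][f] if i in indexes else float(f)  # float() raises on non-numeric text
            for i, f in enumerate(feats)
        ]
        rows.append((float(label), numeric))
    arity = {col: len(idx) for col, idx in indexes.items()}
    return rows, arity


records = [(1, ["red", "3.5"]), (0, ["blue", "1.0"]), (1, ["red", "2.0"])]
rows, arity = to_numeric_rows(records, categorical_cols=[0])
print(rows)
print(arity)
```

Any column not listed in `categorical_cols` is parsed with `float()`, so an unencoded string there fails in the same spirit as the NumberFormatException reported in the original question.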

Spark random forest - string data

2015-01-16 Thread Asaf Lahav
Hi, I have been playing around with the new version of the Spark MLlib random forest implementation and, in the process, tried it on a file with string features. Training fails with: java.lang.NumberFormatException: For input string. Is MLlib Random forest adapted to run on top of