Hi Asaf,
featurestream [1] is an internal project I'm playing with that includes
support for some of this, in particular:
* 1-pass random forest construction
* schema inference
* native support for text fields
Would this be of interest? It's not open source, but if there's sufficient
demand I can
An alternative approach would be to translate your categorical variables
into dummy variables. If your strings represent N classes/categories you
would generate N-1 dummy variables containing 0/1 values.
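
To illustrate the N-1 dummy variable idea, here is a minimal sketch in plain Python. The helper name dummy_encode and the sample data are made up for this example, and choosing the alphabetically first category as the dropped reference level is an assumption, not something the thread prescribes.

```python
def dummy_encode(values):
    """Encode categorical strings as N-1 dummy (0/1) columns."""
    values = list(values)
    # Sorted distinct categories; the first one is the reference level
    # and gets no column of its own (it is the all-zeros row).
    categories = sorted(set(values))
    kept = categories[1:]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


# Hypothetical sample data: 3 categories -> 2 dummy columns.
colors = ["red", "green", "blue", "green"]
columns, encoded = dummy_encode(colors)
print(columns)
for color, row in zip(colors, encoded):
    print(color, row)
```

Each row can then be used as ordinary numeric features in place of the original string column.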
Auto-magically creating dummy variables from categorical data definitely
comes in handy. I a
The implementation accepts an RDD of LabeledPoint only, so you
couldn't feed in strings from a text file directly. LabeledPoint is a
wrapper around double values rather than strings. How were you trying
to create the input then?
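
As a rough sketch of what the input preparation could look like, the plain-Python snippet below turns comma-separated text lines into (label, features) pairs made only of doubles, replacing a string column with a numeric index. The file layout, the helper names build_index and to_point, and the sample lines are all assumptions for illustration. Wrapping each pair in a LabeledPoint and telling the trainer which features are categorical is left out here.

```python
def build_index(values):
    """Map each distinct string to a double-valued index (0.0, 1.0, ...)."""
    return {v: float(i) for i, v in enumerate(sorted(set(values)))}


def to_point(line, indexes):
    """Turn 'label,feat1,feat2,...' into (label, [doubles]).

    indexes maps a feature position to its string-to-double lookup table;
    every other feature is parsed as a plain number.
    """
    fields = line.split(",")
    label = float(fields[0])
    features = []
    for col, raw in enumerate(fields[1:]):
        if col in indexes:
            features.append(indexes[col][raw])
        else:
            features.append(float(raw))
    return label, features


# Hypothetical input: label, a string feature, a numeric feature.
lines = ["1,sunny,21.5", "0,rainy,14.0", "1,sunny,19.0"]
weather_index = build_index(line.split(",")[1] for line in lines)
points = [to_point(line, {0: weather_index}) for line in lines]
print(points)
```

Calling float() directly on a field such as "sunny" fails in Python too, which is the same kind of problem as the NumberFormatException reported in this thread.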
No, it only accepts numeric values, although you can encode
categoric
Hi,
I have been playing around with the new Random Forest implementation in
Spark MLlib and, in the process, tried it on a file with string
features.
While training, it fails with:
java.lang.NumberFormatException: For input string.
Is MLlib Random forest adapted to run on top of