Hi Asaf,
featurestream [1] is an internal project I'm playing with that includes
support for some of this, in particular:
* 1-pass random forest construction
* schema inference
* native support for text fields
Would this be of interest? It's not open source, but if there's sufficient
demand I can
An alternative approach would be to translate your categorical variables
into dummy variables. If your strings represent N classes/categories, you
would generate N-1 dummy variables containing 0/1 values; the remaining
category is the reference level and is encoded as all zeros.
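
To make the dummy-variable idea concrete, here is a minimal sketch in plain Python (no Spark involved). The function name dummy_encode and the choice of the alphabetically first category as the reference level are my own assumptions for illustration, not anything from a library.

```python
def dummy_encode(values):
    """Encode category strings as N-1 dummy (0/1) columns.

    Hypothetical helper for illustration: the alphabetically first
    category is treated as the reference level and gets no column.
    """
    categories = sorted(set(values))   # the N distinct categories
    kept = categories[1:]              # N-1 categories get a column each
    rows = [[1.0 if v == c else 0.0 for c in kept] for v in values]
    return kept, rows


colors = ["red", "green", "blue", "green"]
columns, encoded = dummy_encode(colors)

for value, row in zip(colors, encoded):
    print(value, dict(zip(columns, row)))
```

With three distinct colors this yields two columns; "blue", the reference level, comes out as a row of zeros.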
Auto-magically creating dummy variables from categorical data definitely
comes in handy. I a
The implementation accepts an RDD of LabeledPoint only, so you
couldn't feed in strings from a text file directly. LabeledPoint is a
wrapper around double values rather than strings. How were you trying
to create the input then?
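
For illustration, a rough sketch of the conversion step that would have to happen before training: turning text lines into a double label plus double features. It is plain Python so it runs without Spark. The Point tuple is only a stand-in for LabeledPoint, and the "label,category,number" line layout and the helper names are assumptions I made up for the example.

```python
from collections import namedtuple

# Stand-in for LabeledPoint (assumption): a double label plus double features.
Point = namedtuple("Point", ["label", "features"])


def build_index(lines, col):
    """Map each distinct string in column `col` to a numeric index."""
    distinct = sorted({line.split(",")[col] for line in lines})
    return {s: float(i) for i, s in enumerate(distinct)}


def to_point(line, index, col):
    """Parse one 'label,category,number' line into an all-numeric Point."""
    fields = line.split(",")
    label = float(fields[0])
    features = [
        index[f] if i == col else float(f)
        for i, f in enumerate(fields)
        if i != 0  # field 0 is the label, not a feature
    ]
    return Point(label, features)


lines = ["1,red,3.5", "0,blue,1.0", "1,green,2.25"]
index = build_index(lines, col=1)
points = [to_point(line, index, col=1) for line in lines]
```

In Spark the same mapping would be applied over the RDD of lines, constructing LabeledPoint objects instead of Point. Note that a plain index imposes an artificial ordering on the categories, so dummy variables may be the safer encoding unless the learner is told the feature is categorical.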
No, it only accepts numeric values, although you can encode
categoric