Re: Spark random forest - string data

2015-01-16 Thread Andy Twigg

Hi Asaf, featurestream [1] is an internal project I'm playing with that includes support for some of this, in particular:
* 1-pass random forest construction
* schema inference
* native support for text fields
Would this be of interest? It's not open source, but if there's sufficient demand I can

Re: Spark random forest - string data

2015-01-16 Thread Nick Allen

An alternative approach would be to translate your categorical variables into dummy variables. If your strings represent N classes/categories you would generate N-1 dummy variables containing 0/1 values. Auto-magically creating dummy variables from categorical data definitely comes in handy. I a
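To make the N-1 dummy variable idea concrete, here is a minimal sketch in plain Python. The function name `dummy_encode` and the choice of the alphabetically first category as the dropped baseline are assumptions made for illustration; they are not part of Spark or of the approach as described in the thread.

```python
def dummy_encode(values):
    """Encode strings as N-1 dummy variables holding 0.0/1.0.

    Assumption: the alphabetically first category is the baseline
    and is represented by a row of all zeros.
    """
    categories = sorted(set(values))
    kept = categories[1:]  # drop the baseline, leaving N-1 columns
    rows = [[1.0 if value == category else 0.0 for category in kept]
            for value in values]
    return kept, rows


kept, rows = dummy_encode(["red", "green", "blue", "green"])
for value, row in zip(["red", "green", "blue", "green"], rows):
    print(value, row)
```

With three categories this yields two columns, and the baseline category ("blue" here) maps to a row of zeros.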

Re: Spark random forest - string data

2015-01-16 Thread Sean Owen

The implementation accepts an RDD of LabeledPoint only, so you couldn't feed in strings from a text file directly. LabeledPoint is a wrapper around double values rather than strings. How were you trying to create the input then? No, it only accepts numeric values, although you can encode categoric
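As a rough illustration of the point about numeric-only input, the sketch below maps string-valued columns to numeric category indices before any training step. It is plain Python rather than Spark code, and `encode_records` is a hypothetical helper, not an MLlib function. Each encoded pair would still need to be wrapped in a LabeledPoint and parallelized into an RDD; the returned arity map has the shape that MLlib's `categoricalFeaturesInfo` argument expects (feature index to number of categories), assuming that parameter is used.

```python
def encode_records(records, categorical_cols):
    """Turn (label, [raw string fields]) records into all-numeric rows.

    Hypothetical helper: columns listed in categorical_cols get an
    integer-valued index per distinct string; other columns are parsed
    as floats.
    """
    mappings = {col: {} for col in categorical_cols}
    encoded = []
    for label, features in records:
        row = []
        for i, value in enumerate(features):
            if i in mappings:
                # Assign the next unused index the first time a string is seen.
                index = mappings[i].setdefault(value, float(len(mappings[i])))
                row.append(index)
            else:
                row.append(float(value))
        encoded.append((float(label), row))
    # Feature index -> number of distinct categories seen in that column.
    arity = {col: len(mapping) for col, mapping in mappings.items()}
    return encoded, arity


records = [(1, ["red", "3.5"]), (0, ["blue", "1.0"]), (1, ["red", "2.0"])]
encoded, arity = encode_records(records, categorical_cols=[0])
```

In a real job the string-to-index mapping would have to be built consistently across the whole dataset (and reused at prediction time), which this single-machine sketch does not address.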