Hi Asaf,

featurestream [1] is an internal project I'm playing with that includes support for some of this, in particular:

* 1-pass random forest construction
* schema inference
* native support for text fields

Would this be of interest? It's not open source, but if there's sufficient demand I can get access to it.

[1] https://github.com/featurestream

On 16 January 2015 at 13:59, Nick Allen <n...@nickallen.org> wrote:

> An alternative approach would be to translate your categorical variables
> into dummy variables. If your strings represent N classes/categories you
> would generate N-1 dummy variables containing 0/1 values.
>
> Auto-magically creating dummy variables from categorical data definitely
> comes in handy. I assume this is what SPARK-1216 is referring to, but I am
> not sure from the description.
>
> https://issues.apache.org/jira/browse/SPARK-1216
>
> Auto-magically doing the scheme that Sean mentioned is referenced in
> SPARK-4081, I believe.
>
> https://issues.apache.org/jira/browse/SPARK-4081
>
> On Fri, Jan 16, 2015 at 4:45 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> The implementation accepts an RDD of LabeledPoint only, so you
>> couldn't feed in strings from a text file directly. LabeledPoint is a
>> wrapper around double values rather than strings. How were you trying
>> to create the input then?
>>
>> No, it only accepts numeric values, although you can encode
>> categorical values as 0, 1, 2 ... and tell the implementation which
>> features are categorical so that it treats them as such.
>>
>> On Fri, Jan 16, 2015 at 9:25 PM, Asaf Lahav <asaf.la...@gmail.com> wrote:
>> > Hi,
>> >
>> > I have been playing around with the new version of the Spark MLlib
>> > Random forest implementation and, in the process, tried it with a
>> > file with String features.
>> > While training, it fails with:
>> > java.lang.NumberFormatException: For input string.
>> >
>> > Is MLlib Random forest adapted to run on top of numeric data only?
>> >
>> > Thanks
>
> --
> Nick Allen <n...@nickallen.org>
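
For reference, the two encodings discussed in this thread (Sean's 0, 1, 2 ... index scheme and Nick's N-1 dummy variables) can be sketched in plain Python as below. This is only a rough illustration under stated assumptions: the helper names index_encode and dummy_encode are hypothetical and are not part of MLlib, and the {column index: number of categories} map returned by the first helper is, as far as I know, the shape that the categoricalFeaturesInfo argument of MLlib's RandomForest training methods expects.

```python
from typing import Dict, List, Sequence, Tuple


def index_encode(
    rows: Sequence[Sequence[str]], categorical_cols: Sequence[int]
) -> Tuple[List[List[float]], Dict[int, int]]:
    """Replace string categories with 0.0, 1.0, 2.0, ... per column.

    Hypothetical helper (not an MLlib function). Returns the encoded rows
    plus a {column index: number of categories} map.
    """
    # One string -> index lookup per categorical column, filled in first-seen order.
    lookups: Dict[int, Dict[str, int]] = {c: {} for c in categorical_cols}
    encoded: List[List[float]] = []
    for row in rows:
        out: List[float] = []
        for col, value in enumerate(row):
            if col in lookups:
                lookup = lookups[col]
                # Reuse the existing index, or assign the next free one.
                out.append(float(lookup.setdefault(value, len(lookup))))
            else:
                # Non-categorical columns are assumed to hold numeric strings.
                out.append(float(value))
        encoded.append(out)
    arity = {c: len(m) for c, m in lookups.items()}
    return encoded, arity


def dummy_encode(values: Sequence[str]) -> List[List[float]]:
    """Turn one categorical column with N categories into N-1 0/1 columns.

    Hypothetical helper. The first category seen is the baseline and maps
    to all zeros.
    """
    categories: List[str] = []
    for v in values:
        if v not in categories:
            categories.append(v)
    non_baseline = categories[1:]
    return [[1.0 if v == c else 0.0 for c in non_baseline] for v in values]


if __name__ == "__main__":
    sample = [["red", "1.5"], ["green", "2.0"], ["red", "0.5"], ["blue", "3.0"]]
    features, arity = index_encode(sample, categorical_cols=[0])
    print(features, arity)
    print(dummy_encode([row[0] for row in sample]))
```

With the index scheme, each encoded row would still need to be wrapped in a LabeledPoint and the arity map passed along at training time; with dummy variables, every feature is already a plain 0/1 number and no extra map is needed.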