Hi Asaf,

featurestream [1] is an internal project I'm playing with that includes
support for some of this, in particular:
* 1-pass random forest construction
* schema inference
* native support for text fields

Would this be of interest? It's not open source, but if there's sufficient
demand I can arrange access to it.

[1] https://github.com/featurestream

On 16 January 2015 at 13:59, Nick Allen <n...@nickallen.org> wrote:

> An alternative approach would be to translate your categorical variables
> into dummy variables.  If your strings represent N classes/categories you
> would generate N-1 dummy variables containing 0/1 values.
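
A minimal sketch of the N-1 dummy-variable encoding described above, in plain
Python with no Spark dependency. The function name `dummy_encode` and the sample
colour values are hypothetical, chosen only for illustration; the first category
in sorted order is assumed to be the reference level.

```python
def dummy_encode(values):
    """Encode categorical strings as N-1 dummy (0/1) columns.

    The first category in sorted order acts as the reference level and is
    represented by a row of all zeros.
    """
    categories = sorted(set(values))
    kept = categories[1:]  # drop the reference category, leaving N-1 columns
    rows = [[1.0 if v == c else 0.0 for c in kept] for v in values]
    return kept, rows


# Three categories (blue, green, red) give two dummy columns.
columns, rows = dummy_encode(["red", "green", "blue", "green"])
for value, row in zip(["red", "green", "blue", "green"], rows):
    print(value, dict(zip(columns, row)))
```

Dropping one category avoids a redundant column, since its membership is
implied when all the other dummies are 0.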
>
> Auto-magically creating dummy variables from categorical data definitely
> comes in handy.  I assume this is what SPARK-1216 is referring to, but I am
> not sure from the description.
>
> https://issues.apache.org/jira/browse/SPARK-1216
>
> Auto-magically doing the scheme that Sean mentioned is referenced in
> SPARK-4081, I believe.
>
> https://issues.apache.org/jira/browse/SPARK-4081
>
>
>
> On Fri, Jan 16, 2015 at 4:45 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> The implementation accepts an RDD of LabeledPoint only, so you
>> couldn't feed in strings from a text file directly. LabeledPoint is a
>> wrapper around double values rather than strings. How were you trying
>> to create the input then?
>>
>> No, it only accepts numeric values, although you can encode
>> categorical values as 0, 1, 2 ... and tell the implementation which
>> features are categorical so that it treats them as such.
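
A minimal sketch of the 0, 1, 2 ... encoding described above, in plain Python
with no Spark dependency. `index_encode`, its arguments, and the sample rows are
hypothetical names for illustration. The returned dict maps each categorical
feature index to its number of categories, which is the shape of the
`categoricalFeaturesInfo` argument that MLlib's RandomForest trainers accept.

```python
def index_encode(rows, categorical_columns):
    """Replace string values in the given columns with 0, 1, 2, ... codes.

    Returns the encoded rows (all floats) and a dict mapping each
    categorical column index to its number of distinct categories.
    """
    # One codebook per categorical column: category string -> integer code.
    codebooks = {
        col: {v: i for i, v in enumerate(sorted({row[col] for row in rows}))}
        for col in categorical_columns
    }
    encoded = [
        [float(codebooks[col][v]) if col in codebooks else float(v)
         for col, v in enumerate(row)]
        for row in rows
    ]
    arity = {col: len(book) for col, book in codebooks.items()}
    return encoded, arity


# Column 0 is categorical; column 1 is already numeric text.
raw = [["red", "1.5"], ["green", "0.2"], ["red", "3.0"]]
encoded, arity = index_encode(raw, categorical_columns=[0])
print(encoded)
print(arity)
```

Each encoded row can then be wrapped in a LabeledPoint together with its
label, since every feature is now a double.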
>>
>> On Fri, Jan 16, 2015 at 9:25 PM, Asaf Lahav <asaf.la...@gmail.com> wrote:
>> > Hi,
>> >
>> > I have been playing around with the new version of Spark MLlib Random
>> > forest implementation, and while in the process, tried it with a file
>> > with String Features.
>> > While training, it fails with:
>> > java.lang.NumberFormatException: For input string.
>> >
>> >
>> > Is MLlib Random forest adapted to run on top of numeric data only?
>> >
>> > Thanks
>>
>
>
> --
> Nick Allen <n...@nickallen.org>
>
