Are you looking for https://spark.apache.org/docs/latest/ml-features.html#interaction? That's the closest built-in thing I can think of. Otherwise you can write a custom transformation.
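A rough, untested sketch of how Interaction could fit into the kind of pipeline you describe (column names such as age, fare, survived are just Titanic placeholders). Note that Interaction emits a vector column, so it feeds an estimator or VectorAssembler rather than FeatureHasher:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{Interaction, OneHotEncoder, QuantileDiscretizer}

// Bucketize two numeric columns ("age" and "fare" are placeholder Titanic columns).
val ageBuckets = new QuantileDiscretizer()
  .setInputCol("age").setOutputCol("age_bucket").setNumBuckets(5)
val fareBuckets = new QuantileDiscretizer()
  .setInputCol("fare").setOutputCol("fare_bucket").setNumBuckets(5)

// One-hot the bucket indices (Spark 3.x OneHotEncoder) so the interaction
// becomes a proper cross product rather than a product of bucket indices.
val encoder = new OneHotEncoder()
  .setInputCols(Array("age_bucket", "fare_bucket"))
  .setOutputCols(Array("age_vec", "fare_vec"))

// Interaction multiplies all combinations of its input columns and emits a vector.
val cross = new Interaction()
  .setInputCols(Array("age_vec", "fare_vec"))
  .setOutputCol("age_x_fare")

// The output is a vector, so it goes straight to the estimator
// (or through a VectorAssembler), not to FeatureHasher.
val lr = new LogisticRegression()
  .setFeaturesCol("age_x_fare").setLabelCol("survived")

val pipeline = new Pipeline()
  .setStages(Array(ageBuckets, fareBuckets, encoder, cross, lr))
```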
On Fri, Oct 1, 2021, 8:44 AM David Diebold <davidjdieb...@gmail.com> wrote:

> Hello everyone,
>
> In MLlib, I'm trying to rely essentially on pipelines to create features
> out of the Titanic dataset and showcase the power of feature hashing. I want to:
>
> - Apply bucketization to some columns (QuantileDiscretizer is fine).
> - Then cross all my columns with each other to obtain cross features.
> - Then hash all of these cross features into a vector.
> - Then give it to a logistic regression.
>
> Looking at the documentation, it looks like the only way to hash features
> is the *FeatureHasher* transformation. It takes multiple columns as input;
> the type can be numeric, bool, or string (but not vector/array).
>
> Now I'm left wondering how I can create my cross-feature columns. I'm
> looking for a transformation that takes two columns as input and returns a
> numeric, bool, or string. I didn't manage to find anything that does the job.
> There are several transformations, such as VectorAssembler, that operate on
> vectors, but that is not a type accepted by the FeatureHasher.
>
> Of course, I could combine columns directly in my dataframe (before the
> pipeline kicks in), but then I would no longer be able to benefit from
> QuantileDiscretizer and other useful functions.
>
> Am I missing something in the transformation API? Or is my approach to
> hashing wrong? Or should we consider extending the API somehow?
>
> Thank you, kind regards,
>
> David
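For the custom-transformation route: one way to build string cross features inside the pipeline, so FeatureHasher can consume them directly, is SQLTransformer. A minimal, untested sketch (age, fare, sex, pclass, survived are placeholder Titanic columns):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{FeatureHasher, QuantileDiscretizer, SQLTransformer}

// Bucketize inside the pipeline as usual ("age" and "fare" are placeholders).
val ageBuckets = new QuantileDiscretizer()
  .setInputCol("age").setOutputCol("age_bucket").setNumBuckets(5)
val fareBuckets = new QuantileDiscretizer()
  .setInputCol("fare").setOutputCol("fare_bucket").setNumBuckets(5)

// SQLTransformer runs a SQL statement against the current dataframe (__THIS__),
// so it can concatenate columns into string cross features without leaving the pipeline.
val crosser = new SQLTransformer().setStatement(
  """SELECT *,
    |       concat(cast(age_bucket AS string), '_', cast(fare_bucket AS string)) AS age_x_fare,
    |       concat(sex, '_', cast(pclass AS string)) AS sex_x_pclass
    |FROM __THIS__""".stripMargin)

// FeatureHasher accepts the string cross-feature columns directly.
val hasher = new FeatureHasher()
  .setInputCols(Array("age_x_fare", "sex_x_pclass"))
  .setOutputCol("features")
  .setNumFeatures(1 << 18)

val lr = new LogisticRegression()
  .setFeaturesCol("features").setLabelCol("survived")

val pipeline = new Pipeline()
  .setStages(Array(ageBuckets, fareBuckets, crosser, hasher, lr))
```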