Are you looking for https://spark.apache.org/docs/latest/ml-features.html#interaction? That's the closest built-in thing I can think of. Otherwise you can write a custom transformation.
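A rough, untested sketch of how Interaction could fit into the kind of pipeline you describe (column names such as age, fare, survived are just Titanic placeholders). Note that Interaction emits a vector column, so it feeds an estimator or VectorAssembler rather than FeatureHasher:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{Interaction, OneHotEncoder, QuantileDiscretizer}

// Bucketize two numeric columns ("age" and "fare" are placeholder Titanic columns).
val ageBuckets = new QuantileDiscretizer()
  .setInputCol("age").setOutputCol("age_bucket").setNumBuckets(5)
val fareBuckets = new QuantileDiscretizer()
  .setInputCol("fare").setOutputCol("fare_bucket").setNumBuckets(5)

// One-hot the bucket indices (Spark 3.x OneHotEncoder) so the interaction
// becomes a proper cross product rather than a product of bucket indices.
val encoder = new OneHotEncoder()
  .setInputCols(Array("age_bucket", "fare_bucket"))
  .setOutputCols(Array("age_vec", "fare_vec"))

// Interaction multiplies all combinations of its input columns and emits a vector.
val cross = new Interaction()
  .setInputCols(Array("age_vec", "fare_vec"))
  .setOutputCol("age_x_fare")

// The output is a vector, so it goes straight to the estimator
// (or through a VectorAssembler), not to FeatureHasher.
val lr = new LogisticRegression()
  .setFeaturesCol("age_x_fare").setLabelCol("survived")

val pipeline = new Pipeline()
  .setStages(Array(ageBuckets, fareBuckets, encoder, cross, lr))
```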
On Fri, Oct 1, 2021, 8:44 AM David Diebold <davidjdieb...@gmail.com> wrote:

> Hello everyone,
>
> In MLlib, I'm trying to rely essentially on pipelines to create features
> out of the Titanic dataset and showcase the power of feature hashing. I want to:
>
> - Apply bucketization to some columns (QuantileDiscretizer is fine).
> - Then cross all my columns with each other to obtain cross features.
> - Then hash all of these cross features into a vector.
> - Then give it to a logistic regression.
>
> Looking at the documentation, it looks like the only way to hash features
> is the *FeatureHasher* transformation. It takes multiple columns as input;
> the type can be numeric, bool, or string (but not vector/array).
>
> Now I'm left wondering how I can create my cross-feature columns. I'm
> looking for a transformation that takes two columns as input and returns a
> numeric, bool, or string. I didn't manage to find anything that does the job.
> There are several transformations, such as VectorAssembler, that operate on
> vectors, but that is not a type accepted by the FeatureHasher.
>
> Of course, I could combine columns directly in my dataframe (before the
> pipeline kicks in), but then I would no longer be able to benefit from
> QuantileDiscretizer and other useful functions.
>
> Am I missing something in the transformation API? Or is my approach to
> hashing wrong? Or should we consider extending the API somehow?
>
> Thank you, kind regards,
>
> David
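For the custom-transformation route: one way to build string cross features inside the pipeline, so FeatureHasher can consume them directly, is SQLTransformer. A minimal, untested sketch (age, fare, sex, pclass, survived are placeholder Titanic columns):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{FeatureHasher, QuantileDiscretizer, SQLTransformer}

// Bucketize inside the pipeline as usual ("age" and "fare" are placeholders).
val ageBuckets = new QuantileDiscretizer()
  .setInputCol("age").setOutputCol("age_bucket").setNumBuckets(5)
val fareBuckets = new QuantileDiscretizer()
  .setInputCol("fare").setOutputCol("fare_bucket").setNumBuckets(5)

// SQLTransformer runs a SQL statement against the current dataframe (__THIS__),
// so it can concatenate columns into string cross features without leaving the pipeline.
val crosser = new SQLTransformer().setStatement(
  """SELECT *,
    |       concat(cast(age_bucket AS string), '_', cast(fare_bucket AS string)) AS age_x_fare,
    |       concat(sex, '_', cast(pclass AS string)) AS sex_x_pclass
    |FROM __THIS__""".stripMargin)

// FeatureHasher accepts the string cross-feature columns directly.
val hasher = new FeatureHasher()
  .setInputCols(Array("age_x_fare", "sex_x_pclass"))
  .setOutputCol("features")
  .setNumFeatures(1 << 18)

val lr = new LogisticRegression()
  .setFeaturesCol("features").setLabelCol("survived")

val pipeline = new Pipeline()
  .setStages(Array(ageBuckets, fareBuckets, crosser, hasher, lr))
```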