from:"Filipp Zhinkin"

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Filipp Zhinkin

Hi Shahab, do you actually need a few columns with such a huge number of categories, where each category's index depends on the frequency of the original value? If not, you could use the value's hash code as the category, or combine all columns into a single vector using HashingTF. Regards, Filipp. On Tue, Apr 10, 20
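To illustrate the hashing suggestion above, here is a minimal plain-Python sketch of mapping categorical values to a fixed number of buckets without building a frequency-ordered dictionary. It is not Spark's implementation: the helper names `hash_bucket` and `hash_row`, the bucket count, and the use of CRC32 as the hash function are all assumptions chosen for illustration.

```python
import zlib

def hash_bucket(value: str, num_buckets: int = 1 << 18) -> int:
    """Map a categorical value to a bucket index using a stable hash.

    CRC32 is an illustrative stand-in; Python's built-in hash() is
    avoided because it is randomized between interpreter runs.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets

def hash_row(values: dict, num_buckets: int = 1 << 18) -> dict:
    """Combine several categorical columns into one sparse vector.

    Each "column=value" pair is hashed into the same index space, and the
    result is returned as {index: count}. Distinct pairs may collide.
    """
    vec = {}
    for column, value in values.items():
        idx = hash_bucket(f"{column}={value}", num_buckets)
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

if __name__ == "__main__":
    print(hash_bucket("user_123", 1024))
    print(hash_row({"city": "Oslo", "device": "ios"}, 1024))
```

The trade-off is that no pass over the data is needed to build an index, at the cost of possible collisions between unrelated values; a larger bucket count makes collisions less likely.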

Re: ML Transformer: create feature that uses multiple columns

2017-12-09 Thread Filipp Zhinkin

Hi, you can combine multiple columns using org.apache.spark.sql.functions.struct and invoke a UDF on the resulting column. In that case your UDF has to accept a Row as its argument. See VectorAssembler's sources for an example: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spar
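The struct-plus-UDF approach described above might look roughly like this in PySpark (the message itself refers to the Scala API). The column names `width` and `height`, the output column `area`, and the helpers `combine` and `add_area_column` are hypothetical and exist only for this sketch.

```python
def combine(row):
    """Row-level logic: receives the struct as a single Row-like value.

    Works with anything indexable by field name, so it can be checked
    with a plain dict without starting Spark.
    """
    return float(row["width"]) * float(row["height"])

def add_area_column(df):
    """Wrap `combine` in a UDF and apply it to a struct of two columns.

    Assumes `df` is a pyspark.sql.DataFrame with numeric columns
    "width" and "height" (hypothetical names). pyspark is imported
    lazily so the row-level function stays usable without it.
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    area_udf = F.udf(combine, DoubleType())
    # F.struct packs both columns into one struct column, which the
    # Python UDF then receives as a single Row.
    return df.withColumn("area", area_udf(F.struct("width", "height")))

if __name__ == "__main__":
    print(combine({"width": 2, "height": 3}))
```

Keeping the row-level function separate from the Spark wiring makes the feature logic easy to unit-test on its own.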

[ML] LogisticRegression and dataset's standardization before training

2017-12-06 Thread Filipp Zhinkin

Hi, LogisticAggregator [1] scales every sample on every iteration. Without scaling, binaryUpdateInPlace could be rewritten using BLAS.dot, which would significantly improve performance. However, there is a comment [2] saying that standardization and caching of the dataset before training will "cr
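The performance argument above can be sketched in plain Python. This is not Spark's code: the function names are hypothetical, and the zero-standard-deviation guard is an assumption about how per-sample scaling would typically be written. The point is only that scaling inside the loop and a plain dot product over pre-standardized features yield the same margin.

```python
import math

def margin_scaling_each_time(coef, x, std):
    """Margin computed with per-feature scaling inside the loop.

    The division by std is repeated for every sample on every pass.
    Features with zero standard deviation are skipped (an assumption).
    """
    total = 0.0
    for c, v, sd in zip(coef, x, std):
        if sd != 0.0:
            total += c * (v / sd)
    return total

def standardize(x, std):
    """Scale a sample once, up front; zero-variance features become 0."""
    return [v / sd if sd != 0.0 else 0.0 for v, sd in zip(x, std)]

def margin_prestandardized(coef, x_scaled):
    """With pre-scaled features the margin is a plain dot product,
    which is the kind of operation a BLAS dot routine handles."""
    return sum(c * v for c, v in zip(coef, x_scaled))

if __name__ == "__main__":
    coef = [0.5, -1.0, 2.0]
    x = [4.0, 3.0, 1.0]
    std = [2.0, 0.0, 0.5]
    a = margin_scaling_each_time(coef, x, std)
    b = margin_prestandardized(coef, standardize(x, std))
    print(a, b, math.isclose(a, b))
```

Standardizing once moves the division out of the per-iteration loop, at the cost of holding a scaled copy of the data.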

3 matches
