Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Filipp Zhinkin
Hi Shahab, do you actually need several columns with such a huge number of categories, where each category index depends on the original value's frequency? If not, you could use the value's hash code as the category, or combine all the columns into a single vector using HashingTF. Regards, Filipp. On Tue, Apr 10, 20
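
A minimal sketch of the "hash code as a category" idea, written in plain Scala without Spark. The object name HashCategory, the bucket function, and the bucket count are hypothetical and chosen only for illustration; HashingTF uses its own hash function, so this is not a reproduction of its internals.

```scala
object HashCategory {
  // Map any value to a bucket index in [0, numBuckets) using its hash code.
  // Unlike a frequency-based index, this needs no pass over the data,
  // but distinct values may collide in the same bucket.
  def bucket(value: Any, numBuckets: Int): Int = {
    require(numBuckets > 0, "numBuckets must be positive")
    val hash = if (value == null) 0 else value.hashCode
    val raw = hash % numBuckets
    // The remainder can be negative on the JVM, so shift it into range.
    if (raw < 0) raw + numBuckets else raw
  }

  def main(args: Array[String]): Unit = {
    val numBuckets = 1 << 18 // hypothetical size; larger means fewer collisions
    Seq("user_42", "user_43", "user_42").foreach { v =>
      println(s"$v -> ${bucket(v, numBuckets)}")
    }
  }
}
```

The same value always lands in the same bucket, so the mapping is stable across partitions and runs. The trade-off is that the number of buckets has to be picked up front and collisions are silently merged.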

Re: ML Transformer: create feature that uses multiple columns

2017-12-09 Thread Filipp Zhinkin
Hi, you can combine multiple columns using org.apache.spark.sql.functions.struct and invoke a UDF on the resulting column. In that case your UDF has to accept a Row as its argument. See VectorAssembler's sources for an example: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spar
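
A minimal sketch of the struct-plus-UDF approach, assuming a local SparkSession and Spark SQL on the classpath. The column names width and height, the areaOf function, and the object name are hypothetical; only struct, udf, col and Row come from Spark itself.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}

object StructUdfExample {
  // The struct column arrives in the UDF as a Row; fields are read by position,
  // in the same order the columns were passed to struct().
  val areaOf: Row => Double = r => r.getDouble(0) * r.getDouble(1)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("struct-udf-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input with two numeric columns.
    val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("width", "height")

    val areaUdf = udf(areaOf)

    // Pack both columns into one struct column and hand it to the UDF.
    df.withColumn("area", areaUdf(struct(col("width"), col("height")))).show()

    spark.stop()
  }
}
```

Reading fields by position keeps the UDF independent of column names, but it also means the order of the arguments to struct matters.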

[ML] LogisticRegression and dataset's standardization before training

2017-12-06 Thread Filipp Zhinkin
Hi, LogisticAggregator [1] scales every sample on every iteration. Without scaling, binaryUpdateInPlace could be rewritten using BLAS.dot, which would significantly improve performance. However, there is a comment [2] saying that standardization and caching of the dataset before training will "cr
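
A simplified sketch of the two ways to compute the margin that the message contrasts, in plain Scala. This is not Spark's LogisticAggregator code: the object and function names are hypothetical, the intercept, weights and gradient update are left out, and a plain loop stands in for BLAS.dot.

```scala
object MarginSketch {
  // On-the-fly scaling: every feature is divided by its standard deviation
  // each time the sample is visited, i.e. on every iteration.
  def marginScaled(x: Array[Double], coef: Array[Double], std: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < x.length) {
      if (std(i) != 0.0 && x(i) != 0.0) sum += coef(i) * x(i) / std(i)
      i += 1
    }
    sum
  }

  // One-off standardization, done before training instead of per iteration.
  def standardize(x: Array[Double], std: Array[Double]): Array[Double] =
    x.indices.map(i => if (std(i) != 0.0) x(i) / std(i) else 0.0).toArray

  // Plain dot product, the operation a BLAS routine would perform.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) { sum += a(i) * b(i); i += 1 }
    sum
  }

  def main(args: Array[String]): Unit = {
    val x = Array(2.0, 4.0)
    val std = Array(2.0, 4.0)
    val coef = Array(1.0, 3.0)
    println(marginScaled(x, coef, std))     // scaling inside the loop
    println(dot(standardize(x, std), coef)) // scaling done once up front
  }
}
```

Both paths give the same margin; the difference is whether the division by the standard deviation is repeated on every pass or paid once, at the cost of materializing a standardized copy of the data.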