Hi Shahab,
do you actually need a few columns with such a huge number of
categories, where the category depends on the original value's frequency?
If not, you may use the value's hash code as the category, or combine all
of the columns into a single vector using HashingTF.
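For instance, here is a minimal sketch of the HashingTF route (the
DataFrame df, the column names "c1"/"c2", and the 2^18 feature count are
just assumptions for illustration):

    import org.apache.spark.ml.feature.HashingTF
    import org.apache.spark.sql.functions.{array, col}

    // Gather the categorical values into one array column so HashingTF
    // can hash them all into a single fixed-size vector.
    val withTerms = df.withColumn("terms", array(col("c1"), col("c2")))

    val hashingTF = new HashingTF()
      .setInputCol("terms")
      .setOutputCol("features")
      .setNumFeatures(1 << 18) // dimensionality is fixed, independent of cardinality

    val hashed = hashingTF.transform(withTerms)

The output dimensionality stays constant no matter how many distinct
categories show up, at the cost of possible hash collisions.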
Regards,
Filipp.
Hi,
you can combine multiple columns using
org.apache.spark.sql.functions.struct and invoke a UDF on the resulting
column.
In that case your UDF has to accept a Row as its argument.
See VectorAssembler's sources for an example:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spar
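A minimal sketch of the pattern (the DataFrame df and the numeric
columns "a"/"b" are assumptions for illustration):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{col, struct, udf}

    // The UDF receives the struct column as a Row; the fields come
    // back in the order they were passed to struct().
    val sumCols = udf { (row: Row) =>
      row.getDouble(0) + row.getDouble(1)
    }

    val result = df.withColumn("sum", sumCols(struct(col("a"), col("b"))))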
Hi,
LogisticAggregator [1] rescales every sample on every iteration. Without
that scaling, binaryUpdateInPlace could be rewritten using BLAS.dot,
which would significantly improve performance.
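To illustrate the difference, here is a sketch of the idea using plain
arrays (names mirror LogisticAggregator, but this is not Spark's actual
code):

    // Current shape: each feature is divided by its standard deviation
    // inside the inner loop, on every pass over the data.
    def marginScaled(coef: Array[Double], x: Array[Double], std: Array[Double]): Double = {
      var m = 0.0
      var i = 0
      while (i < x.length) {
        if (std(i) != 0.0) m += coef(i) * x(i) / std(i)
        i += 1
      }
      m
    }

    // If the dataset were standardized (and cached) once before training,
    // the per-iteration work collapses to a plain dot product, i.e. BLAS.dot.
    def marginDot(coef: Array[Double], x: Array[Double]): Double = {
      var m = 0.0
      var i = 0
      while (i < x.length) {
        m += coef(i) * x(i)
        i += 1
      }
      m
    }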
However, there is a comment [2] saying that standardization and
caching of the dataset before training will "cr