Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-10-30 Thread Daniel Chalef
ate instead of a pivot, and assembling the vector using
> a UDF.
>
> On Thu, Oct 29, 2020 at 10:19 PM Daniel Chalef
> wrote:
>
>> Hello,
>>
>> I have a very large long-format dataframe (several billion rows) that I'd
>> like to pivot and vectorize (using th

[Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-10-29 Thread Daniel Chalef
Hello, I have a very large long-format dataframe (several billion rows) that I'd like to pivot and vectorize (using the VectorAssembler), with the aim of reducing dimensionality using something akin to TF-IDF. Once pivoted, the dataframe will have ~130 million columns. The source, long-format schem
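
For readers following the thread, here is a minimal sketch of the general idea the follow-up message appears to describe: aggregating the long-format rows per entity and assembling a sparse vector directly, rather than pivoting into ~130 million columns. It is plain Python, not Spark, and is not the poster's actual code. The source schema is cut off in the archive, so the (entity_id, feature, value) row layout, the feature_index mapping, and the function name long_to_sparse are all assumptions made for illustration.

```python
from collections import defaultdict


def long_to_sparse(rows, feature_index):
    """Group long-format rows by entity and build sparse-vector triples.

    rows: iterable of (entity_id, feature, value) tuples (hypothetical layout).
    feature_index: dict mapping each feature name to a column index.
    Returns {entity_id: (size, sorted_indices, values)}.
    """
    grouped = defaultdict(dict)
    for entity_id, feature, value in rows:
        idx = feature_index[feature]
        # Aggregate step: duplicate (entity, feature) pairs are summed.
        grouped[entity_id][idx] = grouped[entity_id].get(idx, 0.0) + float(value)

    size = len(feature_index)
    vectors = {}
    for entity_id, cells in grouped.items():
        # Sparse vectors conventionally store indices in ascending order.
        indices = sorted(cells)
        values = [cells[i] for i in indices]
        vectors[entity_id] = (size, indices, values)
    return vectors


# Tiny illustrative input; real data would have billions of rows.
rows = [
    ("u1", "a", 1.0),
    ("u1", "c", 2.0),
    ("u2", "b", 3.0),
    ("u1", "a", 0.5),
]
feature_index = {"a": 0, "b": 1, "c": 2}

vectors = long_to_sparse(rows, feature_index)
for entity_id in sorted(vectors):
    print(entity_id, vectors[entity_id])
```

In PySpark, the grouping step would roughly correspond to a groupBy with collect_list over (index, value) structs, and the assembly step to a UDF that returns pyspark.ml.linalg.Vectors.sparse(size, indices, values). Only the non-zero entries are stored per entity, so the wide pivoted dataframe never has to be materialized.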