Yes, the resulting matrix would be sparse. Thanks for the suggestion. I'll explore building the vectors with an aggregate and a UDF instead of a pivot.
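For the archives, a rough sketch of the kind of thing I have in mind (untested; it assumes attribute_id can be used directly as a 0-based vector index, so if the ids are arbitrary longs they would need to be remapped to contiguous indices first):

    from pyspark.sql import functions as F
    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import SparseVector, VectorUDT

    # Assumption: attribute_id values are usable as 0-based vector indices.
    num_attributes = long_frame.agg(F.max("attribute_id")).first()[0] + 1

    def to_sparse_vector(pairs):
        # pairs is the list of (attribute_id, event_count) structs for one entity.
        # SparseVector expects indices in ascending order, so sort first.
        sorted_pairs = sorted((p["attribute_id"], p["event_count"]) for p in pairs)
        indices = [int(i) for i, _ in sorted_pairs]
        values = [float(v) for _, v in sorted_pairs]
        return SparseVector(num_attributes, indices, values)

    to_vector = udf(to_sparse_vector, VectorUDT())

    entity_vectors = (
        long_frame.groupBy("entity_id")
        .agg(F.collect_list(F.struct("attribute_id", "event_count")).alias("pairs"))
        .withColumn("features", to_vector("pairs"))
        .drop("pairs")
    )

This avoids materializing ~130 million columns and keeps each entity's attributes as a single sparse vector.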
On Fri, Oct 30, 2020 at 6:26 AM Patrick McCarthy <pmccar...@dstillery.com.invalid> wrote:

> That's a very large vector. Is it sparse? Perhaps you'd have better luck
> performing an aggregate instead of a pivot, and assembling the vector using
> a UDF.
>
> On Thu, Oct 29, 2020 at 10:19 PM Daniel Chalef
> <daniel.cha...@sparkpost.com.invalid> wrote:
>
>> Hello,
>>
>> I have a very large long-format dataframe (several billion rows) that I'd
>> like to pivot and vectorize (using the VectorAssembler), with the aim to
>> reduce dimensionality using something akin to TF-IDF. Once pivoted, the
>> dataframe will have ~130 million columns.
>>
>> The source, long-format schema looks as follows:
>>
>> root
>>  |-- entity_id: long (nullable = false)
>>  |-- attribute_id: long (nullable = false)
>>  |-- event_count: integer (nullable = true)
>>
>> Pivoting as per the following fails, exhausting executor and driver
>> memory. I am unsure whether increasing memory limits would be successful
>> here as my sense is that pivoting and then using a VectorAssembler isn't
>> the right approach to solving this problem.
>>
>> wide_frame = (
>>     long_frame.groupBy("entity_id")
>>     .pivot("attribute_id")
>>     .agg(F.first("event_count"))
>> )
>>
>> Are there other Spark patterns that I should attempt in order to achieve
>> my end goal of a vector of attributes for every entity?
>>
>> Thanks, Daniel
>
>
> --
>
> *Patrick McCarthy*
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016