Perhaps adding random numbers (a salt) to the entity_id column can help you avoid the errors (exhausting executor and driver memory) when you solve the issue Patrick's way.
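A minimal sketch of what that salting might look like, assuming the long-format schema from the original message; NUM_SALTS and the two-stage aggregation below are illustrative, not something taken from this thread:

import pyspark.sql.functions as F

NUM_SALTS = 32  # assumed value; tune to your data and cluster

# Tag each row with a random salt so one entity's rows spread across groups.
salted = long_frame.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# First-stage aggregation on (entity_id, salt): each group stays small, so
# no single task has to build an entity's full attribute list in one pass.
partial = (
    salted.groupBy("entity_id", "salt")
    .agg(F.collect_list(F.struct("attribute_id", "event_count")).alias("partial_attrs"))
)

# Second stage merges the partial lists back into one row per entity.
merged = (
    partial.groupBy("entity_id")
    .agg(F.flatten(F.collect_list("partial_attrs")).alias("attrs"))
)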
On Sat, Oct 31, 2020 at 12:42 AM Daniel Chalef <daniel.cha...@sparkpost.com.invalid> wrote:

> Yes, the resulting matrix would be sparse. Thanks for the suggestion. Will
> explore ways of doing this using an agg and UDF.
>
> On Fri, Oct 30, 2020 at 6:26 AM Patrick McCarthy
> <pmccar...@dstillery.com.invalid> wrote:
>
>> That's a very large vector. Is it sparse? Perhaps you'd have better luck
>> performing an aggregate instead of a pivot, and assembling the vector
>> using a UDF.
>>
>> On Thu, Oct 29, 2020 at 10:19 PM Daniel Chalef
>> <daniel.cha...@sparkpost.com.invalid> wrote:
>>
>>> Hello,
>>>
>>> I have a very large long-format dataframe (several billion rows) that
>>> I'd like to pivot and vectorize (using the VectorAssembler), with the aim
>>> to reduce dimensionality using something akin to TF-IDF. Once pivoted, the
>>> dataframe will have ~130 million columns.
>>>
>>> The source, long-format schema looks as follows:
>>>
>>> root
>>>  |-- entity_id: long (nullable = false)
>>>  |-- attribute_id: long (nullable = false)
>>>  |-- event_count: integer (nullable = true)
>>>
>>> Pivoting as per the following fails, exhausting executor and driver
>>> memory. I am unsure whether increasing memory limits would be successful
>>> here as my sense is that pivoting and then using a VectorAssembler isn't
>>> the right approach to solving this problem.
>>>
>>> wide_frame = (
>>>     long_frame.groupBy("entity_id")
>>>     .pivot("attribute_id")
>>>     .agg(F.first("event_count"))
>>> )
>>>
>>> Are there other Spark patterns that I should attempt in order to achieve
>>> my end goal of a vector of attributes for every entity?
>>>
>>> Thanks, Daniel
>>
>> --
>> Patrick McCarthy
>> Senior Data Scientist, Machine Learning Engineering
>> Dstillery
>> 470 Park Ave South, 17th Floor, NYC 10016
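For what it's worth, a rough sketch of the aggregate-and-UDF approach Patrick describes above might look like the following. It assumes attribute_id values are dense 0-based integers and that there is at most one row per (entity_id, attribute_id); the helper name to_sparse_vector is illustrative:

import pyspark.sql.functions as F
from pyspark.ml.linalg import SparseVector, VectorUDT

# Vector length: assumes attribute_id is a dense 0-based index.
num_attributes = long_frame.agg(F.max("attribute_id")).first()[0] + 1

@F.udf(returnType=VectorUDT())
def to_sparse_vector(attrs):
    # attrs is a list of (attribute_id, event_count) rows; SparseVector
    # wants indices in increasing order, so sort before building it.
    pairs = sorted((int(a["attribute_id"]), float(a["event_count"] or 0)) for a in attrs)
    return SparseVector(num_attributes, [i for i, _ in pairs], [v for _, v in pairs])

vectors = (
    long_frame.groupBy("entity_id")
    .agg(F.collect_list(F.struct("attribute_id", "event_count")).alias("attrs"))
    .select("entity_id", to_sparse_vector("attrs").alias("features"))
)

The resulting features column could then feed the TF-IDF-like dimensionality reduction directly, without ever materializing a ~130-million-column pivot.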