Perhaps adding random numbers (a salt) to the entity_id column can help you avoid the errors (exhausting executor and driver memory) when you solve the issue Patrick's way.
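A minimal sketch of what that salting might look like, assuming the long-format schema from the original message; NUM_SALTS and the two-stage aggregation below are illustrative, not something taken from this thread:

import pyspark.sql.functions as F

NUM_SALTS = 32  # assumed value; tune to your data and cluster

# Tag each row with a random salt so one entity's rows spread across groups.
salted = long_frame.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# First-stage aggregation on (entity_id, salt): each group stays small, so
# no single task has to build an entity's full attribute list in one pass.
partial = (
    salted.groupBy("entity_id", "salt")
    .agg(F.collect_list(F.struct("attribute_id", "event_count")).alias("partial_attrs"))
)

# Second stage merges the partial lists back into one row per entity.
merged = (
    partial.groupBy("entity_id")
    .agg(F.flatten(F.collect_list("partial_attrs")).alias("attrs"))
)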
On Sat, Oct 31, 2020 at 12:42 AM Daniel Chalef <daniel.cha...@sparkpost.com.invalid> wrote:

> Yes, the resulting matrix would be sparse. Thanks for the suggestion. Will
> explore ways of doing this using an agg and UDF.
>
> On Fri, Oct 30, 2020 at 6:26 AM Patrick McCarthy
> <pmccar...@dstillery.com.invalid> wrote:
>
>> That's a very large vector. Is it sparse? Perhaps you'd have better luck
>> performing an aggregate instead of a pivot, and assembling the vector
>> using a UDF.
>>
>> On Thu, Oct 29, 2020 at 10:19 PM Daniel Chalef
>> <daniel.cha...@sparkpost.com.invalid> wrote:
>>
>>> Hello,
>>>
>>> I have a very large long-format dataframe (several billion rows) that
>>> I'd like to pivot and vectorize (using the VectorAssembler), with the aim
>>> to reduce dimensionality using something akin to TF-IDF. Once pivoted, the
>>> dataframe will have ~130 million columns.
>>>
>>> The source, long-format schema looks as follows:
>>>
>>> root
>>>  |-- entity_id: long (nullable = false)
>>>  |-- attribute_id: long (nullable = false)
>>>  |-- event_count: integer (nullable = true)
>>>
>>> Pivoting as per the following fails, exhausting executor and driver
>>> memory. I am unsure whether increasing memory limits would be successful
>>> here as my sense is that pivoting and then using a VectorAssembler isn't
>>> the right approach to solving this problem.
>>>
>>> wide_frame = (
>>>     long_frame.groupBy("entity_id")
>>>     .pivot("attribute_id")
>>>     .agg(F.first("event_count"))
>>> )
>>>
>>> Are there other Spark patterns that I should attempt in order to achieve
>>> my end goal of a vector of attributes for every entity?
>>>
>>> Thanks, Daniel
>>
>> --
>> Patrick McCarthy
>> Senior Data Scientist, Machine Learning Engineering
>> Dstillery
>> 470 Park Ave South, 17th Floor, NYC 10016
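For what it's worth, a rough sketch of the aggregate-and-UDF approach Patrick describes above might look like the following. It assumes attribute_id values are dense 0-based integers and that there is at most one row per (entity_id, attribute_id); the helper name to_sparse_vector is illustrative:

import pyspark.sql.functions as F
from pyspark.ml.linalg import SparseVector, VectorUDT

# Vector length: assumes attribute_id is a dense 0-based index.
num_attributes = long_frame.agg(F.max("attribute_id")).first()[0] + 1

@F.udf(returnType=VectorUDT())
def to_sparse_vector(attrs):
    # attrs is a list of (attribute_id, event_count) rows; SparseVector
    # wants indices in increasing order, so sort before building it.
    pairs = sorted((int(a["attribute_id"]), float(a["event_count"] or 0)) for a in attrs)
    return SparseVector(num_attributes, [i for i, _ in pairs], [v for _, v in pairs])

vectors = (
    long_frame.groupBy("entity_id")
    .agg(F.collect_list(F.struct("attribute_id", "event_count")).alias("attrs"))
    .select("entity_id", to_sparse_vector("attrs").alias("features"))
)

The resulting features column could then feed the TF-IDF-like dimensionality reduction directly, without ever materializing a ~130-million-column pivot.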