Yes, the resulting matrix would be sparse. Thanks for the suggestion. I'll explore building the vectors with an aggregate and a UDF instead of a pivot.
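For the archives, a rough sketch of the kind of thing I have in mind (untested; it assumes attribute_id can be used directly as a 0-based vector index, so if the ids are arbitrary longs they would need to be remapped to contiguous indices first):

    from pyspark.sql import functions as F
    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import SparseVector, VectorUDT

    # Assumption: attribute_id values are usable as 0-based vector indices.
    num_attributes = long_frame.agg(F.max("attribute_id")).first()[0] + 1

    def to_sparse_vector(pairs):
        # pairs is the list of (attribute_id, event_count) structs for one entity.
        # SparseVector expects indices in ascending order, so sort first.
        sorted_pairs = sorted((p["attribute_id"], p["event_count"]) for p in pairs)
        indices = [int(i) for i, _ in sorted_pairs]
        values = [float(v) for _, v in sorted_pairs]
        return SparseVector(num_attributes, indices, values)

    to_vector = udf(to_sparse_vector, VectorUDT())

    entity_vectors = (
        long_frame.groupBy("entity_id")
        .agg(F.collect_list(F.struct("attribute_id", "event_count")).alias("pairs"))
        .withColumn("features", to_vector("pairs"))
        .drop("pairs")
    )

This avoids materializing ~130 million columns and keeps each entity's attributes as a single sparse vector.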
On Fri, Oct 30, 2020 at 6:26 AM Patrick McCarthy <pmccar...@dstillery.com.invalid> wrote:

> That's a very large vector. Is it sparse? Perhaps you'd have better luck
> performing an aggregate instead of a pivot, and assembling the vector using
> a UDF.
>
> On Thu, Oct 29, 2020 at 10:19 PM Daniel Chalef
> <daniel.cha...@sparkpost.com.invalid> wrote:
>
>> Hello,
>>
>> I have a very large long-format dataframe (several billion rows) that I'd
>> like to pivot and vectorize (using the VectorAssembler), with the aim to
>> reduce dimensionality using something akin to TF-IDF. Once pivoted, the
>> dataframe will have ~130 million columns.
>>
>> The source, long-format schema looks as follows:
>>
>> root
>>  |-- entity_id: long (nullable = false)
>>  |-- attribute_id: long (nullable = false)
>>  |-- event_count: integer (nullable = true)
>>
>> Pivoting as per the following fails, exhausting executor and driver
>> memory. I am unsure whether increasing memory limits would be successful
>> here as my sense is that pivoting and then using a VectorAssembler isn't
>> the right approach to solving this problem.
>>
>> wide_frame = (
>>     long_frame.groupBy("entity_id")
>>     .pivot("attribute_id")
>>     .agg(F.first("event_count"))
>> )
>>
>> Are there other Spark patterns that I should attempt in order to achieve
>> my end goal of a vector of attributes for every entity?
>>
>> Thanks, Daniel
>
>
> --
>
> *Patrick McCarthy*
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016