Hello,

Is it possible in Spark to map over partitions such that the partitions
are column-based rather than row-based?
My use case is computing time series of numerical values, e.g.
exponential moving averages over the values of a given dataset column.
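
For concreteness, the recurrence I have in mind is the usual EMA one:

  ema(t) = alpha * value(t) + (1 - alpha) * ema(t - 1),  with ema(0) = value(0)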

Suppose there is a dataset with roughly 200 columns, a high percentage of
which (> 60%) are numerical, plus at least one timestamp column, as shown
in the attached file.

I want to shuffle data to the executors such that each executor holds a
smaller dataset with only two columns, [Col0: Timestamp, Col<X>:
Numerical]. I can then sort that dataset by increasing timestamp and
iterate over the rows with a custom function that receives a
{timestamp, value} tuple.

Partitioning by column value does not make sense in my case, since there is
a temporal lineage of values which I must preserve. On the other hand, I
would like to parallelize this workload, as my datasets can be quite big
(> 2 billion rows). The only way I see is to distribute entire columns,
so that each executor holds 2B {timestamp, value} pairs rather than 2B
times the size of an entire row.
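
To make the intent concrete, below is a rough, untested sketch of the
logic I have in mind (written against the DataFrame API; `df`, the column
names, and alpha are placeholders): unpivot to a long
(colName, timestamp, value) format with stack, shuffle one partition per
column, sort within each partition by timestamp, and fold an EMA over
each partition.

import org.apache.spark.sql.functions._
import spark.implicits._  // assumes an existing SparkSession named `spark`

// Placeholder names; in reality this would be the ~120 numerical columns.
val numericCols: Seq[String] = Seq("colA", "colB")

// Unpivot to long format: one (colName, timestamp, value) row per cell,
// so each original column becomes one group of rows.
val stackExpr = numericCols
  .map(c => s"'$c', cast(`$c` as double)")
  .mkString(s"stack(${numericCols.size}, ", ", ", ") as (colName, value)")
val long = df.selectExpr("timestamp", stackExpr)

// One shuffle partition per column, sorted by timestamp inside each
// partition so that the temporal lineage is preserved.
val perColumn = long
  .repartition(numericCols.size, col("colName"))
  .sortWithinPartitions("timestamp")

// Fold an EMA sequentially over each partition. Hash repartitioning may
// co-locate several columns, hence one running EMA per column name.
val alpha = 0.1
val emas = perColumn.mapPartitions { rows =>
  val state = scala.collection.mutable.Map.empty[String, Double]
  rows.map { r =>
    val name = r.getString(r.fieldIndex("colName"))
    val ts   = r.getTimestamp(r.fieldIndex("timestamp"))
    val v    = r.getDouble(r.fieldIndex("value"))
    val ema  = state.get(name).map(p => alpha * v + (1 - alpha) * p).getOrElse(v)
    state(name) = ema
    (name, ts, ema)
  }
}.toDF("colName", "timestamp", "ema")

My main doubt is whether this scales, and whether there is a cleaner way
to guarantee exactly one column per partition.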

Is this possible in Spark? Can someone point me in the right direction? A
code snippet example (a non-working one is fine, as long as the logic is
sound) would be highly appreciated!

Thank you for your time.
--

Pedro Cardoso
Research Engineer
pedro.card...@feedzai.com



The content of this email is confidential and intended for the recipient
specified in message only. It is strictly prohibited to share any part of
this message with any third party, without a written consent of the sender.
If you received this message by mistake, please reply to this message and
follow with its deletion, so that we can ensure such a mistake does not
occur in the future.
