I think you can just select the columns you need into new DataFrames, then
process those separately.

val dfFirstTwo = ds.select("Col1", "Col2")
# do whatever with this one
dfFirstTwo.sort(...)
# similar for the next two columns
val dfNextTwo = ds.select("Col3", "Col4")
dfNextTwo.sort(...)

These should result in separate tasks, which you could confirm by checking
the Spark UI when the application is submitted.

On Thu, Sep 24, 2020 at 7:01 AM Pedro Cardoso <pedro.card...@feedzai.com>
wrote:

> Hello,
>
> Is it possible in Spark to map partitions such that partitions are
> column-based and not row-based?
> My use-case is to compute temporal series of numerical values.
> I.e: Exponential moving averages over the values of a given dataset's
> column.
>
> Suppose there is a dataset with roughly 200 columns, a high percentage of
> which are numerical (> 60%) and at least one timestamp column, as shown in
> the attached file.
>
> I want to shuffle data to executors such that each executor has a smaller
> dataset with only 2 columns, [Col0: Timestamp, Col<X>: Numerical type].
> Over which I can then sort the dataset by increasing timestamp and then
> iterate over the rows with a custom function which receives a tuple:
> {timestamp; value}.
>
> Partitoning by column value does not make sense for me since there is a
> temporal lineage of values which I must keep. On the other hand I would
> like to parallelize this workload as my datasets can be quite big (> 2
> billion rows). The only way I see how is to distribute the entire columns
> so that each executor has 2B timestamp + numerical values rather than
> 2B*size of an entire row.
>
> Is this possible in Spark? Can someone point in the right direction? A
> code snippet example (not working is fine if the logic is sound) would be
> highly appreciated!
>
> Thank you for your time.
> --
>
> *Pedro Cardoso*
>
> *Research Engineer*
>
> pedro.card...@feedzai.com
>
>
> [image: Follow Feedzai on Facebook.] 
> <https://www.facebook.com/Feedzai/>[image:
> Follow Feedzai on Twitter!] <https://twitter.com/feedzai>[image: Connect
> with Feedzai on LinkedIn!] <https://www.linkedin.com/company/feedzai/>
>
>
> [image: Feedzai best in class aite report]
> <https://feedzai.com/press-releases/aite-group-names-feedzai-market-leader/>
>
> *The content of this email is confidential and intended for the recipient
> specified in message only. It is strictly prohibited to share any part of
> this message with any third party, without a written consent of the sender.
> If you received this message by mistake, please reply to this message and
> follow with its deletion, so that we can ensure such a mistake does not
> occur in the future.*
>
> *The content of this email is confidential and intended for the recipient
> specified in message only. It is strictly prohibited to share any part of
> this message with any third party, without a written consent of the sender.
> If you received this message by mistake, please reply to this message and
> follow with its deletion, so that we can ensure such a mistake does not
> occur in the future.*
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Reply via email to