Hello,

Is it possible in Spark to map partitions such that partitions are column-based rather than row-based? My use case is computing temporal series of numerical values, e.g. exponential moving averages over the values of a given dataset column.
Suppose there is a dataset with roughly 200 columns, a high percentage of which are numerical (> 60%), and at least one timestamp column, as shown in the attached file. I want to shuffle data to executors such that each executor holds a smaller dataset with only two columns: [Col0: Timestamp, Col<X>: Numerical type]. Over that dataset I can then sort by increasing timestamp and iterate over the rows with a custom function that receives a {timestamp; value} tuple.

Partitioning by column value does not make sense for me, since there is a temporal lineage of values which I must preserve. On the other hand, I would like to parallelize this workload, as my datasets can be quite big (> 2 billion rows). The only way I see is to distribute entire columns, so that each executor holds 2B timestamp + numerical values rather than 2B * the size of an entire row.

Is this possible in Spark? Can someone point me in the right direction? A code snippet example (not working is fine, as long as the logic is sound) would be highly appreciated!

Thank you for your time.

--
*Pedro Cardoso*
*Research Engineer*
pedro.card...@feedzai.com

*The content of this email is confidential and intended for the recipient specified in this message only. It is strictly prohibited to share any part of this message with any third party without the written consent of the sender. If you received this message by mistake, please reply to this message and follow with its deletion, so that we can ensure such a mistake does not occur in the future.*
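For reference, here is a minimal PySpark sketch of the approach described above: project each numeric column into a narrow (timestamp, value) DataFrame, sort it by timestamp within a single partition to preserve the temporal lineage, and run a custom EMA function over the rows with mapPartitions. The column names (Col0, Col<X>), the DataFrame `df`, and the smoothing factor `alpha` are illustrative assumptions, not taken from a real dataset, and the single-partition repartition is just one way to keep each column's values together on one executor.

```python
def ema(rows, alpha=0.1):
    """Exponential moving average over (timestamp, value) tuples,
    assumed to arrive sorted by increasing timestamp."""
    avg = None
    for ts, value in rows:
        # Seed with the first value, then blend each new value in.
        avg = value if avg is None else alpha * value + (1 - alpha) * avg
        yield ts, avg

def ema_per_column(df, numeric_cols, alpha=0.1):
    """Sketch only: build one narrow (ts, value) projection per numeric
    column so the work parallelizes across columns, not rows."""
    from pyspark.sql import functions as F  # deferred import: needs pyspark

    results = {}
    for name in numeric_cols:
        narrow = (
            df.select(F.col("Col0").alias("ts"), F.col(name).alias("value"))
              .repartition(1)                # whole column on one executor
              .sortWithinPartitions("ts")    # preserve temporal lineage
        )
        # Each partition is an iterator of Rows; feed it to the EMA.
        results[name] = narrow.rdd.mapPartitions(
            lambda part: ema(((r["ts"], r["value"]) for r in part), alpha)
        )
    return results
```

The pure-Python `ema` function is the part that actually consumes the {timestamp; value} tuples, so it can be unit-tested without a cluster; the Spark plumbing around it only arranges that each invocation sees one column's values in timestamp order.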