Spark dataframe creation through already distributed in-memory data sets

2020-06-16 Thread Tanveer Ahmad - EWI
Hi all, I am new to the Spark community, so please ignore this question if it doesn't make sense. My PySpark DataFrame takes only a fraction of the time (in ms) for 'Sorting', but moving the data is much more expensive (> 14 sec). Explanation: I have a huge collection of Arrow RecordBatches which is equally

GroupBy issue while running K-Means - Dataframe

2020-06-16 Thread Deepak Sharma
Hi All, I have a custom implementation of K-Means that needs the data to be grouped by a key in a dataframe. Now there is a big data skew for some of the keys, where it fails with: BufferHolder: Cannot grow BufferHolder by size 17112 because the size after growing exceeds size limitation 2147
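A common mitigation for this kind of groupBy skew is key salting: split each hot key into several sub-keys, aggregate per sub-key, then merge the partial results in a second, much smaller groupBy. A minimal plain-Python sketch of the idea (the `salt_buckets` parameter is illustrative, not a Spark API):

```python
import random
from collections import Counter

def salt_key(key, salt_buckets=8):
    # Append a random suffix so one hot key becomes up to
    # salt_buckets sub-keys, spreading its rows across groups.
    return f"{key}#{random.randrange(salt_buckets)}"

# Simulated skew: one key dominates the dataset.
keys = ["hot"] * 1000 + ["cold"] * 10
salted = Counter(salt_key(k) for k in keys)

# The "hot" key is now split over up to 8 sub-groups, so no single
# group has to buffer all 1000 rows at once.
print(sorted(salted))
```

In Spark the same trick would be applied by adding a salt column before the `groupBy` and then re-aggregating on the original key, which keeps any one group's buffered state under the 2 GB per-buffer limit the error refers to.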