Hi all, I'm working on a problem where it is necessary to find all combinations of columns for a dataframe.
THE PROBLEM: Let's say there is a dataframe with columns: [ col_a, col_b, col_c, col_d, col_e, result ] The number of combinations can vary between 1 and 5 but lets say 3 for this case. This would result in something like: [ col_a, col_b, col_c, result ] [ col_b, col_c, col_d, result ] [ col_b, col_c, col_e, result ] ... etc These would have their columns renamed and combined to result in: [ col_1, col_2, col_3, result ] THE CURRENT APPROACH: The current approach is to use python and itertools.combinations to produce all combinations. Then for each combination create a new dataframe from the original dataframe using df.select(cols).toDF(*cols).cache() {Cache and count are being used to help track progress and resolve previous issues} The generated dataframe is added to a python array like the_dfs.append(df.select(cols).toDF(*cols).cache()) the_dfs[len(the_dfs)].count() The dataframes are finally combined using df_all = reduce(DataFrame.union, the_dfs).cache() df_all.count() THE CURRENT STATE: The proof of concept works on a smaller amount of data but as the data size increases this approach has proven to be unreliable. I've had several blocking issues that I've resolved by caching dataframes. Can anyone provide any criticism on the current approach or advice on a different approach?