Spark column combinations and combining multiple dataframes (pyspark)

Christopher Petrino Mon, 26 Nov 2018 09:56:33 -0800

Hi all, I'm working on a problem where it is necessary to find all
combinations of columns for a dataframe.


THE PROBLEM:

Let's say there is a dataframe with columns:
[ col_a, col_b, col_c, col_d, col_e, result ]

The number of combinations can vary between 1 and 5 but lets say 3 for this
case.

This would result in something like:
[ col_a, col_b, col_c, result ]
[ col_b, col_c, col_d, result ]
[ col_b, col_c, col_e, result ]
... etc

These would have their columns renamed and combined to result in:
[ col_1, col_2, col_3, result ]


THE CURRENT APPROACH:

The current approach is to use python and itertools.combinations to produce
all combinations. Then for each combination create a new dataframe from the
original dataframe using

df.select(cols).toDF(*cols).cache()

{Cache and count are being used to help track progress and resolve previous
issues}
The generated dataframe is added to a python array like

the_dfs.append(df.select(cols).toDF(*cols).cache())
the_dfs[len(the_dfs)].count()

The dataframes are finally combined using

df_all = reduce(DataFrame.union, the_dfs).cache()
df_all.count()


THE CURRENT STATE:

The proof of concept works on a smaller amount of data but as the data size
increases this approach has proven to be unreliable. I've had several
blocking issues that I've resolved by caching dataframes.

Can anyone provide any criticism on the current approach or advice on a
different approach?

Spark column combinations and combining multiple dataframes (pyspark)

Reply via email to