I have a dataset where I need to convert some of the variables to dummy variables. The get_dummies function in Pandas works perfectly on smaller datasets, but since it requires collecting all the data onto a single machine, I'll always be bottlenecked by the master node.
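For reference, the pandas version that works fine on small samples is just a one-liner (the column names here are placeholders, not my real schema):

```python
import pandas as pd

# Toy frame standing in for a small sample of the real data
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size":  ["S", "M", "S"],
    "price": [10, 20, 10],
})

# get_dummies expands each listed categorical column into 0/1 indicator
# columns and leaves numeric columns untouched
dummies = pd.get_dummies(df, columns=["color", "size"])
print(sorted(dummies.columns))
# ['color_blue', 'color_red', 'price', 'size_M', 'size_S']
```

This is exactly the behavior I want, just at 500+ GB scale.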
I've looked at Spark's OneHotEncoder (OHE) feature, and while it would work in theory, I have over a thousand variables to convert, so I don't want to set up 1000+ OHE stages. My project is simple in scope: read in a raw CSV, convert the categorical variables into dummy variables, then save the transformed data back to CSV. That's why I'm so interested in get_dummies, but it isn't scalable enough for my data size (500-600 GB per file). Thanks in advance. Nick