I have a dataset where I need to convert some of the variables to dummy variables. The get_dummies function in Pandas works perfectly on smaller datasets, but since it requires collecting all the data onto a single machine, I'll always be bottlenecked by the master node.
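For reference, the pandas version that works fine on small samples is just a one-liner (the column names here are placeholders, not my real schema):

```python
import pandas as pd

# Toy frame standing in for a small sample of the real data
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size":  ["S", "M", "S"],
    "price": [10, 20, 10],
})

# get_dummies expands each listed categorical column into 0/1 indicator
# columns and leaves numeric columns untouched
dummies = pd.get_dummies(df, columns=["color", "size"])
print(sorted(dummies.columns))
# ['color_blue', 'color_red', 'price', 'size_M', 'size_S']
```

This is exactly the behavior I want, just at 500+ GB scale.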
I've looked at Spark's OneHotEncoder (OHE) feature, and while it would work in theory, I have over a thousand variables to convert, so I don't want to set up 1000+ OHE stages. My project is simple in scope: read in a raw CSV, convert the categorical variables into dummy variables, then save the transformed data back to CSV. That's why I'm so interested in get_dummies, but it isn't scalable enough for my data size (500-600 GB per file). Thanks in advance. Nick