I have a dataset in which I need to convert some of the variables to dummy
variables. The get_dummies function in Pandas works perfectly on smaller
datasets, but since it requires collecting the data, I'll always be
bottlenecked by the master node.
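
To make the goal concrete, here is a minimal pure-Python sketch of dummy encoding for a single column. The helper names (distinct_levels, to_dummies) and the sample rows are hypothetical and only for illustration; the column_value naming is meant to mirror what get_dummies produces for a named column. The idea it illustrates is that only the distinct levels need to be gathered in one place, after which each row can be encoded on its own. This is not how Pandas or Spark implement it internally.

```python
from typing import Any, Dict, List


def distinct_levels(rows: List[Dict[str, Any]], column: str) -> List[Any]:
    """Return the sorted distinct values of one column (a small aggregate)."""
    return sorted({row[column] for row in rows})


def to_dummies(row: Dict[str, Any], column: str, levels: List[Any]) -> Dict[str, Any]:
    """Replace row[column] with one 0/1 indicator per level, named column_level."""
    # Keep every field except the one being encoded.
    encoded = {key: value for key, value in row.items() if key != column}
    # Add one indicator field per known level.
    for level in levels:
        encoded[f"{column}_{level}"] = 1 if row[column] == level else 0
    return encoded


# Hypothetical sample data.
rows = [
    {"id": 1, "color": "red"},
    {"id": 2, "color": "blue"},
    {"id": 3, "color": "red"},
]

levels = distinct_levels(rows, "color")
dummies = [to_dummies(row, "color", levels) for row in rows]
```

Because to_dummies only looks at one row plus the short list of levels, the same function could in principle be applied per partition rather than on a single node.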
I've looked at Spark's OHE feature, and while that will work in theory, I
When I spin up an AWS Spark cluster per the Spark EC2 script:
According to AWS:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-requests.html#fixed-duration-spot-instances
there is a way to reserve a fixed-duration Spot cluster through the AWS CLI
and the web portal, but I can't find any