Finding a Spark Equivalent for Pandas' get_dummies

2016-11-11 Thread nsharkey
I have a dataset in which I need to convert some of the variables to dummy variables. The get_dummies function in Pandas works perfectly on smaller datasets, but since it collects, I'll always be bottlenecked by the master node. I've looked at Spark's OHE feature, and while that will work in theory I

Specifying Fixed Duration (Spot Block) for AWS Spark EC2 Cluster

2016-07-04 Thread nsharkey
When I spin up an AWS Spark cluster per the Spark EC2 script: According to AWS: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-requests.html#fixed-duration-spot-instances there is a way of reserving a fixed-duration Spot cluster through the AWS CLI and the web portal, but I can't find any