For now OneHotEncoder (OHE) supports only a single input column, so you would
need 1000 OHE stages in a pipeline. However, you can add them programmatically,
so it is not too bad. If the cardinality of each feature is fairly low, it
should be workable.

After that, use VectorAssembler (which accepts multiple input columns) to
stitch the vectors together.

The other approach - if your features are all categorical - is to encode the
features as "feature_name=feature_value" strings. Unfortunately this can only
be done with RDD operations, since a UDF can't accept multiple columns as
input at this time. You can create a new column containing all the feature
name/value pairs as a list of strings, e.g. ["feature_1=foo",
"feature_2=bar", ...]. Then use CountVectorizer to create your binary
vectors. This works much like scikit-learn's DictVectorizer.

On Fri, 11 Nov 2016 at 20:33 nsharkey <nicholasshar...@gmail.com> wrote:

> I have a dataset that I need to convert some of the the variables to dummy
> variables. The get_dummies function in Pandas works perfectly on smaller
> datasets but since it collects I'll always be bottlenecked by the master
> node.
>
> I've looked at Spark's OHE feature and while that will work in theory I
> have over a thousand variables I need to convert so I don't want to have to
> do 1000+ OHE. My project is pretty simple in scope: read in a raw CSV,
> convert the categorical variables into dummy variables, then save the
> transformed data back to CSV. That is why I'm so interested in get_dummies
> but it's not scalable enough for my data size (500-600GB per file).
>
> Thanks in advance.
>
> Nick
>
> ------------------------------
> View this message in context: Finding a Spark Equivalent for Pandas'
> get_dummies
> <http://apache-spark-user-list.1001560.n3.nabble.com/Finding-a-Spark-Equivalent-for-Pandas-get-dummies-tp28064.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
