StringIndexer with high cardinality huge data

Shahab Yunus Tue, 10 Apr 2018 06:02:49 -0700

Does the StringIndexer
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala>
keep all of the label-to-index mappings in the memory of the driver machine?
It seems to, unless I am missing something.
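
To make the concern concrete, here is a minimal plain-Python sketch of how I
understand the fitting step to behave. It is not Spark's code:
fit_string_index and transform are hypothetical names, and the alphabetical
tie-breaking is my own assumption. The only point is that the label list has
one entry per distinct value in the column and lives in a single process.

```python
from collections import Counter


def fit_string_index(values):
    """Hypothetical stand-in for the fitting step (not Spark's implementation).

    Returns the distinct labels ordered by descending frequency. Ties are
    broken alphabetically here, which is an assumption made for determinism.
    """
    # One counter entry per distinct label, all held in local memory.
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))


def transform(values, labels):
    """Map each value to the position of its label, as a float index."""
    label_to_index = {label: float(i) for i, label in enumerate(labels)}
    return [label_to_index[v] for v in values]


column = ["a", "b", "c", "a", "a", "c"]
labels = fit_string_index(column)    # size grows with column cardinality
indexed = transform(column, labels)
print(labels)
print(indexed)
```

With millions of distinct values per column, and several such columns, the
labels list above is the part I am worried about.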


What if the data to be indexed is huge, the columns to be indexed have high
cardinality (many distinct categories), and more than one such column needs
to be indexed? In that case the mappings would not fit in memory.

Thanks.

Regards,
Shahab
