Hi, I want to one-hot encode a column containing 56 million distinct values. My dataset has 800 million rows and 17 columns. I first apply a StringIndexer, but it already fails there with an OutOfMemoryError: Java heap space.
I launch my app on YARN with:

```
/opt/spark/2.0.0/bin/spark-shell --executor-memory 10G --num-executors 128 --executor-cores 2 --driver-memory 12G --conf spark.driver.maxResultSize=8G
```

After grabbing the data, I run:

```scala
val catInts = Array("bigfeature")

val stagesIndex = scala.collection.mutable.ArrayBuffer.empty[StringIndexer]
val stagesOhe = scala.collection.mutable.ArrayBuffer.empty[OneHotEncoder]

for (c <- catInts) {
  println(s"starting with $c")
  val i = new StringIndexer()
    .setInputCol(c)
    .setOutputCol(s"${c}Index")
  stagesIndex += i

  val o = new OneHotEncoder()
    .setDropLast(false)
    .setInputCol(s"${c}Index")
    .setOutputCol(s"${c}OHE")
  stagesOhe += o
}

println(s"Applying string indexers: fitting")
val pipelined = new Pipeline().setStages(stagesIndex.toArray)
val dfFitted = pipelined.fit(df)
```

The application master then shows a "countByValue at StringIndexer.scala" job taking 1.8 minutes (so very fast). Afterwards, the shell console hangs for a while. What is it doing now? After some time, it shows:

```
scala> val dfFitted = pipelined.fit(df)
java.lang.OutOfMemoryError: Java heap space
  at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:141)
  at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:139)
  at org.apache.spark.util.collection.OpenHashMap$$anonfun$1.apply$mcVI$sp(OpenHashMap.scala:159)
  at org.apache.spark.util.collection.OpenHashSet.rehash(OpenHashSet.scala:230)
  at org.apache.spark.util.collection.OpenHashSet.rehashIfNeeded(OpenHashSet.scala:167)
  at org.apache.spark.util.collection.OpenHashMap$mcD$sp.update$mcD$sp(OpenHashMap.scala:86)
  at org.apache.spark.ml.feature.StringIndexerModel.<init>(StringIndexer.scala:137)
  at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:93)
  at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:66)
  at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
  at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:145)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
  at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
  at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:145)
  ... 16 elided
```
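For context on where the memory goes: the `StringIndexerModel.<init>` / `OpenHashSet.rehash` frames in the trace show that the fitted model builds a label-to-index hash map of every distinct value on the driver, so all 56M strings must fit in the driver heap at once. A rough back-of-envelope sketch of what that costs (the per-entry byte count is my assumption, not a measured figure, and rehashing can transiently roughly double the peak):

```scala
// Rough driver-heap estimate for StringIndexerModel's label map.
// bytesPerEntry is an assumed figure (Java String header + char data +
// OpenHashMap slot), not something I measured.
object LabelMapEstimate {
  def estimatedGiB(distinctValues: Long, bytesPerEntry: Long = 100L): Double =
    distinctValues * bytesPerEntry / math.pow(1024, 3)

  def main(args: Array[String]): Unit = {
    val gib = estimatedGiB(56000000L)
    println(f"~$gib%.1f GiB just to hold the label map on the driver")
  }
}
```

Under these assumptions that is already ~5 GiB at rest, and closer to the 12G driver heap during a rehash, which would fit with the OOM appearing while the shell hangs after countByValue finishes.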