Hi, I want to one-hot encode a column containing 56 million distinct values. My dataset has 800 million rows and 17 columns. I first apply a StringIndexer, but it already fails there with an OutOfMemoryError: Java heap space.
I launch my app on YARN with:

```
/opt/spark/2.0.0/bin/spark-shell --executor-memory 10G --num-executors 128 --executor-cores 2 --driver-memory 12G --conf spark.driver.maxResultSize=8G
```

After grabbing the data, I run:

```scala
val catInts = Array("bigfeature")

val stagesIndex = scala.collection.mutable.ArrayBuffer.empty[StringIndexer]
val stagesOhe = scala.collection.mutable.ArrayBuffer.empty[OneHotEncoder]

for (c <- catInts) {
  println(s"starting with $c")
  val i = new StringIndexer()
    .setInputCol(c)
    .setOutputCol(s"${c}Index")
  stagesIndex += i

  val o = new OneHotEncoder()
    .setDropLast(false)
    .setInputCol(s"${c}Index")
    .setOutputCol(s"${c}OHE")
  stagesOhe += o
}

println(s"Applying string indexers: fitting")
val pipelined = new Pipeline().setStages(stagesIndex.toArray)
val dfFitted = pipelined.fit(df)
```

The application master then shows a "countByValue at StringIndexer.scala" job taking 1.8 minutes (so very fast). Afterwards, the shell console hangs for a while. What is it doing now? After some time, it shows:

```
scala> val dfFitted = pipelined.fit(df)
java.lang.OutOfMemoryError: Java heap space
  at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:141)
  at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:139)
  at org.apache.spark.util.collection.OpenHashMap$$anonfun$1.apply$mcVI$sp(OpenHashMap.scala:159)
  at org.apache.spark.util.collection.OpenHashSet.rehash(OpenHashSet.scala:230)
  at org.apache.spark.util.collection.OpenHashSet.rehashIfNeeded(OpenHashSet.scala:167)
  at org.apache.spark.util.collection.OpenHashMap$mcD$sp.update$mcD$sp(OpenHashMap.scala:86)
  at org.apache.spark.ml.feature.StringIndexerModel.<init>(StringIndexer.scala:137)
  at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:93)
  at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:66)
  at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
  at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:145)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
  at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
  at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:145)
  ... 16 elided
```
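For context on where the memory goes: the `StringIndexerModel.<init>` / `OpenHashSet.rehash` frames in the trace show that the fitted model builds a label-to-index hash map of every distinct value on the driver, so all 56M strings must fit in the driver heap at once. A rough back-of-envelope sketch of what that costs (the per-entry byte count is my assumption, not a measured figure, and rehashing can transiently roughly double the peak):

```scala
// Rough driver-heap estimate for StringIndexerModel's label map.
// bytesPerEntry is an assumed figure (Java String header + char data +
// OpenHashMap slot), not something I measured.
object LabelMapEstimate {
  def estimatedGiB(distinctValues: Long, bytesPerEntry: Long = 100L): Double =
    distinctValues * bytesPerEntry / math.pow(1024, 3)

  def main(args: Array[String]): Unit = {
    val gib = estimatedGiB(56000000L)
    println(f"~$gib%.1f GiB just to hold the label map on the driver")
  }
}
```

Under these assumptions that is already ~5 GiB at rest, and closer to the 12G driver heap during a rehash, which would fit with the OOM appearing while the shell hangs after countByValue finishes.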