This may be a bug. The source can be found at:
https://github.com/purijatin/spark-retrain-bug

*Issue:*
The program takes as input a set of documents, where each document is in a
separate file.
The Spark program computes the tf-idf of the terms (Tokenizer -> StopWordsRemover ->
stemming -> TF -> IDF).
Once training is complete, it unpersists the data and restarts the operation
again.

def main(args: Array[String]): Unit = {
  for {
    iter <- Stream.from(1, 1)
  } {
    val data = new DataReader(spark).data

    val pipelineData: Pipeline =
      new Pipeline().setStages(english("reviewtext") ++ english("summary"))

    val pipelineModel = pipelineData.fit(data)

    val all: DataFrame = pipelineModel.transform(data)
      .withColumn("rowid", functions.monotonically_increasing_id())

    // evaluate the pipeline
    all.rdd.foreach(x => x)
    println(s"$iter - ${all.count()}. ${new Date()}")
    data.unpersist()
  }
}
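For context, the tf-idf computation that the pipeline's TF and IDF stages perform can be sketched in plain Scala (no Spark dependency). This is a minimal illustration, not the repository's code; the smoothed formula log((N + 1) / (df + 1)) matches the one documented for Spark ML's `IDF`:

```scala
// Raw term frequency: count of each term within one document.
def tf(doc: Seq[String]): Map[String, Double] =
  doc.groupBy(identity).map { case (t, xs) => t -> xs.size.toDouble }

// Smoothed inverse document frequency across the corpus,
// as documented for Spark ML's IDF: log((N + 1) / (df(t) + 1)).
def idf(corpus: Seq[Seq[String]]): Map[String, Double] = {
  val n = corpus.size.toDouble
  corpus.flatMap(_.distinct).groupBy(identity).map {
    case (t, xs) => t -> math.log((n + 1) / (xs.size + 1))
  }
}

// tf-idf of one document, given the corpus-level idf weights.
def tfidf(doc: Seq[String], idfs: Map[String, Double]): Map[String, Double] =
  tf(doc).map { case (t, f) => t -> f * idfs.getOrElse(t, 0.0) }
```

A term that appears in every document gets an idf of log((N + 1) / (N + 1)) = 0, so its tf-idf weight vanishes regardless of its frequency.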



When run with `-Xmx1500M` memory, it fails with an OutOfMemoryError after
about 5 iterations.

*Temporary Fix:*
When all the input documents are merged into a single file, the issue no
longer occurs.
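The merge step above can be done outside Spark; assuming plain-text documents under a hypothetical `docs/` directory, they can be concatenated into one input file (file names here are illustrative only):

```shell
# Hypothetical layout: one document per file under docs/.
mkdir -p docs
printf 'first document\n'  > docs/doc1.txt
printf 'second document\n' > docs/doc2.txt

# Merge all per-document files into a single input file.
cat docs/*.txt > merged.txt
```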


Spark version: 2.3.0. Dependency information can be found here:
https://github.com/purijatin/spark-retrain-bug/blob/master/project/Dependencies.scala

Thanks,
Jatin
