Maybe this is a bug. The source can be found at: https://github.com/purijatin/spark-retrain-bug
*Issue:* The program takes as input a set of documents, where each document is in a separate file. The Spark program computes the tf-idf of the terms (Tokenizer -> StopWordsRemover -> stemming -> TF -> TF-IDF). Once the training is complete, it unpersists the data and restarts the operation:

```scala
import java.util.Date

import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.{DataFrame, functions}

def main(args: Array[String]): Unit = {
  // DataReader and english(...) are defined in the linked repository
  for {
    iter <- Stream.from(1, 1)
  } {
    val data = new DataReader(spark).data
    val pipelineData: Pipeline = new Pipeline().setStages(english("reviewtext") ++ english("summary"))
    val pipelineModel = pipelineData.fit(data)
    val all: DataFrame = pipelineModel.transform(data)
      .withColumn("rowid", functions.monotonically_increasing_id())
    // evaluate the pipeline
    all.rdd.foreach(x => x)
    println(s"$iter - ${all.count()}. ${new Date()}")
    data.unpersist()
  }
}
```

When run with `-Xmx1500M` memory, it fails with an OutOfMemoryError (OOME) after about 5 iterations.

*Temporary Fix:* When all the input documents are merged into a single file, the issue is no longer observed.

Spark version: 2.3.0. Dependency information can be found here: https://github.com/purijatin/spark-retrain-bug/blob/master/project/Dependencies.scala
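For reference, since the `english(...)` helper is only defined in the linked repository and not shown above, here is a minimal sketch of what such a helper could look like with Spark 2.3's ML API. This is an assumption-based reconstruction, not the project's actual code: the intermediate column names are illustrative, and the stemming stage is omitted because Spark ML ships no built-in stemmer (the project presumably plugs in a third-party transformer for that step):

```scala
import org.apache.spark.ml.PipelineStage
import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}

// Hypothetical reconstruction of english(col): builds one tf-idf
// sub-pipeline for a single input column. Intermediate column names
// are illustrative, not the repository's actual names.
def english(col: String): Array[PipelineStage] = {
  val tokenizer = new Tokenizer()
    .setInputCol(col)
    .setOutputCol(s"${col}_tokens")
  val remover = new StopWordsRemover()
    .setInputCol(s"${col}_tokens")
    .setOutputCol(s"${col}_filtered")
  // stemming stage omitted here: Spark ML has no built-in stemmer
  val tf = new HashingTF()
    .setInputCol(s"${col}_filtered")
    .setOutputCol(s"${col}_tf")
  val idf = new IDF()
    .setInputCol(s"${col}_tf")
    .setOutputCol(s"${col}_tfidf")
  Array(tokenizer, remover, tf, idf)
}
```

Each call returns the stages for one column, so `english("reviewtext") ++ english("summary")` concatenates two such sub-pipelines into the single array passed to `setStages`.

Thanks,
Jatin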