Hi Cheng,

You are right. =)
I checked your article a few days ago, and I have some further questions.

According to the article, the following code has O(1) space complexity:

    val lines = spark.textFile("hdfs://<input>")
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split(" ")(1))
    messages.saveAsTextFile("hdfs://<output>")

That is, it reads one line from HDFS at a time and applies the filter closure, then the map closure, to that line.

But as far as I know, map, filter, etc. are operations based on partitions. (If I am wrong, please correct me.) So essentially it is a partition that is loaded into memory: a task processes one partition at a time on a CPU core, and the iterator is based on the partition. If the partition is too large to fit into memory, we will get an OOM. I have read some posts about OOM, and increasing the number of partitions is a common solution.

I am not sure whether my understanding is right. Help needed.

Thank you,
Hao
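P.S. To make my understanding concrete, here is a small sketch of how I picture the execution (a sketch only, assuming an existing SparkContext named sc; the paths are the same placeholders as above). Rewriting the chain with mapPartitions makes the per-partition iterator explicit:

    // Sketch only: assumes an existing SparkContext `sc`; paths are placeholders.
    val lines = sc.textFile("hdfs://<input>")
    // Each task receives its partition as an Iterator[String]. The filter and
    // map below are lazy Scala Iterator transformations, so records are pulled
    // through the whole chain one at a time rather than the partition being
    // materialized in memory first.
    val messages = lines.mapPartitions { iter =>
      iter.filter(_.startsWith("ERROR")).map(_.split(" ")(1))
    }
    messages.saveAsTextFile("hdfs://<output>")

If that picture is right, then is the advice to increase the number of partitions (e.g., via sc.textFile(path, minPartitions) or rdd.repartition(n)) only relevant when something actually buffers a whole partition in memory, such as caching or a shuffle?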