Hi Cheng,

You are right. =)

I checked your article a few days ago. I have some further questions:

According to the article, the following code has O(1) space complexity:

val lines = spark.textFile("hdfs://<input>")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split(" ")(1))
messages.saveAsTextFile("hdfs://<output>")

It reads one line from HDFS at a time, applies the filter closure, and then
the map closure, to that line.

But as far as I know, map, filter, etc. are operations that run per
partition. (If I am wrong, please correct me.) So essentially it is a
partition that is loaded into memory; a task processes one partition at a
time on a CPU core, and the iterator iterates over that partition.
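To make the per-partition iterator idea concrete, here is a minimal sketch in plain Scala (no Spark classes, just an assumption about how narrow transformations chain lazy iterators). The sample strings are made up for illustration:

```scala
// A partition is handed to the task as an Iterator[String] over its lines.
// (Made-up sample data, not from the article.)
val partition: Iterator[String] =
  Iterator("ERROR disk full", "INFO ok", "ERROR timeout")

// Iterator.filter and Iterator.map are lazy: they build a chained iterator
// and do no work yet, so no intermediate collection is materialized.
val errors   = partition.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split(" ")(1))

// Only the final sink (the "action") pulls elements through the whole
// chain, one line at a time.
messages.foreach(println)
```

If Spark pipelines narrow transformations this way, only one element per task is in flight at any moment, which would be the O(1) claim, per element; the partition itself still has to be readable by the iterator.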

If a partition is too large to fit into memory, we will get an OOM error. I
have read some posts about OOM, and increasing the number of partitions is a
common solution.
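For reference, this is how I understand the "more partitions" fix would look in the example above. The partition counts are made-up illustrative numbers, and `<input>` is the same placeholder as before:

```scala
// Ask for more input partitions up front via the minPartitions hint
// (200 is an arbitrary illustrative value):
val lines = spark.textFile("hdfs://<input>", 200)

// Or split the existing partitions after the fact:
val moreParts = lines.repartition(400)
```

Either way, each task then holds a smaller slice of the data, which is presumably why it helps with OOM.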

I am not sure whether my understanding is correct. Any help would be
appreciated.

Thank you.

Hao





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-spark-operation-pipeline-and-block-storage-tp18201p19074.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
