Yes, that helps me understand better how Spark works. But that is also what I was afraid of: I think the network communication will take too much time for my job.
I will keep looking for a trick to avoid network communication. I saw on the Hadoop website that: "To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request." Maybe if I somehow manage to combine part of Spark with some of this (see the sketch below), it could work.

Thank you very much for your answer.

Germain.
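As an aside, here is a minimal sketch of how that HDFS behaviour already surfaces in Spark: when an RDD is built from an HDFS file, Spark asks HDFS where each block's replicas live and uses those hosts as preferred locations for the corresponding tasks, so reads stay node-local when an executor runs on a DataNode. The HDFS path below is a placeholder; the rest uses the standard SparkContext/RDD API.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: print the hosts HDFS reports for each block of a file.
    // This is the same information Spark's scheduler uses to place tasks so
    // that reads come from a local DataNode instead of over the network.
    object LocalityCheck {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("LocalityCheck")
        val sc = new SparkContext(conf)

        // Placeholder path: replace with a real HDFS file.
        val rdd = sc.textFile("hdfs:///user/germain/data.txt")

        // preferredLocations(partition) returns the DataNode hosts holding that
        // partition's underlying HDFS block; tasks scheduled there run
        // NODE_LOCAL and avoid shipping the block across the network.
        rdd.partitions.foreach { p =>
          println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
        }

        sc.stop()
      }
    }

Note that this locality is best-effort rather than guaranteed: the scheduler waits up to spark.locality.wait for a slot on a preferred host before falling back to a non-local one, so running executors on the DataNodes themselves is what makes the network transfer disappear in practice.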