Yes, I can implement it like this:

sc.textFile("path").reduce(_ + _).split("!!").filter(x => x.trim.length > 0)

But the reduce operation is expensive! I tested both approaches on a 6 GB
file, with take(10).foreach(println) as the only action on the resulting RDD:
the newAPIHadoopFile version takes only about 2 s, while the code above blocks
for more than a minute, I think because of the reduce.
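For context, here is a minimal sketch of the kind of newAPIHadoopFile setup I am referring to. It assumes the records are separated by "!!" and are split out with the textinputformat.record.delimiter setting of the new Hadoop API; "path" is just a placeholder:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  // Read "!!"-delimited records directly as RDD elements,
  // so no driver-side reduce is needed.
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("textinputformat.record.delimiter", "!!")

  val records = sc
    .newAPIHadoopFile("path", classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], conf)
    .map { case (_, text) => text.toString } // copy out of Hadoop's reused Text object
    .filter(_.trim.nonEmpty)

  records.take(10).foreach(println)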

Would you post a code snippet to illustrate your idea? I couldn't come up
with a simple sequence of map/filter operations on the RDD returned by
textFile. Thanks!




-----
Senior in Tsinghua Univ.
github: http://www.github.com/uronce-cc
