Yes, I could implement it like: sc.textFile("path").reduce(_ + _).split("!!").filter(x => x.trim.length > 0)
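
For reference, the newAPIHadoopFile version I am comparing against looks roughly like the sketch below (a sketch only; the textinputformat.record.delimiter key and the Hadoop imports are my assumptions about the standard new-API TextInputFormat, not code from this thread):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val conf = new Configuration(sc.hadoopConfiguration)
    // make Hadoop split records on "!!" instead of newlines
    conf.set("textinputformat.record.delimiter", "!!")

    val records = sc
      .newAPIHadoopFile("path", classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text], conf)
      // Text objects are reused by the record reader, so copy them to Strings
      .map { case (_, text) => text.toString }
      .filter(_.trim.nonEmpty)

    records.take(10).foreach(println)
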
But the reduce operation is expensive! I tested the two methods on a 6 GB file, where the only operation on the created RDD is take(10).foreach(println): the newAPIHadoopFile version takes about 2 s, while the code above blocks for more than a minute, presumably because of the reduce. Would you post a code snippet to illustrate your idea? I couldn't come up with a simple sequence of map/filter operations on the RDD returned by textFile. Thanks!

-----
Senior in Tsinghua Univ.
github: http://www.github.com/uronce-cc