Yes, I can implement it like: sc.textFile("path").reduce(_ + _).split("!!").filter(x => x.trim.length > 0)
But the reduce operation is expensive! I tested both approaches on a 6 GB file, where the only action on the resulting RDD was take(10).foreach(println): the newAPIHadoopFile version takes about 2 s, while the code above blocks for more than a minute, I think because of the reduce. Could you post a code snippet to illustrate your idea? I couldn't come up with a simple map/filter sequence on the RDD returned by textFile. Thanks!

================================
常铖 cheng chang
Computer Science Dept. Tsinghua Univ.
Mobile Phone: 13681572414
WeChat ID: cccjcl
================================

On July 28, 2014, at 5:40:21 PM, chang cheng (myai...@gmail.com) wrote:

the value in (key, value) returned by textFile is exactly one line of the input. But what I want is the field between the two "!!"; hope this makes sense.

-----
Senior in Tsinghua Univ.
github: http://www.github.com/uronce-cc
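For illustration, here is a minimal sketch of the newAPIHadoopFile approach under discussion: it sets Hadoop's textinputformat.record.delimiter so the input is split into records on "!!" at read time, with no reduce. The "path" string and the "!!" delimiter come from the snippets above; everything else is an assumed setup, not code from this thread.

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  // Treat "!!" as the record separator, so each RDD element
  // is one field between the "!!" markers.
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("textinputformat.record.delimiter", "!!")

  val records = sc.newAPIHadoopFile(
      "path",                  // placeholder path from the snippet above
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf)
    .map { case (_, text) => text.toString } // copy: Hadoop reuses the Text object
    .filter(_.trim.nonEmpty)

  records.take(10).foreach(println)

Because the delimiting happens in the record reader as each split is scanned, take(10) only needs to read the beginning of the file, which would explain the roughly 2 s vs. 1 min gap: the reduce-based version must first pull the whole 6 GB onto the driver as one string.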