Why the call to reduce? That's building up the entire text file as one string. It seems like you want something more like:
sc.textFile("path").map(_.split("!!").filter(_.trim.length > 0))

This gives an RDD with an array of non-empty String tokens for each line of the file.

On Mon, Jul 28, 2014 at 12:32 PM, gmail <myai...@gmail.com> wrote:

> Yes, I can implement it like:
>
> sc.textFile("path").reduce(_ + _).split("!!").filter(x => x.trim.length > 0)
>
> But the reduce operation is expensive! I tested these two methods on a 6G
> file; the only operation on the created RDD is take(10).foreach(println).
> The method using newAPIHadoopFile takes only 2s, while the code above
> blocks for more than 1 min, because of the reduce, I think.
>
> Would you post a code snippet to illustrate your idea? I didn't come up
> with an easy sequence of map/filter operations on the RDD returned by
> textFile. Thanks!
>
> ================================
> 常铖 cheng chang
> Computer Science Dept. Tsinghua Univ.
> Mobile Phone: 13681572414
> WeChat ID: cccjcl
> ================================
>
> On July 28, 2014, at 5:40:21 PM, chang cheng (myai...@gmail.com) wrote:
>
> The value in the (key, value) pairs returned by textFile is exactly one
> line of the input.
>
> But what I want is the field between the two "!!" markers; hope this
> makes sense.
>
> -----
> Senior in Tsinghua Univ.
> github: http://www.github.com/uronce-cc
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Confusing-behavior-of-newAPIHadoopFile-tp10764p10768.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
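
For reference, the thread mentions the fast newAPIHadoopFile approach but never shows the call. Below is a minimal sketch of one common way to read "!!"-delimited records, assuming Hadoop 2.x's new-API TextInputFormat (whose line reader honors the textinputformat.record.delimiter setting) and using "path" as a placeholder input path, as in the examples above:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Copy the existing Hadoop config and override the record delimiter,
// so TextInputFormat splits records on "!!" instead of newlines.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "!!")

val records = sc.newAPIHadoopFile(
    "path",                    // placeholder input path
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf)
  .map { case (_, text) => text.toString }  // copy out of the reused Text object
  .filter(_.trim.nonEmpty)

records.take(10).foreach(println)

One difference worth noting: textFile still splits records on newlines first, so the map/split suggestion above only works if no "!!"-delimited field spans a line break (and it yields an RDD[Array[String]], one array per line), whereas the record-delimiter approach splits the input on "!!" regardless of newlines and yields a flat RDD[String] with one field per record.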