Why the call to reduce? That's building up the entire text file as one string. It seems like you want something more like:
sc.textFile("path").map(_.split("!!").filter(_.trim.length > 0))

This gives an RDD with an array of non-empty String tokens for each line of the file.

On Mon, Jul 28, 2014 at 12:32 PM, gmail <myai...@gmail.com> wrote:

> Yes, I can implement it like:
>
> sc.textFile("path").reduce(_ + _).split("!!").filter(x => x.trim.length > 0)
>
> But the reduce operation is expensive! I tested these two methods on a 6G
> file; the only operation on the created RDD is take(10).foreach(println).
> The method using newAPIHadoopFile takes only 2s, while the code above
> blocks for more than 1 min, because of the reduce, I think.
>
> Would you post a code snippet to illustrate your idea? I didn't come up
> with an easy sequence of map/filter operations on the RDD returned by
> textFile. Thanks!
>
> ================================
> 常铖 cheng chang
> Computer Science Dept. Tsinghua Univ.
> Mobile Phone: 13681572414
> WeChat ID: cccjcl
> ================================
>
> On July 28, 2014, at 5:40:21 PM, chang cheng (myai...@gmail.com) wrote:
>
> The value in the (key, value) pairs returned by textFile is exactly one
> line of the input.
>
> But what I want is the field between the two "!!" markers; hope this
> makes sense.
>
> -----
> Senior in Tsinghua Univ.
> github: http://www.github.com/uronce-cc
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Confusing-behavior-of-newAPIHadoopFile-tp10764p10768.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
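
For reference, the thread mentions the fast newAPIHadoopFile approach but never shows the call. Below is a minimal sketch of one common way to read "!!"-delimited records, assuming Hadoop 2.x's new-API TextInputFormat (whose line reader honors the textinputformat.record.delimiter setting) and using "path" as a placeholder input path, as in the examples above:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Copy the existing Hadoop config and override the record delimiter,
// so TextInputFormat splits records on "!!" instead of newlines.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "!!")

val records = sc.newAPIHadoopFile(
    "path",                    // placeholder input path
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf)
  .map { case (_, text) => text.toString }  // copy out of the reused Text object
  .filter(_.trim.nonEmpty)

records.take(10).foreach(println)

One difference worth noting: textFile still splits records on newlines first, so the map/split suggestion above only works if no "!!"-delimited field spans a line break (and it yields an RDD[Array[String]], one array per line), whereas the record-delimiter approach splits the input on "!!" regardless of newlines and yields a flat RDD[String] with one field per record.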