The value in the (key, value) pairs returned by textFile is exactly one line of the input.

But what I want is each field between two "!!" delimiters; hope this makes sense.
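
For concreteness, a rough sketch of the per-record parsing I am after, assuming each "!!"-delimited record has the field name on its first line followed by "key value" lines (parseRecord is an illustrative name, not anything from Spark or Hadoop):

def parseRecord(record: String): (String, Seq[(String, String)]) = {
  // First non-empty line is the field name; the rest are "key value" pairs.
  val lines = record.split("\n").map(_.trim).filter(_.nonEmpty)
  val pairs = lines.tail.map { line =>
    val Array(key, value) = line.split(" ", 2)
    (key, value)
  }
  (lines.head, pairs.toSeq)
}

// e.g. rdd.map(parseRecord), where rdd is the RDD[String] from the
// second snippet quoted below.
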
================================
常铖 cheng chang
Computer Science Dept. Tsinghua Univ.
Mobile Phone: 13681572414
WeChat ID: cccjcl
================================

On Jul 28, 2014, at 5:05:06 PM, Sean Owen (so...@cloudera.com) wrote:

Shouldn't you be using the textFile() method? You are reading the file
directly using TextInputFormat, so you get the raw (key, value) pairs
back, which for TextInputFormat are (byte offset, line). Your second
solution is fine if, for some reason, you need to use that method.
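
To make that concrete, a minimal sketch of the two return types (sc and
path as in your code; the explicit RDD annotations are only for
illustration):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.rdd.RDD

// textFile() unwraps the Writables for you and yields plain Strings:
val lines: RDD[String] = sc.textFile(path)

// newAPIHadoopFile() hands back the raw (key, value) Writable pairs:
val conf = new Configuration()
conf.set("textinputformat.record.delimiter", "!!\n")
val raw: RDD[(LongWritable, Text)] =
  sc.newAPIHadoopFile(path, classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)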

On Mon, Jul 28, 2014 at 9:02 AM, chang cheng <myai...@gmail.com> wrote:  
> Hi, all:  
>  
> I have a Hadoop file containing fields separated by "!!", like below:
> !!  
> field1  
> key1 value1  
> key2 value2  
> !!  
> field2  
> key3 value3  
> key4 value4  
> !!  
>  
> I want to read the file into (key, value) pairs with TextInputFormat, specifying
> the record delimiter as "!!".
>  
> First, I tried the following code:  
>  
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.io.{LongWritable, Text}
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
>
> val hadoopConf = new Configuration()
> hadoopConf.set("textinputformat.record.delimiter", "!!\n")
>  
> val path = args(0)  
> val rdd = sc.newAPIHadoopFile(path, classOf[TextInputFormat],
>   classOf[LongWritable], classOf[Text], hadoopConf)
>  
> rdd.take(3).foreach(println)
>  
> Far from what I expected, the result is:
>  
> (120,)  
> (120,)  
> (120,)  
>  
> According to my experimentation, "120" is the byte offset of the last field
> delimited by "!!" in the file.
>  
> After digging into the Spark source code, I found that textFile is implemented
> as:
>  
> hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable],
>   classOf[Text], minPartitions).map(pair => pair._2.toString).setName(path)
>  
> So I modified my initial code as follows (the modification is marked in *bold*):
>  
> val hadoopConf = new Configuration()  
> hadoopConf.set("textinputformat.record.delimiter", "!!\n")  
>  
> val path = args(0)  
> val rdd = sc.newAPIHadoopFile(path, classOf[TextInputFormat],
>   classOf[LongWritable], classOf[Text], hadoopConf)
>   .*map(pair => pair._2.toString)*
>  
> rdd.take(3).foreach(println)
>  
> Then, the results are:  
>  
> field1
> key1 value1  
> key2 value2  
>  
> field2  
> ....  
> As expected.  
>  
> I'm confused by the first code snippet's behavior.  
> Hope you can offer an explanation. Thanks!  
>  
>  
>  
> -----  
> Senior in Tsinghua Univ.  
> github: http://www.github.com/uronce-cc  
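
A likely explanation for the repeated (120,) output: Hadoop's
RecordReader re-uses the same Writable objects for every record it
reads (the Spark API docs for hadoopRDD and newAPIHadoopFile carry a
note to this effect), so take(3) gathers three references to one
(LongWritable, Text) pair, which by the time it is printed holds
whatever the reader last wrote into it. Mapping each value to an
immutable copy before collecting, as the second snippet does, removes
the aliasing. A minimal sketch, reusing the names from the original
post:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val hadoopConf = new Configuration()
hadoopConf.set("textinputformat.record.delimiter", "!!\n")

// Copy each re-used Writable into immutable values before collecting,
// so take() does not end up with aliases of a single mutable pair.
val rdd = sc.newAPIHadoopFile(args(0), classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (offset, text) => (offset.get, text.toString) }

rdd.take(3).foreach(println)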
