Shouldn't you be using the textFile() method? You are reading the file
directly through TextInputFormat, so you get the raw (key,value) pairs
back, which for TextInputFormat are (byte offset, line) — the key is a
byte offset, not a line number. The empty/repeated output from your
first snippet is the classic pitfall with raw Hadoop RDDs: the
RecordReader reuses the same Writable objects for every record, so
take(3) hands back three references to the one LongWritable/Text pair,
which by then reflects the state after the last record was read. Mapping
each pair to pair._2.toString (as your second snippet and Spark's own
textFile do) copies the value out into an immutable String, which is why
it works. Your second solution is fine if, for some reason, you need to
use that method.
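The object-reuse pitfall can be shown without Spark or Hadoop at all. Below is a minimal plain-Scala sketch (the Record class, ReuseDemo object, and the sample offsets are made up for illustration) of an iterator that, like Hadoop's RecordReader, mutates and re-yields one shared record object instead of allocating a fresh one per element:

```scala
// A mutable record, standing in for the (LongWritable, Text) pair
// that Hadoop's RecordReader reuses across records.
class Record(var offset: Long, var text: String) {
  override def toString: String = s"($offset,$text)"
}

object ReuseDemo {
  // Yields the SAME Record instance for every element, mutated in place.
  def records(lines: Seq[(Long, String)]): Iterator[Record] = {
    val shared = new Record(0L, "")
    lines.iterator.map { case (off, txt) =>
      shared.offset = off
      shared.text = txt
      shared // same reference every time
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq((0L, "field1"), (60L, "field2"), (120L, "field3"))

    // Collecting the raw records keeps three references to ONE object,
    // so every entry shows the state left by the last read.
    val raw = records(data).toList
    println(raw.mkString(" "))   // (120,field3) (120,field3) (120,field3)

    // Copying a value out per record (like pair._2.toString) fixes it.
    val copied = records(data).map(_.text).toList
    println(copied.mkString(" ")) // field1 field2 field3
  }
}
```

This is why the Spark docs recommend map-ing Writables to plain values before collecting, caching, or shuffling an RDD produced by hadoopFile/newAPIHadoopFile.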

On Mon, Jul 28, 2014 at 9:02 AM, chang cheng <myai...@gmail.com> wrote:
> Hi, all:
>
> I have a Hadoop file containing fields separated by "!!", like below:
> !!
> field1
> key1 value1
> key2 value2
> !!
> field2
> key3 value3
> key4 value4
> !!
>
> I want to read the file into (key, value) pairs with TextInputFormat,
> specifying the record delimiter as "!!\n"
>
> First, I tried the following code:
>
>     val hadoopConf = new Configuration()
>     hadoopConf.set("textinputformat.record.delimiter", "!!\n")
>
>     val path = args(0)
>     val rdd = sc.newAPIHadoopFile(path, classOf[TextInputFormat],
>       classOf[LongWritable], classOf[Text], hadoopConf)
>
>     rdd.take(3).foreach(println)
>
> Far from expectation, the result is:
>
>     (120,)
>     (120,)
>     (120,)
>
> According to my experimentation, "120" is the byte offset of the last field
> separated by "!!"
>
> After digging into the Spark source code, I find "textFile" is implemented
> as:
>
>      hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable],
>        classOf[Text], minPartitions).map(pair => pair._2.toString).setName(path)
>
> So, I modified my initial code into: (the added map call is the modification)
>
>     val hadoopConf = new Configuration()
>     hadoopConf.set("textinputformat.record.delimiter", "!!\n")
>
>     val path = args(0)
>     val rdd = sc.newAPIHadoopFile(path, classOf[TextInputFormat],
>       classOf[LongWritable], classOf[Text], hadoopConf)
>       .map(pair => pair._2.toString)
>
>     rdd.take(3).foreach(println)
>
> Then, the results are:
>
>     field1
>     key1 value1
>     key2 value2
>
>     field2
>     ....
> As expected.
>
> I'm confused by the first code snippet's behavior.
> Hope you can offer an explanation. Thanks!
>
>
>
> -----
> Senior in Tsinghua Univ.
> github: http://www.github.com/uronce-cc
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Confusing-behavior-of-newAPIHadoopFile-tp10764.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
