Hi Steve, I am new to this forum and to Hadoop. I have the same kind of problem: my input file cannot be treated as a plain text file.
Can't we do it like this: define our own InputFormat, InputSplit and RecordReader?

Thanks,
Vamsi


Jeff Zhang-4 wrote:
>
> Hi Steve,
>
> When you want to read XML, you should provide a custom InputFormat
> that extends FileInputFormat, and override the method isSplitable so
> that a file is not split. That means one XML file per mapper.
>
>   protected boolean isSplitable(FileSystem fs, Path filename) {
>     return false;
>   }
>
> Best Regards,
>
> Jeff Zhang
>
> On Thu, Oct 29, 2009 at 12:32 PM, Steve Gao <steve....@yahoo.com> wrote:
>
>> Does anybody have a similar issue? If you store XML files in HDFS, how
>> can you make sure a chunk read by a mapper does not contain partial
>> data of an XML segment?
>>
>> For example:
>>
>> <title>
>> <book>book1</book>
>> <author>me</author>
>> ..............what if this is the boundary of a chunk?...................
>> <year>2009</year>
>> <book>book2</book>
>> <author>me</author>
>> <year>2009</year>
>> <book>book3</book>
>> <author>me</author>
>> <year>2009</year>
>> </title>
>>
>

--
View this message in context: http://old.nabble.com/What-if-an-XML-file-cross-boundary-of-HDFS-chunks--tp26120236p28582046.html
Sent from the Hadoop core-dev mailing list archive at Nabble.com.
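
To make the custom RecordReader idea a bit more concrete, here is a rough sketch in plain Java of the boundary rule such a reader could apply if the file stays splittable. It is not Hadoop code: the class SplitAwareXmlReader and the method readRecords are hypothetical names, the file is assumed to fit in a byte array, and records are assumed to be non-nested elements whose start tag has no attributes (here <book>...</book>). The rule is that a record belongs to the split in which its start tag begins, and the reader is allowed to read past the end of its split to finish that record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop. It only illustrates the boundary
// rule a custom RecordReader could follow for XML records.
class SplitAwareXmlReader {

    // Returns the records whose start tag begins inside [splitStart, splitEnd).
    // Assumes ASCII tags, no nesting of the record element, no attributes.
    static List<String> readRecords(byte[] file, int splitStart, int splitEnd,
                                    String startTag, String endTag) {
        // ISO-8859-1 maps every byte to exactly one char, so indices in this
        // string are the same as byte offsets in the file.
        String raw = new String(file, StandardCharsets.ISO_8859_1);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int open = raw.indexOf(startTag, pos);
            // No more records, or the next one starts in a later split.
            if (open < 0 || open >= splitEnd) {
                break;
            }
            // The end tag may lie beyond splitEnd; we read past the boundary.
            int close = raw.indexOf(endTag, open + startTag.length());
            if (close < 0) {
                break; // truncated record at end of file, skip it
            }
            int recordEnd = close + endTag.length();
            // Decode the record's own bytes as UTF-8 (assumed file encoding).
            records.add(new String(Arrays.copyOfRange(file, open, recordEnd),
                                   StandardCharsets.UTF_8));
            pos = recordEnd;
        }
        return records;
    }
}
```

With this rule, a record that straddles a boundary is emitted once, by the split that contains its start tag, and the following split skips the tail of that record because it searches for the next start tag from its own start offset. In a real job the bytes beyond the split would have to be read from the file stream rather than from an in-memory array, and Jeff's isSplitable approach remains the simpler option when each XML file is small.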