Has anybody run into a similar issue? If you store XML files in HDFS, how can you make sure that a chunk read by a mapper does not contain only part of an XML segment?
For example:
<title>
<book>book1</book>
<author>me</author>
..............what if this is the boundary of a chunk?...................
<year>2009</year>
<book>book2</book>
<author>me</author>
<year>2009</year>
<book>book3</book>
<author>me</author>
<year>2009</year>
</title>
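
To make the question concrete, here is a minimal sketch in plain Python (not Hadoop code) of the convention that split-aware readers commonly follow: a reader owns every record whose start tag begins inside its byte range, and it reads past the end of its range to finish the last record. The function name `read_records` and the `<record>` wrapper tag are hypothetical, chosen only for illustration; the sample above has no such wrapper element.

```python
def read_records(data: bytes, start: int, end: int,
                 start_tag: bytes = b"<record>",
                 end_tag: bytes = b"</record>"):
    """Yield every complete record whose start tag begins in [start, end).

    `data` stands in for the whole file; in a real job the reader would
    seek to `start` and stream bytes instead of holding the file in memory.
    The tag names are assumptions made for this sketch.
    """
    pos = start
    while True:
        begin = data.find(start_tag, pos)
        # No more records, or the next one begins in a later split:
        # that split's reader is responsible for it.
        if begin == -1 or begin >= end:
            return
        close = data.find(end_tag, begin)
        if close == -1:
            return  # truncated record at end of file; nothing complete to emit
        close += len(end_tag)
        # This may read beyond `end`; that is how a record straddling
        # the boundary is kept whole.
        yield data[begin:close]
        pos = close


if __name__ == "__main__":
    doc = b"<record>a</record><record>b</record><record>c</record>"
    # Offset 25 falls in the middle of the second record.
    print(list(read_records(doc, 0, 25)))
    print(list(read_records(doc, 25, len(doc))))
```

With the boundary at byte 25, the first reader returns the first two records in full and the second reader returns only the third, so no record is cut in half or read twice. Whether a particular Hadoop input format behaves this way for XML is exactly what I am asking about.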
