So I found this discussion on the topic: http://mail-archives.apache.org/mod_mbox/hive-user/201006.mbox/%3caanlktikyl3hinowfo36yeyid9vojyh_6pe3slorhy...@mail.gmail.com%3E. It makes more sense now. I will post my final resolution.
On Sun, Jun 24, 2012 at 10:39 PM, Sumit Kumar <ksumi...@gmail.com> wrote:
> Hi,
>
> So I looked for a generic approach for handling XML files in Hive but
> found none, and thought I could use the concepts from json-serde (
> http://code.google.com/p/hive-json-serde/) to create a generic XML
> serde. XPath was what immediately came to mind, and it should work for
> XML the same way JSON paths work in json-serde. The problem is the use
> case where a single XML file contains multiple rows of interest.
> Example shown below.
>
> <root>
> <book> ... </book>
> <book> ... </book>
> <book> ... </book>
> </root>
>
> In this case, the serde is supposed to generate three rows, one for each
> book node. I looked at the json-serde implementation, but there the
> deserialize step returns an ArrayList instance with column values set at
> indices of the ArrayList, and this one instance maps to one row. I do see
> that the deserialize step can return any Java Object, but I am not sure
> what would be the appropriate way to return multiple rows, one per book
> node. I'm going to give it a shot anyway, but thought to seek help from
> the community in case somebody has already tried this or has a better
> approach. I would really appreciate any input. If I succeed, I will share
> my code; if not, I will come back anyway :-)
>
> Thanks in advance.
> -Sumit
>
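To make the row-splitting idea concrete, here is a minimal sketch in plain Java of the XPath part only. It is not a Hive SerDe and uses no Hive classes. It assumes the splitting happens before the serde sees the data, so that each book node ends up as one row of column values. The class name XmlRowSplitter, the method rows, and the title/author child elements are all hypothetical, since the example above elides the contents of each book node.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Hive or json-serde.
final class XmlRowSplitter {

    // Returns one row per node matched by rowXPath. Each row holds the
    // string value of every column XPath, evaluated relative to that node.
    public static List<List<String>> rows(String xml, String rowXPath,
                                          List<String> columnXPaths)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                rowXPath, new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);

        List<List<String>> rows = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            List<String> row = new ArrayList<>();
            for (String column : columnXPaths) {
                // A column that matches nothing yields an empty string.
                row.add(xpath.evaluate(column, node));
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // The title and author elements are assumed for illustration only.
        String xml = "<root>"
                + "<book><title>First</title><author>A</author></book>"
                + "<book><title>Second</title><author>B</author></book>"
                + "<book><title>Third</title><author>C</author></book>"
                + "</root>";
        for (List<String> row : rows(xml, "/root/book",
                Arrays.asList("title", "author"))) {
            System.out.println(row);
        }
    }
}
```

Since one deserialize call maps to one row, a loop like this probably belongs in whatever feeds records to the serde (for example a custom record reader that emits one book node at a time) rather than inside deserialize itself. That placement is an assumption on my part, not something confirmed in this thread.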