Hi,

I looked for a generic approach to handling XML files in Hive but found none, so I thought I could reuse the concepts from json-serde (http://code.google.com/p/hive-json-serde/) to create a generic XML SerDe. XPath came to mind immediately: it should play the same role for XML that JSON paths play in json-serde. The problem is that a single XML file can contain multiple rows of interest, as in the example below.

    <root>
      <book> ... </book>
      <book> ... </book>
      <book> ... </book>
    </root>

In this case the SerDe is supposed to generate three rows, one for each book node. I looked at the json-serde implementation, but there the deserialize step returns a single ArrayList instance with the column values set at the indices of the ArrayList, and that one instance maps to one row. I do see that the deserialize step can return any Java Object, but I am not sure what the appropriate way would be to return multiple rows, one per book node.

I'm going to give it a shot anyway, but I thought I would ask the community whether somebody has already tried this or has a better approach. I would really appreciate any input. If I succeed, I will share my code; if not, I will come back anyway :-)

Thanks in advance.
-Sumit
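
P.S. To make the row-splitting idea concrete, here is a minimal sketch using only the JDK's javax.xml.xpath API. It does not touch the Hive SerDe interfaces at all; it only shows one XPath selecting the row nodes and further relative XPaths selecting the column values within each row. The class name XmlRowSplitter, the method extractRows, and the child elements title and author are hypothetical names chosen for illustration (the real contents of each book node are elided above). How to hand the resulting rows to Hive is still the open question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Hive or json-serde.
final class XmlRowSplitter {

    /**
     * Returns one list of column values per node matched by rowXPath.
     * Each column XPath is evaluated relative to its row node; a column
     * that matches nothing yields an empty string.
     */
    static List<List<String>> extractRows(String xml, String rowXPath, List<String> columnXPaths) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Select every node that should become a row, e.g. each <book>.
            NodeList rowNodes = (NodeList) xpath.evaluate(rowXPath, doc, XPathConstants.NODESET);

            List<List<String>> rows = new ArrayList<>();
            for (int i = 0; i < rowNodes.getLength(); i++) {
                Node rowNode = rowNodes.item(i);
                List<String> row = new ArrayList<>();
                for (String columnXPath : columnXPaths) {
                    // Evaluate the column path with the row node as context.
                    row.add(xpath.evaluate(columnXPath, rowNode));
                }
                rows.add(row);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not split XML into rows", e);
        }
    }

    public static void main(String[] args) {
        // Sample data with assumed <title> and <author> children.
        String xml = "<root>"
                + "<book><title>First</title><author>Ann</author></book>"
                + "<book><title>Second</title><author>Bob</author></book>"
                + "<book><title>Third</title><author>Cy</author></book>"
                + "</root>";
        for (List<String> row : extractRows(xml, "/root/book", Arrays.asList("title", "author"))) {
            System.out.println(row);
        }
    }
}
```

In this sketch the row path ("/root/book") and the column paths would come from table properties, in the same spirit as the per-column path settings in json-serde; that mapping is an assumption about the design, not something I have implemented.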