Hi,

I looked for a generic approach to handling XML files in Hive but found
none, so I thought I could borrow the concepts from json-serde
(http://code.google.com/p/hive-json-serde/) to create a generic XML SerDe.
XPath immediately came to mind, and it should work for an XML SerDe in the
same way that JSON works for json-serde. The problem is the use case where
a single XML file contains multiple rows of interest, as in the example
below.

<root>
 <book> ... </book>
 <book> ... </book>
 <book> ... </book>
</root>

In this case, the SerDe is supposed to generate three rows, one for each
book node. I looked at the json-serde implementation, but there the
deserialize step returns a single ArrayList instance with the column values
set at its indices, and that one instance maps to one row. I do see that
the deserialize step can return any Java Object, but I am not sure what the
appropriate way would be to return multiple rows, one per book node. I'm
going to give it a shot anyway, but I thought I would ask the community in
case somebody has already tried this or has a better approach. I would
really appreciate any input. If I succeed, I will share my code; if not, I
will come back anyway :-)
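
To make the XPath idea concrete, here is a minimal sketch of the extraction
step only, using the standard javax.xml.xpath API. It is not Hive code and
does not answer the question of how deserialize should hand back several
rows. The class name XmlRowSketch, the method extractRows, and the title and
author child elements are all made up for illustration; the real contents of
each book node are elided above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Hive or json-serde.
final class XmlRowSketch {

    // Returns one row per node matched by rowXPath. Each row holds one
    // string per column XPath, evaluated relative to the matched node.
    static List<List<String>> extractRows(String xml, String rowXPath,
                                          List<String> columnXPaths) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    rowXPath, doc, XPathConstants.NODESET);

            List<List<String>> rows = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                // Same shape json-serde uses for a row: column values by index.
                List<String> row = new ArrayList<>();
                for (String columnXPath : columnXPaths) {
                    row.add(xpath.evaluate(columnXPath, node));
                }
                rows.add(row);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not extract rows", e);
        }
    }

    public static void main(String[] args) {
        // The title and author elements are assumed for the example only.
        String xml = "<root>"
                + "<book><title>A</title><author>X</author></book>"
                + "<book><title>B</title><author>Y</author></book>"
                + "<book><title>C</title><author>Z</author></book>"
                + "</root>";
        List<List<String>> rows =
                extractRows(xml, "/root/book", Arrays.asList("title", "author"));
        for (List<String> row : rows) {
            System.out.println(row);
        }
    }
}
```

With /root/book as the row expression, the three book nodes become three
lists, each of which has the same shape as the single ArrayList that
json-serde returns for one row.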

Thanks in advance.
-Sumit
