Hi, looking for some advice from the experts, please. I am new to Hive.
I have data files each consisting of scads of very long records, each record being an XML doc in its own right. These XML docs have a complex structure, something like this:

    <record>
      <sec1>
        <foo>
          <bar id="asd"><stuff><MORE STUFF></bar>
          <bar ... >
        </foo>
      </sec1>
      <sec2>
      ...
    </record>

These are generated by another system, aggregated by Flume and dumped into HDFS.

Anyhoo ... I'd like to load this entire thing into Hive tables. Logically, the <sec> sections fit reasonably well into individual tables, and this matches the sorts of reports and data mining we want to do over the data.

To start with, writing Java code is not really on. While I speak several programming languages I am not fluent in Java or proficient in Java development, so I plan to do any map/reduce steps necessary using Streaming and Python.

I have been Googling around and it seems to me that either:

a) I load each (huge) XML record into a simple Hive table and then use some fairly opaque-looking XPath stuff to extract data from it into subsequent usable tables, or

b) I write multiple mapper jobs - no reduce - in Python to split the file into sensible sections, each one outputting a certain "type" of data, and ingest those directly into Hive tables.

Does either of these approaches sound sensible? Is there a better approach I have not considered? I tend to favour b): I can see that I'd end up with code which looks more comprehensible, at least to me.

Any comments on the proper way to do this? Much appreciated before painting-self-into-corner starts.

regards
Liam
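
P.S. To make b) a bit more concrete, this is roughly the shape of mapper I had in mind - one streaming job per section "type", no reducer, emitting tab-separated rows for a matching Hive table. The element and field names (sec1, foo, bar, stuff) are just placeholders taken from my made-up example above, not our real schema:

#!/usr/bin/env python
# Sketch of one streaming mapper (no reducer), one job per section "type".
# Assumes each input line is one complete <record> XML doc; the element
# and attribute names below are placeholders from the example above.
import sys
import xml.etree.ElementTree as ET

SECTION = "sec1"   # each mapper job would target a different section

def main():
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        try:
            record = ET.fromstring(line)
        except ET.ParseError:
            continue   # skip malformed records for now; count them later
        sec = record.find(SECTION)
        if sec is None:
            continue
        # flatten each <bar> under <foo> into one tab-separated row
        for bar in sec.findall("./foo/bar"):
            row = [bar.get("id", ""), (bar.findtext("stuff") or "").strip()]
            print("\t".join(row))

if __name__ == "__main__":
    main()

The idea would then be to point LOAD DATA INPATH (or an external table) at each job's output directory, with the Hive table columns matching the tab-separated fields.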