Hi,

looking for some advice from the experts, please. I am new to Hive.

I have data files each consisting of scads of very long records, each record
being an XML doc in its own right.
These XML docs have a complex structure: something like this

<record>
  <sec1>
      <foo>
         <bar id="asd"><stuff>...</stuff><!-- more stuff --></bar>
         <bar ...>...</bar>
      </foo>
  </sec1>
  <sec2>
  ...
</record>

These are generated by another system, aggregated by flume and dumped into
HDFS.

Anyhoo ... I'd like to load up this entire thing into Hive tables.
Logically, the <sec> sections fit reasonably well into individual tables, and
this matches the sorts of reports and data mining we want to do over the
data.

To start with, writing Java code is not really an option. While I speak
several programming languages, I am not fluent in Java or proficient in Java
development, so I plan to do any necessary map/reduce steps using Streaming
and Python.

I have been Googling around and it seems to me that either

a) I load each (huge) XML record into a simple Hive table and then use
some fairly opaque-looking XPath stuff to extract data from it into
subsequent usable tables, or
b) I write multiple mapper jobs - no reduce - in Python to split the file
into sensible sections, each one outputting a certain "type" of data, and
ingest those directly into Hive tables.

Does either of these approaches sound sensible? Is there a better approach I
have not considered?

I sort of tend to favour b): I can see that I'd end up with code which looks
more comprehensible, at least to me.
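
For what it's worth, the sort of thing I have in mind for each of those
mappers is roughly the sketch below (Python, untested). The element names are
just the ones from my toy example above, and I'm assuming each record reaches
stdin as a single line:

#!/usr/bin/env python
# Sketch of one of the option b) streaming mappers: this one only pulls
# out the <sec1> data; sec2, sec3, ... would each get their own mapper.
# Assumes every line on stdin is one complete <record> document.
import sys
import xml.etree.ElementTree as ET

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = ET.fromstring(line)
    except ET.ParseError:
        # Count and skip malformed records via a Hadoop streaming counter.
        sys.stderr.write('reporter:counter:xml,parse_errors,1\n')
        continue
    # One tab-separated row per <bar> under <sec1>/<foo>, matching the
    # columns of the target Hive table; the real names will differ.
    for bar in record.findall('./sec1/foo/bar'):
        print('\t'.join([bar.get('id', ''), (bar.text or '').strip()]))

The output of each such job would then just be tab-separated text sitting in
HDFS, which LOAD DATA INPATH can pull into the matching Hive table.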

Any comments on the proper way to do this?

Much appreciated before painting-self-into-corner starts.

regards
Liam
