Well, if there is no known solution, maybe extend RegexSerDe ...

On Tue, Mar 29, 2011 at 5:36 PM, Michael Jiang <it.mjji...@gmail.com> wrote:
> hey guys,
>
> I want to extract some information from an apache web log. It does more
> than just extracting fixed fields that appear at certain location such as
> host and request. One task is to extract multiple key/value pairs in request
> string. For example, in request string, I have parameters like "name.0",
> "name.1", ..., "name.n". Here "n" can be any valid non-negative integer.
> They may appear anywhere in the request. It's not just to extract each
> key/value pair. More than that :) I want to clone the entry line "n" times
> if it contains "name.i" n times, each "ith" cloned entry has an extra field
> with the value of "name.i".
>
> I can load log and extract request string first into a table. Then write a
> script to do streaming to extract "name" key/value and write to stdout "n"
> cloned entries. But is there a one step solution to extract them all from
> log file and generate multiple entries as well? I know
> "org.apache.hadoop.hive.contrib.serde2.RegexSerDe" can load and extract
> apache web log. Is it possible to use it for this case? Thanks!
>
> --mj
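
For reference, the two-step streaming approach described in the quoted message could look roughly like the sketch below. It is only an illustration under several assumptions that are not in the original mail: the input rows are tab-separated with the request string as the last column, the parameters appear in the query string as "name.<i>=<value>", and a line without any "name.i" parameter produces no output. The function name explode and the regular expression are hypothetical choices.

```python
import re
import sys
from urllib.parse import unquote_plus

# Assumption: parameters look like "?name.0=foo&name.1=bar" inside the request.
# The value runs up to the next "&" or whitespace.
NAME_PARAM = re.compile(r"(?:^|[?&])name\.(\d+)=([^&\s]*)")


def explode(line):
    """Return one cloned row per "name.i" parameter found in the request.

    Assumption: columns are tab-separated and the request is the last column.
    Each cloned row is the original row plus one extra column holding the
    URL-decoded value of that "name.i" parameter.
    """
    fields = line.rstrip("\n").split("\t")
    request = fields[-1]
    rows = []
    for _index, value in NAME_PARAM.findall(request):
        rows.append("\t".join(fields + [unquote_plus(value)]))
    return rows


if __name__ == "__main__":
    # Streaming use: read rows from stdin, write cloned rows to stdout.
    for line in sys.stdin:
        for row in explode(line):
            print(row)
```

If the script is saved as, say, explode_names.py (a hypothetical file name), it could be registered with ADD FILE and called from Hive through SELECT TRANSFORM(host, request) USING 'python explode_names.py' AS (host, request, name_value), which passes the selected columns to the script as tab-separated lines. This is still two steps (load with RegexSerDe, then transform), not the one-step solution asked about.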