Also take a look at http://wiki.apache.org/pig/TuringCompletePig. You
can embed Pig in a Python script. This feature has already been checked
into trunk and will be available in 0.9.
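
For the part of the question about processing a day's or a month's worth
of data, here is a minimal sketch. It assumes the logs land in one HDFS
directory per hour; the /logs/YYYY/MM/DD/HH layout and the function names
are hypothetical, not anything Pig prescribes. It only builds the list of
input directories in plain Python, which a driver script could then hand
to a Pig script as a parameter.

```python
from datetime import date, datetime, timedelta

# Assumed (hypothetical) layout: one HDFS directory per hour,
# for example /logs/2011/03/15/07
BASE_DIR = "/logs"


def hourly_dirs(day, base_dir=BASE_DIR):
    """Return the 24 hourly directories for the given date."""
    start = datetime(day.year, day.month, day.day)
    return [
        base_dir + (start + timedelta(hours=h)).strftime("/%Y/%m/%d/%H")
        for h in range(24)
    ]


def input_spec(days, base_dir=BASE_DIR):
    """Join the hourly directories of several days into one comma-separated string."""
    return ",".join(d for day in days for d in hourly_dirs(day, base_dir))
```

input_spec([date(2011, 3, 15)]) joins the 24 hourly directories of that
day with commas. The result could be passed with something like
pig -param INPUT=... to a script that does LOAD '$INPUT', assuming the
load function in use accepts a comma-separated list of paths.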
Daniel
Alex McLintock wrote:
I'm trying to understand the best way of setting up repeated processing of
continuously generated data - like logs.
I can manually copy files from the local filesystem to HDFS and kick off Pig scripts
but ideally I want something automatic - preferably every hour, or possibly
more often. I also want to process a day or a month's worth of data rather
than just the most recent file.
Is there a best practice way of doing this documented anywhere? I believe
that I should be looking at Flume for transferring files into HDFS and Oozie
for some kind of workflow of pig jobs. Is that right? Any example setups?
Cheers
Alex