Also take a look at http://wiki.apache.org/pig/TuringCompletePig. You can embed Pig in a Python script. This feature has already been checked into trunk and will be available in 0.9.
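
For the "process a whole day, not just the latest file" case, a rough sketch of what an embedded script might look like is below. The directory layout (/logs/YYYY/MM/DD/HH), the /reports output root, the helper names day_glob and run_day, and the trivial Pig script are all placeholders I made up for illustration, so adjust them to your setup. The run_day part only works when the file is launched through Pig (pig script.py), since that is where the org.apache.pig.scripting module comes from.

```python
import datetime

# Hypothetical layout: one HDFS directory per hour, e.g. /logs/2011/03/14/09
LOG_ROOT = "/logs"


def day_glob(day, root=LOG_ROOT):
    """Return a Hadoop glob matching every hourly directory for one day."""
    return "%s/%04d/%02d/%02d/*" % (root, day.year, day.month, day.day)


def run_day(day, out_root="/reports"):
    """Run a placeholder Pig script over one day of logs.

    Only usable when this file is started with `pig script.py`.
    """
    # Imported here so day_glob stays usable outside Pig/Jython.
    from org.apache.pig.scripting import Pig

    # $input and $output are filled in by bind() below.
    script = Pig.compile("""
        raw = LOAD '$input' AS (line:chararray);
        STORE raw INTO '$output';
    """)
    bound = script.bind({
        "input": day_glob(day),
        "output": "%s/%s" % (out_root, day.strftime("%Y-%m-%d")),
    })
    return bound.runSingle().isSuccessful()
```

Something like cron or Oozie could then call run_day once per day (or loop over a month of dates), while the copy into HDFS keeps dropping new hourly directories under the same root.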

Daniel

Alex McLintock wrote:
I'm trying to understand the best way of setting up repeated processing of
continuously generated data - like logs.

I can manually copy files from normal FS to HDFS and kick off pig scripts
but ideally I want something automatic - preferably every hour, or possibly
more often. I also want to process a day or a month's worth of data rather
than just the most recent file.

Is there a best practice way of doing this documented anywhere? I believe
that I should be looking at Flume for transferring files into HDFS and Oozie
for some kind of workflow of pig jobs. Is that right? Any example setups?

Cheers

Alex
