On 2013-01-17 23:11, Jakub Glapa wrote:

Hi Jakub,

my pig script produces a set of files that serve as input for a different process. The script runs periodically, so the number of files keeps growing.
I would like to implement an expiry mechanism where I can remove files that are older than x, or prune once the number of files reaches some threshold.

I know a crude way: in a bash script you can call "hadoop fs -ls ...",
parse the output, and then execute "rmr" on the matching entries.

Is there a "human" way to do this from a Python script? Pig.fs()?

I had the same issue as you a few months ago. The public Pig scripting API only exposes an FsShell object, which is way too limited to do any real work. However, it is possible to get access to the Hadoop FileSystem API from a Python script:


from org.apache.pig.scripting import ScriptPigContext
from org.apache.pig.backend.hadoop.datastorage import ConfigurationUtil
from org.apache.hadoop.fs import FileSystem

def get_fs():
    """Return an org.apache.hadoop.fs.FileSystem instance."""
    # The Pig scripting API exports an FsShell but not a FileSystem object,
    # so build one from the properties of the current Pig context.
    ctx   = ScriptPigContext.get()
    props = ctx.getPigContext().getProperties()
    conf  = ConfigurationUtil.toConfiguration(props)
    fs    = FileSystem.get(conf)
    return fs


Once you have a FileSystem object you can do whatever you want using the standard Hadoop API.
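For example, the expiry mechanism you describe could be sketched like this (Jython, running inside an embedded Pig script). The directory path, the one-week age limit, and the function name are just assumptions for illustration, not part of the original question:

```python
from org.apache.hadoop.fs import Path
import time

def expire_old_files(fs, dir_path, max_age_ms=7 * 24 * 3600 * 1000):
    """Delete plain files under dir_path whose modification time is
    older than max_age_ms (default: one week). Hypothetical helper."""
    now_ms = long(time.time() * 1000)
    for status in fs.listStatus(Path(dir_path)):
        if status.isDir():
            continue  # leave subdirectories alone
        if now_ms - status.getModificationTime() > max_age_ms:
            fs.delete(status.getPath(), False)  # False = non-recursive

fs = get_fs()
expire_old_files(fs, "/user/me/pig-output")  # path is an example
```

For the count-based threshold you could similarly sort the FileStatus entries by getModificationTime() and delete the oldest ones until the count drops below your limit.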


Hope this helps.

-- Clément
