[
https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Colin Patrick McCabe resolved HDFS-1742.
----------------------------------------
Resolution: Duplicate
> Provide hooks / callbacks to execute some code based on events happening in
> HDFS (file / directory creation, opening, closing, etc)
> -----------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-1742
> URL: https://issues.apache.org/jira/browse/HDFS-1742
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: namenode
> Reporter: Mikhail Yakshin
> Labels: features, polling
>
> We're working on a system that runs various Hadoop jobs continuously, based on
> the data that appears in HDFS: for example, we have a job that works on a day's
> worth of data and creates output in {{/output/YYYY/MM/DD}}. For input, it
> should wait for a directory with externally uploaded data,
> {{/input/YYYY/MM/DD}}, to appear, and also wait for the previous day's data to
> appear, i.e. {{/output/YYYY/MM/DD-1}}.
> Obviously, one possible solution is to poll once in a while for the
> files/directories we're waiting for, but that is generally a bad solution. A
> better one is something like a file alteration monitor or [inode activity
> notifiers|http://en.wikipedia.org/wiki/Inotify], such as the ones implemented in
> Linux filesystems.
> The basic idea is that one can specify (inject) some code that will be executed
> on every major event happening in HDFS, such as:
> * File created / opened
> * File closed
> * File deleted
> * Directory created
> * Directory deleted
> I see a simplistic implementation as follows: the NN defines some interfaces that
> implement a callback/hook mechanism, i.e. something like:
> {code}
> interface NameNodeCallback {
> public void onFileCreate(SomeFileInformation f);
> public void onFileClose(SomeFileInformation f);
> public void onFileDelete(SomeFileInformation f);
> ...
> }
> {code}
> It might be possible to create a class that implements this interface and load
> it somehow (for example, using an extra jar in the classpath) into the NameNode's JVM.
> The NameNode would include a configuration option that specifies the names of such
> class(es); the NameNode would then instantiate them and call their methods (in a
> separate thread) whenever such an event happens.
> There would be a couple of ready-made pluggable implementations of such a
> class, most likely distributed as contrib. The default NameNode
> process would stay the same, without any visible differences.
> Hadoop's JobTracker already extensively uses the same paradigm with pluggable
> Scheduler interfaces, such as [Fair
> Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/fairscheduler],
> [Capacity
> Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/capacity-scheduler],
> [Dynamic
> Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/dynamic-scheduler],
> etc. It also uses class(es) that load and run inside the JobTracker's
> context; a few relatively trusted varieties exist, and they're distributed as
> contrib and are purely optional, to be enabled by the cluster admin.
> This would allow systems such as the one I described at the beginning to be
> implemented without polling.
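
The loading and dispatch mechanism described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from HDFS: `FileEvent` stands in for the `SomeFileInformation` placeholder, and `CallbackDispatcher`, `RecordingCallback`, and the comma-separated class-name string (imagined as the value of a NameNode configuration option) are all hypothetical names. The sketch instantiates the configured classes by reflection and delivers events on a separate single thread, so a slow or failing callback does not block the caller.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical event payload; stands in for SomeFileInformation.
final class FileEvent {
    final String path;
    FileEvent(String path) { this.path = path; }
}

// Hypothetical callback interface, mirroring the one in the description.
interface NameNodeCallback {
    void onFileCreate(FileEvent f);
    void onFileClose(FileEvent f);
    void onFileDelete(FileEvent f);
}

// Loads callback classes by name and dispatches events off the caller's thread.
final class CallbackDispatcher {
    private final List<NameNodeCallback> callbacks = new ArrayList<>();
    // A single worker thread keeps events in the order they were submitted.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // classNames: comma-separated class names (assumed to come from configuration).
    CallbackDispatcher(String classNames) throws ReflectiveOperationException {
        for (String name : classNames.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) continue;
            Class<? extends NameNodeCallback> cls =
                Class.forName(trimmed).asSubclass(NameNodeCallback.class);
            callbacks.add(cls.getDeclaredConstructor().newInstance());
        }
    }

    void fileCreated(String path) {
        FileEvent e = new FileEvent(path);
        for (NameNodeCallback cb : callbacks) executor.submit(() -> cb.onFileCreate(e));
    }

    void fileClosed(String path) {
        FileEvent e = new FileEvent(path);
        for (NameNodeCallback cb : callbacks) executor.submit(() -> cb.onFileClose(e));
    }

    void fileDeleted(String path) {
        FileEvent e = new FileEvent(path);
        for (NameNodeCallback cb : callbacks) executor.submit(() -> cb.onFileDelete(e));
    }

    // Stops accepting events and waits for queued callbacks to finish.
    boolean shutdown(long timeoutMs) throws InterruptedException {
        executor.shutdown();
        return executor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    }
}

// Example pluggable implementation: records the events it receives.
final class RecordingCallback implements NameNodeCallback {
    static final List<String> seen = Collections.synchronizedList(new ArrayList<>());
    public void onFileCreate(FileEvent f) { seen.add("create:" + f.path); }
    public void onFileClose(FileEvent f)  { seen.add("close:" + f.path); }
    public void onFileDelete(FileEvent f) { seen.add("delete:" + f.path); }
}
```

A caller would construct `new CallbackDispatcher("RecordingCallback")`, invoke `fileCreated(...)` or `fileClosed(...)` at the points where the corresponding operation completes, and call `shutdown(...)` on exit. A real implementation would also have to decide how to handle callback failures and a growing event queue, which this sketch leaves open.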