Hi,
A little digging led me to clojure-hadoop.filesystem, which had most of the context info I was interested in.
Sunil.
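For anyone finding this thread later: the attempt id is carried in the job configuration, so a small helper is all that's needed. This is a hedged sketch, not documented clojure-hadoop API; how you get hold of the underlying JobConf from inside a clojure-hadoop map/reduce fn (clojure-hadoop.filesystem has related helpers) is an assumption here.

```clojure
;; Sketch: reading the task-attempt id from the job configuration.
;; Assumes an org.apache.hadoop.mapred.JobConf is reachable from your
;; map/reduce fn -- that accessor path is an assumption, not part of
;; the documented clojure-hadoop API.
(import '[org.apache.hadoop.mapred JobConf])

(defn task-attempt-id
  "Return the id of the current task attempt, unique per attempt
   (of the form task_200709221812_0001_m_000000_0 in this era of Hadoop)."
  [^JobConf conf]
  (.get conf "mapred.task.id"))

;; A unique-per-attempt side-file name, instead of appending the
;; current time to the stringified key:
(defn side-file-name
  [^JobConf conf key]
  (str (task-attempt-id conf) "-" key))
```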
On Wed, May 9, 2012 at 2:02 PM, Sunil S Nandihalli <sunil.nandiha...@gmail.com> wrote:
> Hi Everybody,
> I have been using clojure-hadoop without knowing all the nitty-gritties of Hadoop, which is a good and a bad thing. It abstracts everything except the maps and reduces, which deal directly with Clojure data structures without worrying about serialization or deserialization. So, very nice. I want to get access to the task-id; would somebody have a clue as to how I can do this from within map or reduce functions? I need it to debug failed tasks. Right now I just append the current time to the stringified form of the key to obtain a unique file name during my reduce job, but it would be useful to find out how to get the actual task-id from inside the map task.
> Thanks,
> Sunil.
>
> P.S. Just to put things in context, the following is an extract from http://wiki.apache.org/hadoop/FAQ
>
> 2.4. Can I create/write-to HDFS files directly from map/reduce tasks?
>
> Yes. (Clearly, you want this since you need to create/write-to files other than the output file written out by OutputCollector <http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/OutputCollector.html>.)
>
> Caveats:
>
> ${mapred.output.dir} is the eventual output directory for the job (JobConf.setOutputPath <http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)> / JobConf.getOutputPath <http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#getOutputPath()>).
>
> ${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0); a TIP is a bunch of ${taskid}s (e.g. task_200709221812_0001_m_000000).
>
> With *speculative execution* *on*, one could face issues with two instances of the same TIP (running simultaneously) trying to open/write-to the same file (path) on HDFS.
> Hence the app-writer will have to pick unique names (e.g. using the complete taskid, i.e. task_200709221812_0001_m_000000_0) per task-attempt, not just per TIP. (Clearly, this needs to be done even if the user doesn't create/write-to files directly via reduce tasks.)
>
> To get around this, the framework helps the application-writer out by maintaining a special *${mapred.output.dir}/_${taskid}* sub-dir for each reduce task-attempt on HDFS, where the output of the reduce task-attempt goes. On successful completion of the task-attempt, the files in ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.
>
> The application-writer can take advantage of this by creating any side-files required in ${mapred.output.dir} during execution of his reduce-task, and the framework will move them out similarly - thus you don't have to pick unique paths per task-attempt.
>
> Fine-print: the value of ${mapred.output.dir} during execution of a particular *reduce* task-attempt is actually ${mapred.output.dir}/_${taskid}, not the value set by JobConf.setOutputPath <http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)>. *So, just create any HDFS files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.*
>
> For *map* task attempts, the automatic substitution of ${mapred.output.dir}/_${taskid} for ${mapred.output.dir} does not take place. You can still access the map task attempt directory, though, by using FileOutputFormat <http://wiki.apache.org/hadoop/FileOutputFormat>.getWorkOutputPath(TaskInputOutputContext <http://wiki.apache.org/hadoop/TaskInputOutputContext>). Files created there will be dealt with as described above.
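The getWorkOutputPath mechanism described above can be sketched from Clojure against the old mapred API roughly as follows. As above, obtaining the JobConf from within a clojure-hadoop task is an assumption, and this is a sketch rather than a tested recipe:

```clojure
;; Sketch: writing a side-file into this attempt's work directory
;; (${mapred.output.dir}/_${taskid}), so speculative attempts cannot
;; collide; the framework promotes the file on success and discards
;; the sub-dir on failure.
(import '[org.apache.hadoop.mapred FileOutputFormat JobConf]
        '[org.apache.hadoop.fs Path])

(defn write-side-file
  "Write `content` under this task-attempt's work dir as `fname`."
  [^JobConf conf ^String fname ^String content]
  (let [work-dir (FileOutputFormat/getWorkOutputPath conf)
        fs       (.getFileSystem work-dir conf)
        out      (.create fs (Path. work-dir fname))]
    (try
      (.writeBytes out content)
      (finally (.close out)))))
```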
> The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces), since the output of the map, in that case, goes directly to HDFS.

-- 
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en