Hi,
A little digging led me to clojure-hadoop.filesystem, which had most of the context info I was interested in.
Sunil.
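For anyone finding this thread later: the attempt id is carried in the job configuration, so a small helper is all that's needed. This is a hedged sketch, not documented clojure-hadoop API; how you get hold of the underlying JobConf from inside a clojure-hadoop map/reduce fn (clojure-hadoop.filesystem has related helpers) is an assumption here.

```clojure
;; Sketch: reading the task-attempt id from the job configuration.
;; Assumes an org.apache.hadoop.mapred.JobConf is reachable from your
;; map/reduce fn -- that accessor path is an assumption, not part of
;; the documented clojure-hadoop API.
(import '[org.apache.hadoop.mapred JobConf])

(defn task-attempt-id
  "Return the id of the current task attempt, unique per attempt
   (of the form task_200709221812_0001_m_000000_0 in this era of Hadoop)."
  [^JobConf conf]
  (.get conf "mapred.task.id"))

;; A unique-per-attempt side-file name, instead of appending the
;; current time to the stringified key:
(defn side-file-name
  [^JobConf conf key]
  (str (task-attempt-id conf) "-" key))
```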
On Wed, May 9, 2012 at 2:02 PM, Sunil S Nandihalli <sunil.nandiha...@gmail.com> wrote:
> Hi Everybody,
> I have been using clojure-hadoop without knowing all the nitty-gritties of Hadoop, which is a good and a bad thing. It abstracts everything except the maps and reduces, which deal directly with Clojure data structures without worrying about serialization or deserialization. So, very nice. I want to get access to the task-id; would somebody have a clue as to how I can do this from within map or reduce functions? I need it to debug failed tasks. Right now I just append the current time to the stringified form of the key to obtain a unique file name during my reduce job, but it would be useful to find out how to get the actual task-id from inside the map task.
> Thanks,
> Sunil.
>
> P.S. Just to put things in context, the following is an extract from http://wiki.apache.org/hadoop/FAQ
>
> 2.4. Can I create/write-to HDFS files directly from map/reduce tasks?
>
> Yes. (Clearly, you want this since you need to create/write-to files other than the output file written out by OutputCollector <http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/OutputCollector.html>.)
>
> Caveats:
>
> ${mapred.output.dir} is the eventual output directory for the job (JobConf.setOutputPath <http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)> / JobConf.getOutputPath <http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#getOutputPath()>).
>
> ${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0); a TIP is a bunch of ${taskid}s (e.g. task_200709221812_0001_m_000000).
>
> With *speculative execution* *on*, one could face issues with two instances of the same TIP (running simultaneously) trying to open/write-to the same file (path) on HDFS.
> Hence the app-writer will have to pick unique names (e.g. using the complete taskid, i.e. task_200709221812_0001_m_000000_0) per task-attempt, not just per TIP. (Clearly, this needs to be done even if the user doesn't create/write-to files directly via reduce tasks.)
>
> To get around this, the framework helps the application-writer out by maintaining a special *${mapred.output.dir}/_${taskid}* sub-dir for each reduce task-attempt on HDFS, where the output of the reduce task-attempt goes. On successful completion of the task-attempt, the files in ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.
>
> The application-writer can take advantage of this by creating any side-files required in ${mapred.output.dir} during execution of his reduce-task, and the framework will move them out similarly - thus you don't have to pick unique paths per task-attempt.
>
> Fine-print: the value of ${mapred.output.dir} during execution of a particular *reduce* task-attempt is actually ${mapred.output.dir}/_${taskid}, not the value set by JobConf.setOutputPath <http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)>. *So, just create any HDFS files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.*
>
> For *map* task attempts, the automatic substitution of ${mapred.output.dir}/_${taskid} for ${mapred.output.dir} does not take place. You can still access the map task attempt directory, though, by using FileOutputFormat <http://wiki.apache.org/hadoop/FileOutputFormat>.getWorkOutputPath(TaskInputOutputContext <http://wiki.apache.org/hadoop/TaskInputOutputContext>). Files created there will be dealt with as described above.
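The getWorkOutputPath mechanism described above can be sketched from Clojure against the old mapred API roughly as follows. As above, obtaining the JobConf from within a clojure-hadoop task is an assumption, and this is a sketch rather than a tested recipe:

```clojure
;; Sketch: writing a side-file into this attempt's work directory
;; (${mapred.output.dir}/_${taskid}), so speculative attempts cannot
;; collide; the framework promotes the file on success and discards
;; the sub-dir on failure.
(import '[org.apache.hadoop.mapred FileOutputFormat JobConf]
        '[org.apache.hadoop.fs Path])

(defn write-side-file
  "Write `content` under this task-attempt's work dir as `fname`."
  [^JobConf conf ^String fname ^String content]
  (let [work-dir (FileOutputFormat/getWorkOutputPath conf)
        fs       (.getFileSystem work-dir conf)
        out      (.create fs (Path. work-dir fname))]
    (try
      (.writeBytes out content)
      (finally (.close out)))))
```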
> The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces), since the output of the map, in that case, goes directly to HDFS.

-- 
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en