If you pack your images into sequence files, as the value items, the cluster
will automatically do a decent job of ensuring that the input splits made
from the sequence files are local to the map task.

We did this in production at a previous job and it worked very well for us.
You may as well turn off sequence file compression unless you are passing raw
(uncompressed) images, or have substantial amounts of compressible metadata;
already-compressed formats like JPEG gain nothing from it.
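As a minimal sketch of the packing step (paths and the class name here are
hypothetical; this assumes the classic org.apache.hadoop.io.SequenceFile API
of that era, with filename keys and raw image bytes as values):

```java
import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]); // e.g. a path like /user/me/images.seq
    // Compression disabled, since JPEG/PNG bytes won't compress further
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.NONE);
    try {
      for (int i = 1; i < args.length; i++) {
        File img = new File(args[i]);
        byte[] buf = new byte[(int) img.length()];
        FileInputStream in = new FileInputStream(img);
        try {
          in.read(buf); // slurp the whole image into memory
        } finally {
          in.close();
        }
        // key = file name, value = raw image bytes
        writer.append(new Text(img.getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}
```

Run it once, offline, over your local image directory; the resulting .seq
file then splits cleanly across mappers with normal HDFS block locality.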

Do remember to drop the images from the output records passed to the reduce
phase, if you need a reduce phase at all, or the shuffle and reduce will be
expensive.
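Concretely, the mapper should emit only a small summary per image, never the
image bytes themselves. A sketch against the old mapred API (analyzeImage is
a hypothetical stand-in for your own processing):

```java
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ImageStatsMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, Text> {

  public void map(Text fileName, BytesWritable image,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // analyzeImage is a placeholder for whatever analysis you do
    String stats = analyzeImage(image.getBytes(), image.getLength());
    // Emit only the small summary; the image bytes never hit the shuffle
    output.collect(fileName, new Text(stats));
  }

  private String analyzeImage(byte[] data, int length) {
    return "bytes=" + length; // placeholder analysis
  }
}
```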

On Sun, Apr 12, 2009 at 11:13 PM, Sharad Agarwal <[email protected]> wrote:

>
> Sameer Tilak wrote:
> >
> > Hi everyone,
> > I would like to use Hadoop for analyzing tens of thousands of images.
> > Ideally each mapper gets few hundred images to process and I'll have few
> > hundred mappers. However, I want the mapper function to run on the
> > machine where its images are stored. How can I achieve that? With text
> > data, creating splits and exploiting locality seems easy.
> You can store the image files in HDFS. However, storing too many small
> files in HDFS will result in scalability and performance issues, so you
> can combine multiple image files into a sequence file. There are some
> other approaches also discussed here:
> http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
