Hey Tim,

Why don't you put the PNGs in a SequenceFile in the output of your reduce task? You could then have a post-processing step that unpacks the PNGs and pushes them to S3. (If my numbers are correct, you're looking at around 3TB of data; is that right? With that much, you might want a separate map task to unpack all the files in parallel ... it really depends on the throughput you get to Amazon.)
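For the back-of-envelope: a couple of billion tiles at ~1.5KB each is on the order of 3TB. The pack-then-unpack pattern could be sketched roughly like this; a real SequenceFile stores key/value records via Hadoop's own API, so this is just a Hadoop-free Python stand-in using a length-prefixed container to show the idea (all names and tile keys here are made up):

```python
import io
import struct

def pack(records, out):
    # Write (key, value) records as 4-byte big-endian lengths followed
    # by the bytes -- one big file of many small blobs, which is the
    # point of using a SequenceFile instead of billions of HDFS files.
    for key, blob in records:
        k = key.encode("utf-8")
        out.write(struct.pack(">II", len(k), len(blob)))
        out.write(k)
        out.write(blob)

def unpack(inp):
    # Iterate the records back out. In the post-processing (or map)
    # step, each blob would be PUT to S3 under its tile key rather
    # than collected in memory as it is here.
    while True:
        header = inp.read(8)
        if not header:
            return
        klen, vlen = struct.unpack(">II", header)
        yield inp.read(klen).decode("utf-8"), inp.read(vlen)

# Pack two fake ~2KB "tiles", then read them back.
tiles = [("tiles/0/0/0.png", b"\x89PNG" + b"\x00" * 2044),
         ("tiles/0/0/1.png", b"\x89PNG" + b"\x00" * 2044)]
buf = io.BytesIO()
pack(tiles, buf)
buf.seek(0)
assert list(unpack(buf)) == tiles
```

Because each record carries its own key, the unpacking step parallelizes trivially: split the containers across map tasks and have each task stream its records straight to S3.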

Brian

On Apr 14, 2009, at 4:35 AM, tim robertson wrote:

Hi all,

I am currently processing a lot of raw CSV data and producing a
summary text file which I load into MySQL.  On top of this I have a
PHP application to generate tiles for Google Maps (sample tile:
http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
Here is a (dev server) example of the final map client:
http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800 - the
dynamic grids as you zoom are all pre-calculated.

I am considering pre-generating all my tiles (PNG) and storing them
in S3 with CloudFront, for better throughput since maps generate huge
request volumes.  There will be billions of PNGs produced, each
1-3KB.

Could someone please recommend the best place to generate the PNGs and
when to push them to S3 in a MR system?
If I do the PNG generation and upload to S3 in the reduce, won't the
same task running on multiple machines compete with itself?  Should
I generate the PNGs to a local directory and then, on task success,
push the lot up?  I am assuming billions of 1-3KB files on HDFS
would not be a good idea.

I will use EC2 for the MR for the time being, but this will be moved
to a local cluster still pushing to S3...

Cheers,

Tim
