Currently, we are running our cluster on EC2 with HDFS stored on the local (i.e. transient) disks. We don't want to deal with EBS because it complicates spinning up additional slaves as needed. We're looking at moving to a combination of s3 (block) or s3n for the data we care about, while leaving lower-value data that we can recreate on HDFS.
My thinking is that s3n has significant advantages in terms of how easy it is to import data from non-Hadoop processes, and also in the ease of sampling data, but I'm not sure how well it actually works in practice. I'm guessing that it wouldn't be able to split files, or that it would need to download the entire file from S3 multiple times in order to split it? Is the issue of writes buffering the entire file on the local disk significant? Our jobs tend to be more CPU-intensive than the usual log-processing type jobs, so we usually end up with smaller files.

Is it feasible to run s3 (block) and HDFS in parallel? Would I need two namenodes to do that? Is this a good idea? Has anyone tried either of these configurations in EC2?
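In case it helps clarify the question, this is the kind of setup I have in mind: HDFS stays the default filesystem, with S3 credentials configured so that jobs can also address `s3n://` (or `s3://`) paths directly. This is just a sketch — the hostname and bucket name are made up, and the property names are the ones I believe Hadoop uses for the native S3 filesystem:

```xml
<!-- core-site.xml (sketch): HDFS as the default FS, with s3n usable side by side -->
<configuration>
  <!-- Keep HDFS as the default filesystem -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode:9000</value>
  </property>
  <!-- Credentials for the s3n native filesystem (placeholders) -->
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With something like that in place, I'd expect to be able to mix schemes in job inputs/outputs and copy between them, e.g. `hadoop fs -ls s3n://my-bucket/data/` or `hadoop distcp hdfs:///data s3n://my-bucket/data` — but that's exactly the part I'm hoping someone with production experience can confirm.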
