Currently, we are running our cluster in EC2 with HDFS stored on the local
(i.e. transient) disk. We don't want to deal with EBS, because it
complicates spinning up additional slaves as needed. We're looking at
moving to a combination of S3 (block) and S3N for data that we care about,
while leaving lower-value data that we can recreate on HDFS.

My thinking is that S3N has significant advantages in how easily data can
be imported from non-Hadoop processes, and in the ease of sampling data,
but I'm not sure how well it actually works in practice. I'm guessing it
wouldn't be able to split files, or that it would need to download the
entire file from S3 multiple times in order to split it? Is the issue of
writes buffering the entire file on the local machine significant? Our jobs
tend to be more CPU-intensive than the usual log-processing type jobs, so
we usually end up with smaller files.

Is it feasible to run S3 (block) and HDFS in parallel? Would I need two
namenodes to do this? Is this a good idea?
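My (unverified) understanding is that the S3 filesystems are client-side and configured per URI scheme, so no second namenode would be involved; something like the following core-site.xml sketch (host, port, and keys are placeholders) would keep HDFS as the default filesystem while letting jobs address S3 paths explicitly:

```
<configuration>
  <!-- HDFS remains the default filesystem for job I/O. -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
  <!-- Credentials for the s3:// (block) scheme. -->
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <!-- Credentials for the s3n:// (native) scheme. -->
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With that in place, a job should be able to take input from an s3:// or s3n:// path and write output to hdfs://, or vice versa, in the same configuration.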

Has anyone tried either of these configurations in EC2?
