Currently, we are running our cluster on EC2 with HDFS stored on the local (i.e. transient) disks. We don't want to deal with EBS because it complicates spinning up additional slaves as needed. We're looking at moving to a combination of s3 (block) or s3n for the data we care about, while leaving lower-value data that we can recreate on HDFS.
My thinking is that s3n has significant advantages in terms of how easy it is to import data from non-Hadoop processes, and also in the ease of sampling data, but I'm not sure how well it actually works in practice. I'm guessing that it wouldn't be able to split files, or that it would need to download the entire file from S3 multiple times in order to split it? Is the issue of writes buffering the entire file on the local disk significant? Our jobs tend to be more CPU-intensive than the usual log-processing type jobs, so we usually end up with smaller files.

Is it feasible to run s3 (block) and HDFS in parallel? Would I need two namenodes to do that? Is this a good idea? Has anyone tried either of these configurations in EC2?
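In case it helps clarify the question, this is the kind of setup I have in mind: HDFS stays the default filesystem, with S3 credentials configured so that jobs can also address `s3n://` (or `s3://`) paths directly. This is just a sketch — the hostname and bucket name are made up, and the property names are the ones I believe Hadoop uses for the native S3 filesystem:

```xml
<!-- core-site.xml (sketch): HDFS as the default FS, with s3n usable side by side -->
<configuration>
  <!-- Keep HDFS as the default filesystem -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode:9000</value>
  </property>
  <!-- Credentials for the s3n native filesystem (placeholders) -->
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With something like that in place, I'd expect to be able to mix schemes in job inputs/outputs and copy between them, e.g. `hadoop fs -ls s3n://my-bucket/data/` or `hadoop distcp hdfs:///data s3n://my-bucket/data` — but that's exactly the part I'm hoping someone with production experience can confirm.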
