Whoops. I replied to the wrong list as well. Sorry for the cross-post. Alex
On Wed, May 27, 2009 at 12:39 PM, Alex Loddengaard <a...@cloudera.com> wrote:

> Answers in-line.
>
> Alex
>
> On Wed, May 27, 2009 at 6:50 AM, Patrick Angeles <patrickange...@gmail.com> wrote:
>
>> Hey all,
>>
>> I'm trying to find some up-to-date hardware advice for building a Hadoop
>> cluster. I've only been able to dig up the following links. Given Moore's
>> law, these are already out of date:
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200811.mbox/%3ca47c361b-d19b-4a61-8dc1-41d4c0975...@cse.unl.edu%3e
>> http://wiki.apache.org/hadoop/MachineScaling
>>
>> We expect to be taking in roughly 50GB of log data per day. In the early
>> going, we can choose to retain the logs for only a short period after
>> processing, so we can start with a small cluster (around 6 task nodes).
>> However, at some point we will want to retain up to a year's worth of raw
>> data (~14TB per year).
>>
>> We will likely be using Hive/Pig and Mahout for cluster analysis.
>>
>> Given this, I'd like to run the following machine specs by everyone to
>> see what you think:
>>
>> 2 x Hadoop Master (and Secondary NameNode)
>>
>> - 2 x 2.3GHz Quad Core (Low Power Opteron -- 2376 HE @ 55W)
>> - 16GB DDR2-800 Registered ECC Memory
>> - 4 x 1TB 7200rpm SATA II Drives
>> - Hardware RAID controller
>> - Redundant Power Supply
>> - Approx. 390W power draw (1.9A @ 208V)
>> - Approx. $4000 per unit
>>
>> 6 x Hadoop Task Nodes
>>
>> - 1 x 2.3GHz Quad Core (Opteron 1356)
>> - 8GB DDR2-800 Registered ECC Memory
>> - 4 x 1TB 7200rpm SATA II Drives
>> - No RAID (JBOD)
>> - Non-Redundant Power Supply
>> - Approx. 210W power draw (1.0A @ 208V)
>> - Approx. $2000 per unit
>
> If you can swing it, I'd recommend going with eight cores and 16GB of
> memory, unless you expect your jobs to be I/O bound. Really, the ratio of
> disks to CPU+RAM should be matched to the types of jobs you're running.
> That said, doubling your cores and memory gives you a little more
> breathing room, and is relatively cheap in the grand scheme of things.
>
>> I had some specific questions regarding this configuration...
>>
>> 1. Is hardware RAID necessary for the master node?
>
> No. Just make sure you configure Hadoop to write NN metadata to each
> disk, including an NFS mount. The NN will write in parallel to all of its
> configured directories.
>
>> 2. What is a good processor-to-storage ratio for a task node with 4TB of
>> raw storage? (The config above has 1 core per 1TB of raw storage.)
>
> Again, I would recommend doubling your cores and memory.
>
>> 3. Am I better off using dual quads for a task node, despite the higher
>> power draw? A dual-quad task node with 16GB RAM and 4TB storage costs
>> roughly $3200, but draws almost 2x as much power. The tradeoffs are:
>>    1. I will get more CPU per dollar and per watt.
>>    2. I will only be able to fit half as many dual-quad machines into a
>>       rack.
>>    3. I will get half the storage capacity per watt.
>>    4. I will get less I/O throughput overall (fewer spindles per core).
>>
>> 4. In planning storage capacity, how much spare disk space should I set
>> aside for 'scratch'? For now, I'm assuming 1x the input data size.
>
> What do you define as scratch? Do you mean mapper intermediate data? If
> that's what you mean, then you should assume a fair amount. If you
> install your OS on one partition and devote all other partitions and
> disks to HDFS, Hadoop will do something reasonable with regard to DFS
> data, MR temporary data, etc.
>
>> Thanks in advance,
>>
>> - P
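For reference, the setup Alex describes (NN metadata mirrored to each disk plus an NFS mount, and MR scratch spread across the data disks) would look roughly like this in hadoop-site.xml. The paths here are illustrative, not prescriptive; the property names are the real ones for Hadoop of this era:

<property>
  <name>dfs.name.dir</name>
  <value>/disk1/dfs/name,/disk2/dfs/name,/mnt/nfs/dfs/name</value>
  <!-- Comma-separated list; the NameNode writes its metadata
       (fsimage and edits log) to every directory listed. -->
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local,/disk4/mapred/local</value>
  <!-- Comma-separated list; MapReduce spreads intermediate
       ("scratch") data across all listed directories. -->
</property>

Losing one dfs.name.dir directory is survivable as long as at least one good copy remains, which is why the NFS mount is worth the small write overhead.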
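A quick sanity check on the storage numbers in the thread. Note that at a flat 50GB/day for 365 days the raw figure works out closer to ~18TB than the ~14TB quoted, so the smaller number presumably assumes some compression or trimming (my assumption). The 25% scratch reserve below is also an assumption, not a Hadoop default; the 3x replication factor is Hadoop's default dfs.replication:

```python
# Back-of-the-envelope capacity sketch for the numbers in this thread.
# Assumed (not from the thread): 25% of disk reserved for MR temp + OS.

DAILY_INGEST_GB = 50
RETENTION_DAYS = 365
HDFS_REPLICATION = 3      # Hadoop's default dfs.replication
SCRATCH_FRACTION = 0.25   # assumed reserve for MR intermediate data + OS

raw_data_tb = DAILY_INGEST_GB * RETENTION_DAYS / 1000.0
hdfs_need_tb = raw_data_tb * HDFS_REPLICATION
cluster_need_tb = hdfs_need_tb / (1 - SCRATCH_FRACTION)
nodes_4tb = cluster_need_tb / 4.0  # each proposed task node has 4 x 1TB

print(f"raw data per year:        {raw_data_tb:.2f} TB")
print(f"with 3x replication:      {hdfs_need_tb:.2f} TB")
print(f"cluster capacity needed:  {cluster_need_tb:.2f} TB")
print(f"4TB task nodes required:  {nodes_4tb:.1f}")
```

So the proposed 6 nodes (24TB raw) cover the first few months comfortably, but a full year of retention at 3x replication needs roughly a 70TB-class cluster.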