Sorry for cross-posting. I realized I sent the following to the hbase list when it's really more of a Hadoop question.

---------- Forwarded message ----------
From: Patrick Angeles <patrickange...@gmail.com>
Date: Wed, May 27, 2009 at 9:50 AM
Subject: hadoop hardware configuration
To: hbase-u...@hadoop.apache.org

Hey all,

I'm trying to find some up-to-date hardware advice for building a Hadoop cluster. I've only been able to dig up the following links. Given Moore's law, these are already out of date:

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200811.mbox/%3ca47c361b-d19b-4a61-8dc1-41d4c0975...@cse.unl.edu%3e
http://wiki.apache.org/hadoop/MachineScaling

We expect to be taking in roughly 50GB of log data per day. In the early going, we can choose to retain the logs for only a short period after processing, so we can start with a small cluster (around 6 task nodes). However, at some point, we will want to retain up to a year's worth of raw data (~14TB per year).

We will likely be using Hive/Pig and Mahout for cluster analysis.

Given this, I'd like to run the following machine specs by the list to see what everyone thinks:

2 x Hadoop Master (and Secondary NameNode)
 - 2 x 2.3GHz Quad Core (Low Power Opteron -- 2376 HE @ 55W)
 - 16GB DDR2-800 Registered ECC Memory
 - 4 x 1TB 7200rpm SATA II Drives
 - Hardware RAID controller
 - Redundant Power Supply
 - Approx. 390W power draw (1.9 amps at 208V)
 - Approx. $4000 per unit

6 x Hadoop Task Nodes
 - 1 x 2.3GHz Quad Core (Opteron 1356)
 - 8GB DDR2-800 Registered ECC Memory
 - 4 x 1TB 7200rpm SATA II Drives
 - No RAID (JBOD)
 - Non-Redundant Power Supply
 - Approx. 210W power draw (1.0 amps at 208V)
 - Approx. $2000 per unit

I had some specific questions regarding this configuration...

1. Is hardware RAID necessary for the master node?

2. What is a good processor-to-storage ratio for a task node with 4TB of raw storage? (The config above has 1 core per 1TB of raw storage.)

3. Am I better off using dual quads for a task node, with a higher power draw? A dual quad task node with 16GB RAM and 4TB storage costs roughly $3200, but draws almost 2x as much power.
   The tradeoffs are:
   1. I will get more CPU per dollar and per watt.
   2. I will only be able to fit half as many dual quad machines into a rack.
   3. I will get half the storage capacity per watt.
   4. I will get less I/O throughput overall (fewer spindles per core).

4. In planning storage capacity, how much spare disk space should I take into account for 'scratch'? For now, I'm assuming 1x the input data size.

Thanks in advance,

- P
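
For the storage planning in question 4, a rough back-of-the-envelope sketch may help frame the numbers. It takes the ~14TB/year retention figure and the 1x scratch assumption from the post, and assumes HDFS's default replication factor of 3 (`dfs.replication`). The function names and the 75% usable-disk fraction are hypothetical, chosen only for illustration, and should be adjusted to the real cluster.

```python
import math


def raw_storage_needed_tb(data_tb, replication=3, scratch_factor=1.0):
    """Raw disk needed: replicated HDFS data plus scratch space.

    replication=3 is the HDFS default; scratch_factor=1.0 mirrors the
    post's working assumption of 1x the input data size for scratch.
    """
    return data_tb * replication + data_tb * scratch_factor


def nodes_needed(raw_tb, disk_tb_per_node=4.0, usable_fraction=0.75):
    """Task nodes required to hold raw_tb.

    usable_fraction is a hypothetical allowance for OS, logs and
    filesystem overhead; it is not a figure from the post.
    """
    usable_per_node = disk_tb_per_node * usable_fraction
    return math.ceil(raw_tb / usable_per_node)


yearly_tb = 14.0                       # retention target from the post
raw = raw_storage_needed_tb(yearly_tb)
nodes = nodes_needed(raw)
print(f"raw storage: {raw:.0f} TB, task nodes (4TB each): {nodes}")
```

Under these assumptions a full year of retention needs well beyond the initial 6 task nodes, so the per-node storage and power tradeoffs in question 3 matter more as the cluster grows.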