Sorry for cross-posting. I realized I sent the following to the hbase list
when it's really more of a Hadoop question.

---------- Forwarded message ----------
From: Patrick Angeles <patrickange...@gmail.com>
Date: Wed, May 27, 2009 at 9:50 AM
Subject: hadoop hardware configuration
To: hbase-u...@hadoop.apache.org


Hey all,

I'm trying to find some up-to-date hardware advice for building a Hadoop
cluster. I've only been able to dig up the following links. Given Moore's
law, these are already out of date:

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200811.mbox/%3ca47c361b-d19b-4a61-8dc1-41d4c0975...@cse.unl.edu%3e
http://wiki.apache.org/hadoop/MachineScaling

We expect to be taking in roughly 50GB of log data per day. In the early
going, we can choose to retain the logs for only a short period after
processing, so we can start with a small cluster (around 6 task nodes).
However, at some point, we will want to retain up to a year's worth of raw
data (~14TB per year).

We will likely be using Hive/Pig and Mahout for cluster analysis.

Given this, I'd like to run the following machine specs by the list to see
what everyone thinks:

2 x Hadoop Master (and Secondary NameNode)

   - 2 x 2.3GHz Quad Core (Low Power Opteron -- 2376 HE @ 55W)
   - 16GB DDR2-800 Registered ECC Memory
   - 4 x 1TB 7200rpm SATA II Drives
   - Hardware RAID controller
   - Redundant Power Supply
   - Approx. 390W power draw (1.9A at 208V)
   - Approx. $4000 per unit

6 x Hadoop Task Nodes

   - 1 x 2.3GHz Quad Core (Opteron 1356)
   - 8GB DDR2-800 Registered ECC Memory
   - 4 x 1TB 7200rpm SATA II Drives
   - No RAID (JBOD)
   - Non-Redundant Power Supply
   - Approx. 210W power draw (1.0A at 208V)
   - Approx. $2000 per unit
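
To put the two spec lists side by side, here is a rough sketch that totals them up. The per-unit figures are copied from the lists above; the dictionary layout and the `totals` helper are just illustrative names, and nothing here is measured.

```python
# Per-unit figures taken from the spec lists above (approximate, not measured).
MASTER = {"count": 2, "cores": 8, "ram_gb": 16, "disk_tb": 4, "watts": 390, "usd": 4000}
TASK = {"count": 6, "cores": 4, "ram_gb": 8, "disk_tb": 4, "watts": 210, "usd": 2000}


def totals(spec):
    """Multiply each per-unit figure by the number of units of that type."""
    return {key: value * spec["count"] for key, value in spec.items() if key != "count"}


master_totals = totals(MASTER)
task_totals = totals(TASK)

# Whole-cluster power draw and purchase cost (masters plus task nodes).
cluster_watts = master_totals["watts"] + task_totals["watts"]
cluster_usd = master_totals["usd"] + task_totals["usd"]

# Processor-to-storage ratio on the task nodes only.
cores_per_tb = task_totals["cores"] / task_totals["disk_tb"]

print("task nodes:", task_totals)
print("cluster watts:", cluster_watts, "cluster cost (USD):", cluster_usd)
print("task-node cores per raw TB:", cores_per_tb)
```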

I had some specific questions regarding this configuration...

   1. Is hardware RAID necessary for the master node?
   2. What is a good processor-to-storage ratio for a task node with 4TB of
   raw storage? (The config above has 1 core per 1TB of raw storage.)
   3. Am I better off using dual quad-cores for a task node, with a higher
   power draw? A dual-quad task node with 16GB RAM and 4TB storage costs
   roughly $3200, but draws almost 2x as much power. The tradeoffs are:
      1. I will get more CPU per dollar and per watt.
      2. I will only be able to fit half as many dual-quad machines into a
      rack.
      3. I will get half the storage capacity per watt.
      4. I will get less I/O throughput overall (fewer spindles per core).
   4. In planning storage capacity, how much spare disk space should I take
   into account for 'scratch'? For now, I'm assuming 1x the input data size.
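
For question 4, a back-of-the-envelope sketch of the storage math may help frame the answers. It assumes HDFS's default replication factor of 3 and my working guess of 1x the input size for scratch; it ignores compression, OS overhead, and any reserved space, so treat the results as rough upper bounds on need. The function names are made up for illustration.

```python
import math

DAILY_INPUT_GB = 50        # from the post: roughly 50GB of logs per day
DISK_PER_NODE_TB = 4.0     # from the task node spec: 4 x 1TB drives
REPLICATION = 3            # HDFS default (dfs.replication); an assumption here
SCRATCH_FACTOR = 1.0       # the post's working guess: scratch = 1x input size


def raw_disk_needed_tb(retained_tb, replication=REPLICATION, scratch=SCRATCH_FACTOR):
    """Raw disk needed to hold retained_tb of data plus scratch space."""
    return retained_tb * replication + retained_tb * scratch


def task_nodes_needed(retained_tb, disk_per_node_tb=DISK_PER_NODE_TB):
    """Whole number of task nodes needed for the raw disk requirement."""
    return math.ceil(raw_disk_needed_tb(retained_tb) / disk_per_node_tb)


def retainable_days(nodes, disk_per_node_tb=DISK_PER_NODE_TB):
    """Days of input that fit on a cluster of the given size (1TB = 1000GB)."""
    usable_tb = nodes * disk_per_node_tb / (REPLICATION + SCRATCH_FACTOR)
    return usable_tb * 1000 / DAILY_INPUT_GB


print("nodes for a year of data (~14TB):", task_nodes_needed(14.0))
print("days retainable on 6 nodes:", retainable_days(6))
```

Under these assumptions the scratch allowance is charged against the whole retained data set, which is probably pessimistic if only a recent window of logs is being processed at any one time.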

Thanks in advance,

- P
