Whoops.  I replied to the wrong list as well.  Sorry for the cross-post.

Alex

On Wed, May 27, 2009 at 12:39 PM, Alex Loddengaard <a...@cloudera.com> wrote:

> Answers in-line.
>
> Alex
>
> On Wed, May 27, 2009 at 6:50 AM, Patrick Angeles <patrickange...@gmail.com> wrote:
>
>> Hey all,
>>
>> I'm trying to find some up-to-date hardware advice for building a Hadoop
>> cluster. I've only been able to dig up the following links. Given Moore's
>> law, these are already out of date:
>>
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200811.mbox/%3ca47c361b-d19b-4a61-8dc1-41d4c0975...@cse.unl.edu%3e
>> http://wiki.apache.org/hadoop/MachineScaling
>>
>> We expect to be taking in roughly 50GB of log data per day. In the early
>> going, we can choose to retain the logs for only a short period after
>> processing, so we can start with a small cluster (around 6 task nodes).
>> However, at some point, we will want to retain up to a year's worth of raw
>> data (~18TB per year).
>>
>> We will likely be using Hive/Pig and Mahout for cluster analysis.
>>
>> Given this, I'd like to run the following machine specs by the list to see
>> what everyone thinks:
>>
>> 2 x Hadoop Master (and Secondary NameNode)
>>
>>   - 2 x 2.3GHz Quad Core (Low Power Opteron -- 2376 HE @ 55W)
>>   - 16GB DDR2-800 Registered ECC Memory
>>   - 4 x 1TB 7200rpm SATA II Drives
>>   - Hardware RAID controller
>>   - Redundant Power Supply
>>   - Approx. 390W power draw (1.9A @ 208V)
>>   - Approx. $4000 per unit
>>
>> 6 x Hadoop Task Nodes
>>
>>   - 1 x 2.3GHz Quad Core (Opteron 1356)
>>   - 8GB DDR2-800 Registered ECC Memory
>>   - 4 x 1TB 7200rpm SATA II Drives
>>   - No RAID (JBOD)
>>   - Non-Redundant Power Supply
>>   - Approx. 210W power draw (1.0A @ 208V)
>>   - Approx. $2000 per unit
>
> If you can swing it, I'd recommend going with eight cores and 16 GB of
> memory, unless you expect your jobs to be I/O bound.  Really the ratio of
> disks to CPU+RAM should be matched to the types of jobs you're running.
> That said, doubling your cores and memory gives you a little more breathing
> room, and is relatively cheap in the grand scheme of things.
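>
> For instance, the main knob for matching CPU to workload is the task slot
> count per TaskTracker.  Here's a minimal sketch for an eight-core node in
> hadoop-site.xml -- the slot counts are just an example, tune them to your
> jobs:
>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>6</value>
>   </property>
>   <property>
>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>     <value>2</value>
>   </property>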
>
>>
>>
>> I had some specific questions regarding this configuration...
>>
>>   1. Is hardware RAID necessary for the master node?
>
> No.  Just make sure you configure Hadoop to write NN metadata to each disk,
> including an NFS mount.  The NN will write in parallel to all its configured
> directories.
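>
> For example, a minimal sketch in hadoop-site.xml (hdfs-site.xml on 0.20) --
> the paths here are placeholders for your own mounts:
>
>   <property>
>     <name>dfs.name.dir</name>
>     <value>/disk1/dfs/name,/disk2/dfs/name,/mnt/nfs/dfs/name</value>
>   </property>
>
> The NN keeps a full copy of the image and edit log in every directory on
> that list, so losing a single disk doesn't cost you the metadata.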
>
>>
>>   2. What is a good processor-to-storage ratio for a task node with 4TB of
>>   raw storage? (The config above has 1 core per 1TB of raw storage.)
>
> Again, I would recommend doubling your cores and memory.
>
>>
>>   3. Am I better off using dual quads for a task node, with a higher power
>>   draw? A dual-quad task node with 16GB RAM and 4TB storage costs roughly
>>   $3200, but draws almost 2x as much power. The tradeoffs are:
>>      1. I will get more CPU per dollar and per watt.
>>      2. I will only be able to fit half as many dual-quad machines into a
>>      rack.
>>      3. I will get half the storage capacity per watt.
>>      4. I will get less I/O throughput overall (fewer spindles per core).
>>   4. In planning storage capacity, how much spare disk space should I take
>>   into account for 'scratch'? For now, I'm assuming 1x the input data
>> size.
>
> What do you mean by "scratch"?  If you mean mapper intermediate data, then
> you should plan for a fair amount.  If you install your OS on one partition
> and devote all other partitions and disks to HDFS, Hadoop will do something
> reasonable with regard to DFS data, MR temporary data, etc.
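>
> For example, here's a sketch of that layout in hadoop-site.xml (the mount
> points are again placeholders):
>
>   <property>
>     <name>dfs.data.dir</name>
>     <value>/disk1/dfs/data,/disk2/dfs/data,/disk3/dfs/data,/disk4/dfs/data</value>
>   </property>
>   <property>
>     <name>mapred.local.dir</name>
>     <value>/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local,/disk4/mapred/local</value>
>   </property>
>
> Map output spills to mapred.local.dir, so spreading it across the same four
> spindles gives you scratch headroom without carving out a dedicated
> partition.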
>
>>
>>
>> Thanks in advance,
>>
>> - P
>>
>
>
