Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

Raghu Angadi Thu, 12 Mar 2009 14:45:33 -0700

TCK wrote:

How well does the read throughput from HDFS scale with the number of data nodes 
?
For example, if I had a large file (say 10GB) on a 10 data node cluster, would the time taken to read this whole file in parallel (ie, with multiple reader client processes requesting different parts of the file in parallel)

be halved if I had the same file on a 20 data node cluster ?

depends: yes, if whatever was bottleneck with 10 still continues to bebottleneck (i.e. you are able to saturate in both cases) and thatresource is scaled (disk or network)

Is this not possible because HDFS doesn't support random seeks?

HDFS does support random seeks for reading... your case should work.

Raghu.

What about if the file was split up into multiple smaller files before placing 
in the HDFS ?
Thanks for your input.
-TCK




--- On Wed, 2/4/09, Brian Bockelman <[email protected]> wrote:
From: Brian Bockelman <[email protected]>
Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
To: [email protected]
Date: Wednesday, February 4, 2009, 1:50 PM

Sounds overly complicated.  Complicated usually leads to mistakes :)

What about just having a single cluster and only running the tasktrackers on
the fast CPUs?  No messy cross-cluster transferring.

Brian

On Feb 4, 2009, at 12:46 PM, TCK wrote:


Thanks, Brian. This sounds encouraging for us.

What are the advantages/disadvantages of keeping a persistent storage

(HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster ?

The advantage I can think of is that a permanent storage cluster has

different requirements from a map-reduce processing cluster -- the permanent
storage cluster would need faster, bigger hard disks, and would need to grow as
the total volume of all collected logs grows, whereas the processing cluster
would need fast CPUs and would only need to grow with the rate of incoming data.
So it seems to make sense to me to copy a piece of data from the permanent
storage cluster to the processing cluster only when it needs to be processed. Is
my line of thinking reasonable? How would this compare to running the map-reduce
processing on same cluster as the data is stored in? Which approach is used by
most people?

Best Regards,
TCK



--- On Wed, 2/4/09, Brian Bockelman <[email protected]> wrote:
From: Brian Bockelman <[email protected]>
Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel

reads?

To: [email protected]
Date: Wednesday, February 4, 2009, 1:06 PM

Hey TCK,

We use HDFS+FUSE solely as a storage solution for a application which
doesn't understand MapReduce.  We've scaled this solution to

around

80Gbps.  For 300 processes reading from the same file, we get about

20Gbps.

Do consider your data retention policies -- I would say that Hadoop as a
storage system is thus far about 99% reliable for storage and is not a

backup

solution.  If you're scared of getting more than 1% of your logs lost,

have

a good backup solution.  I would also add that when you are learning your
operational staff's abilities, expect even more data loss.  As you

gain

experience, data loss goes down.

I don't believe we've lost a single block in the last month, but

it

took us 2-3 months of 1%-level losses to get here.

Brian

On Feb 4, 2009, at 11:51 AM, TCK wrote:

Hey guys,

We have been using Hadoop to do batch processing of logs. The logs get

written and stored on a NAS. Our Hadoop cluster periodically copies a

batch of

new logs from the NAS, via NFS into Hadoop's HDFS, processes them, and
copies the output back to the NAS. The HDFS is cleaned up at the end of

each

batch (ie, everything in it is deleted).

The problem is that reads off the NAS via NFS don't scale even if

we

try to scale the copying process by adding more threads to read in

parallel.

If we instead stored the log files on an HDFS cluster (instead of

NAS), it

seems like the reads would scale since the data can be read from multiple

data

nodes at the same time without any contention (except network IO, which
shouldn't be a problem).

I would appreciate if anyone could share any similar experience they

have

had with doing parallel reads from a storage HDFS.

Also is it a good idea to have a separate HDFS for storage vs for

doing

the batch processing ?

Best Regards,
TCK

Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

Reply via email to