Hi Todd,
Thank you for your input. Our data is like any Apache log file(s): basic
logging info which we are parsing. We have a lot of data, which is why we are
using Hadoop :).
I will look into running TTs on the HDFS client machines just for job
processing, without storing any data locally. We can then also gauge
performance by running both the map and reduce steps inside Hadoop.
Thanks,
Usman
On Fri, May 1, 2009 at 4:22 AM, Usman Waheed <[email protected]> wrote:
Hi,
I just wanted to share a test we conducted on our small cluster of 3
datanodes and one namenode. Basically we have lots of data to process, and we
run a parsing script outside Hadoop that creates the key,value pairs. This
output, which is plain text files, is then imported into Hadoop using the
put/get etc. commands.
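(For context, the put from one of our client machines is essentially the
following through the Java API; just a rough sketch, with a made-up class
name and placeholder paths:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough equivalent of "hadoop fs -put" from a client machine whose
// fs.default.name points at our namenode. Paths are placeholders.
public class PutParsedOutput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up the cluster config
    FileSystem fs = FileSystem.get(conf);       // connects to the namenode
    fs.copyFromLocalFile(new Path("/tmp/parsed/part-000.txt"),
                         new Path("/user/usman/parsed/part-000.txt"));
    fs.close();
  }
}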
Thanks for sharing, Usman. Some comments below:
In order to speed things up, we run the parsing jobs in parallel on multiple
machines which are not part of our cluster (3 datanodes + namenode), but
which do have the same version of Hadoop installed as the cluster and which
we use to perform the puts. This workflow has significantly improved our time
to import the data into Hadoop, after which we run the reduce-only step to
aggregate.
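That reduce-only step is roughly along these lines (just a sketch, not our
exact job: it assumes the parsed files are tab-separated key/count lines and
that aggregating means summing the counts per key; class and path names are
made up):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class AggregateJob {

  // Sums the per-key counts produced by the external parsing step.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, LongWritable> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (values.hasNext()) {
        sum += Long.parseLong(values.next().toString().trim());
      }
      out.collect(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(AggregateJob.class);
    conf.setJobName("aggregate-parsed-logs");

    conf.setInputFormat(KeyValueTextInputFormat.class);  // splits each line on the first tab
    FileInputFormat.setInputPaths(conf, new Path("/user/usman/parsed"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/usman/aggregated"));

    conf.setMapperClass(IdentityMapper.class);   // "reduce-only": the map is a pass-through
    conf.setReducerClass(SumReducer.class);

    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    JobClient.runJob(conf);
  }
}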
Currently the way to insert data is through our namenode, which all the
machines outside the cluster (call them HDFS clients) connect to; they are
not part of the master/slave setup. I haven't tried it, but maybe we can
perform these puts via the datanodes themselves and not just through the
namenode? Right now the namenode is the single point through which the HDFS
client machines insert the parsed data.
When you do an HDFS put "to the namenode", the actual data transfer goes to
the datanodes anyway - the namenode isn't a data bottleneck. It's just used
to allocate block locations for writers, and then the DFSClient connects
directly to the DNs to transfer the data.
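If you want to see that for yourself, you can ask the namenode where the
blocks of a file you just put actually ended up; a quick sketch (the path is
a placeholder):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/user/usman/parsed/part-000.txt");
    FileStatus stat = fs.getFileStatus(p);
    // The namenode only returns metadata here; the block data itself
    // lives on the datanodes listed in each BlockLocation.
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset() + " len=" + b.getLength()
          + " hosts=" + Arrays.toString(b.getHosts()));
    }
    fs.close();
  }
}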
Secondly, I would assume that this is a safe way to import parsed data into
Hadoop before we aggregate, and that it will most likely not cause any data
corruption in HDFS. Granted, anything can happen :).
Yep, this should be as safe as any other method.
It would be interesting to import our raw logs and perform the mapping step
inside Hadoop versus doing it outside. I wonder if the performance would be
better, worse or the same. Yes, this depends on many factors - the number of
datanodes, the amount of data to process, the hardware etc. - and we are
limited there. We are trying to utilize machines outside the cluster which
are idle and can process info, and then insert the output into HDFS via puts.
The performance is probably comparable on a small cluster like this,
depending on what the ratio of input/output data is. The advantage of doing
the writes from within the cluster is that the namenode will try to allocate
block locations on the local node, so there is less total transfer into the
cluster. In a larger cluster, you might end up with a network bottleneck
going into the cluster, but with three nodes and any reasonable switch you
shouldn't be running into that.
What does your input data look like? It might make more sense to upload it
directly to the cluster and then use an MR job to perform the transformation.
This way you don't have to worry about doing that distribution yourself. If
you want to make use of those "extra" nodes that aren't part of your cluster,
you could probably just run TTs on them without running DNs.
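For the transformation itself, the map could just parse the raw access-log
lines directly; a rough sketch (I'm guessing at what your parser extracts -
here it simply counts requests per client IP, the first field of a
common/combined log line):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Reads raw log lines via TextInputFormat and emits (client IP, 1) per request.
public class LogParseMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split(" ");
    if (fields.length > 0 && fields[0].length() > 0) {
      out.collect(new Text(fields[0]), ONE);
    }
  }
}

You'd wire that into a JobConf like any other job (TextInputFormat over the
raw log directory, org.apache.hadoop.mapred.lib.LongSumReducer as the
reducer) and skip the external parsing pass entirely.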
-Todd