You will most likely hit NFS server limits well before you see any noticeable issues with HDFS.

Writes to a file are sequential, so the total throughput of your transfer depends on the number of files and the rate at which they can be read from NFS. If the total data set is split across a reasonable number of files (say, around 2 GB each), the upload rate can be matched to the NFS server limits. On a small cluster, mounting the filesystem via NFS on each node and using distcp with an input path of file:///<path> would work.
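A rough sketch of that, with placeholder paths and hostname (it assumes the export is mounted at /mnt/nfs/data on every node that runs map tasks and that the NameNode listens at namenode:8020); tune the number of maps with -m to what the NFS server can sustain:

    # placeholder source mount, namenode host/port, and target dir - adjust to your cluster
    hadoop distcp -m 16 file:///mnt/nfs/data hdfs://namenode:8020/data/incoming

Each map task then reads its share of the source files through the local mount, so parallelism is limited by the NFS server rather than by HDFS.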
Another option is making your files available via HTTP and running a simple streaming job to parallelize the data pull. It basically comes down to how you want to initiate the parallel copies; a rough sketch of the streaming variant follows below the quoted thread.

-rajive

On Jan 25, 2012, at 1:19, Ajit Ratnaparkhi <ajit.ratnapar...@gmail.com> wrote:

> Hi Raj,
>
> If you have all the data on an NFS-mounted disk, meaning on a single machine,
> then your upload will be limited by network bandwidth. You can try running
> dfs -put in multiple parallel threads for distinct data sets; that way you
> might be able to utilise the network bandwidth to its maximum (take care not
> to have too many threads, otherwise the namenode handlers will be busy all
> the time, making dfs unresponsive). I don't see any other way to make it
> faster; making the upload faster would require the data source to be present
> at distributed locations, which is not true in this case.
>
> -Ajit
>
> On Wed, Jan 25, 2012 at 10:46 AM, Praveen Sripati
> <praveensrip...@gmail.com> wrote:
>
>>> If it is divided up into several files and you can mount your NFS
>>> directory on each of the datanodes.
>>
>> Just curious, how will this help?
>>
>> Praveen
>>
>> On Wed, Jan 25, 2012 at 12:39 AM, Robert Evans <ev...@yahoo-inc.com> wrote:
>>
>>> If it is divided up into several files and you can mount your NFS
>>> directory on each of the datanodes, you could possibly use distcp to do it.
>>> I have never tried using distcp for this, but it should work. Or you can
>>> write your own streaming Map/Reduce script that does more or less the same
>>> thing as distcp: it takes as input the list of files to copy and does a
>>> hadoop fs -put for each file, having it come from NFS.
>>>
>>> --Bobby Evans
>>>
>>> On 1/24/12 12:49 AM, "rajmca2002" <rajmca2...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I have TBs of data on NFS and I need to move it to HDFS. I have used the
>>> hadoop put command to do this, but it took hours to place the files in
>>> HDFS. Is there a good approach for moving large files to HDFS?
>>>
>>> Please reply asap.
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Moving-TB-of-data-from-NFS-to-HDFS-tp33193061p33193061.html
>>> Sent from the Hadoop core-dev mailing list archive at Nabble.com.
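A minimal sketch of the streaming variant referred to above (the "hadoop fs -put per file" idea from the quoted thread). All names are placeholders: it assumes the NFS export is mounted at the same path on every task node and that /lists/nfs-files.txt in HDFS contains one absolute source path per line.

    #!/bin/sh
    # copy.sh - streaming mapper: read one source path per line from stdin,
    # copy it from the local NFS mount into HDFS, and report the outcome.
    # /data/incoming is a placeholder target directory that must already exist.
    while read src; do
      if hadoop fs -put "$src" /data/incoming/; then
        echo "$src OK"
      else
        echo "$src FAILED"
      fi
    done

Launched as a map-only streaming job (the streaming jar path matches a Hadoop 1.x layout; adjust for your distribution):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -D mapred.reduce.tasks=0 \
      -input /lists/nfs-files.txt \
      -output /lists/copy-status \
      -mapper copy.sh \
      -file copy.sh

The number of concurrent puts is then governed by how the file list is split across map tasks.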