Hey guys, 

We have been using Hadoop to do batch processing of logs. The logs are written
and stored on a NAS. Our Hadoop cluster periodically copies a batch of new logs
from the NAS, over NFS, into HDFS, processes them, and copies the output back
to the NAS. HDFS is cleaned out at the end of each batch (i.e., everything in
it is deleted).
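For context, one batch cycle is roughly the following (a sketch only; the
mount paths, jar name, and job class are placeholders, not our real ones):

```shell
#!/bin/sh
# Sketch of one batch cycle. All paths, the jar, and the class name
# below are hypothetical placeholders.
run_batch() {
    nas_in=/mnt/nas/logs/incoming     # NFS mount with new logs (assumed)
    nas_out=/mnt/nas/logs/processed   # results copied back here (assumed)
    hdfs_in=/batch/input
    hdfs_out=/batch/output

    # 1. stage the new batch from the NAS into HDFS
    hadoop fs -mkdir -p "$hdfs_in"
    hadoop fs -put "$nas_in"/*.log "$hdfs_in"/

    # 2. run the processing job (jar/class are placeholders)
    hadoop jar logjob.jar com.example.LogJob "$hdfs_in" "$hdfs_out"

    # 3. copy the output back to the NAS
    hadoop fs -get "$hdfs_out"/part-* "$nas_out"/

    # 4. wipe HDFS at the end of the batch
    hadoop fs -rm -r "$hdfs_in" "$hdfs_out"
}
```

Step 1 (the -put off the NFS mount) is where we hit the wall.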

The problem is that reads off the NAS over NFS don't scale, even when we try
to scale the copy step by adding more threads to read in parallel.
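What we tried is essentially equivalent to the sketch below (paths are
hypothetical): fan out the per-file copies across N workers. Even so,
throughput tops out at whatever the single NFS server can serve.

```shell
#!/bin/sh
# parallel_fetch SRC DST NPROCS: copy *.log files off an NFS mount
# with NPROCS concurrent readers. SRC/DST below are placeholders.
parallel_fetch() {
    src=$1; dst=$2; nprocs=${3:-8}
    mkdir -p "$dst"
    # -P runs up to $nprocs cp processes at once, one file each
    find "$src" -name '*.log' -print0 |
        xargs -0 -P "$nprocs" -I {} cp {} "$dst"/
}

# Roughly how we invoke it (paths hypothetical):
# parallel_fetch /mnt/nas/logs /local/staging 8
```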

If we instead kept the log files on an HDFS cluster rather than the NAS, it
seems the reads would scale, since the data could be read from multiple
DataNodes at the same time without contention (apart from network I/O, which
shouldn't be a bottleneck).

I would appreciate it if anyone could share any similar experience they have
had doing parallel reads from a storage HDFS.

Also, is it a good idea to run a separate HDFS cluster for storage versus one
for the batch processing?

Best Regards,
TCK
