Hi Ayan,
"My problem is to get data on to HDFS for the first time."
Well, you have to put the files on the cluster first. With this simple command you can
load them into HDFS:
hdfs dfs -put $LOCAL_SRC_DIR $HDFS_PATH
Then, I think you have to use coalesce in order to create an uber super
mega file :) but
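A minimal sketch of that coalesce step in Scala (assuming you already have a
SparkContext called sc; the HDFS paths are just placeholders):

    // read all the small files you uploaded with -put
    val small = sc.textFile("hdfs:///staging/small-files/*")

    // coalesce(1) collapses the data into a single partition,
    // so saveAsTextFile writes one part file instead of thousands
    small.coalesce(1).saveAsTextFile("hdfs:///data/one-big-file")

Keep in mind that coalesce(1) funnels everything through a single task, so it
only makes sense if the merged data fits comfortably on one executor.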
Hi
Thanks for your mail. I have read a few of those posts, but the solutions
always assume the data is already on HDFS. My problem is to get the data onto
HDFS for the first time.
One way I can think of is to load the small files from each cluster machine
into the same folder. For example, load files 1-0.3 mil o
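One way to script that idea directly against HDFS (a rough sketch using the
Hadoop FileSystem API; the directory names are placeholders, not from this
thread):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val conf = new Configuration()      // picks up core-site.xml / hdfs-site.xml
    val fs   = FileSystem.get(conf)

    // copy every file in a local directory into the same HDFS folder
    val localDir = new java.io.File("/local/small-files")
    val target   = new Path("hdfs:///staging/small-files")
    localDir.listFiles().foreach { f =>
      fs.copyFromLocalFile(new Path(f.getAbsolutePath), target)
    }

Running the same snippet (or an hdfs dfs -put loop) on each machine, all
pointed at the same HDFS folder, would leave one directory of small files to
merge afterwards.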
That is a good question, Ayan. A few searches on SO turn up:
http://stackoverflow.com/questions/31009834/merge-multiple-small-files-in-to-few-larger-files-in-spark
http://stackoverflow.com/questions/29025147/how-can-i-merge-spark-results-files-without-repartition-and-copymerge
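If you do end up merging the files after the fact, those threads also mention
Hadoop's FileUtil.copyMerge as an alternative to coalesce/repartition; a rough
sketch (Hadoop 2.x API, removed in Hadoop 3; paths are placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // concatenate every part file under the source directory into one HDFS file
    FileUtil.copyMerge(fs, new Path("hdfs:///data/one-big-file"),
                       fs, new Path("hdfs:///data/merged.txt"),
                       false,   // false = keep the source files
                       conf, null)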
good luck, tell