Re: Small files

2016-09-12 Thread Alonso Isidoro Roman
Hi Ayan, "My problem is to get data on to HDFS for the first time." Well, you have to put the files on the cluster first. With this simple command you can load them into HDFS:

hdfs dfs -put $LOCAL_SRC_DIR $HDFS_PATH

Then, I think you have to use coalesce in order to create an uber super mega file :) but
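As a minimal local sketch of the first half of this advice (the merge-before-upload idea): concatenate the many small files into one large file, then push that single file to HDFS. The directory and file names below are hypothetical, and the `hdfs dfs -put` line is left commented out because it needs a running cluster; `$HDFS_PATH` stays a placeholder as in the thread.

```shell
#!/bin/sh
# Sketch only: paths and file names are made up for illustration.
# Idea: merge many tiny local files into one big file, so HDFS ends up
# storing a single large file instead of thousands of small ones.
LOCAL_SRC_DIR=./local_parts
mkdir -p "$LOCAL_SRC_DIR"
printf 'a\n' > "$LOCAL_SRC_DIR/part-0"
printf 'b\n' > "$LOCAL_SRC_DIR/part-1"

# Concatenate all the small parts into one file.
cat "$LOCAL_SRC_DIR"/part-* > merged.txt

# On a real cluster you would then upload the merged file:
# hdfs dfs -put merged.txt $HDFS_PATH
wc -l < merged.txt
```

Note this only covers the upload side; the coalesce step mentioned above happens later, inside Spark, once the data is already on HDFS.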

Re: Small files

2016-09-12 Thread ayan guha
Hi, thanks for your mail. I have read a few of those posts, but the solutions always assume the data is on HDFS already. My problem is to get data onto HDFS for the first time. One way I can think of is to load the small files from each cluster machine into the same folder. For example, load file 1-0.3 mil o
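The idea sketched here (each machine uploads its own slice of the small files into one shared HDFS folder) could look roughly like the following. Everything is hypothetical: the node id, the round-robin assignment, and the target directory are assumptions, and the real `hdfs dfs -put` call is left as a comment since it needs a cluster. The script just prints which files node 0 would upload.

```shell
#!/bin/sh
# Hedged sketch of per-machine parallel upload; names are illustrative.
mkdir -p ./small_files
for n in 0 1 2 3 4 5 6 7; do printf 'row\n' > "./small_files/file-$n"; done

NODE_ID=0         # each machine would use its own id, 0..TOTAL_NODES-1
TOTAL_NODES=4

# Assign files round-robin: node k takes every file whose index % TOTAL_NODES == k.
i=0
for f in ./small_files/*; do
  if [ $((i % TOTAL_NODES)) -eq "$NODE_ID" ]; then
    # on a real cluster: hdfs dfs -put "$f" /data/incoming/
    echo "$f"
  fi
  i=$((i + 1))
done > node0_slice.txt

wc -l < node0_slice.txt
```

With 8 files and 4 nodes, each node handles 2 files, so the uploads run in parallel; the small-file problem on HDFS itself still needs the merge/coalesce step discussed in the other replies.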

Re: Small files

2016-09-12 Thread Alonso Isidoro Roman
That is a good question, Ayan. A few searches on Stack Overflow return: http://stackoverflow.com/questions/31009834/merge-multiple-small-files-in-to-few-larger-files-in-spark http://stackoverflow.com/questions/29025147/how-can-i-merge-spark-results-files-without-repartition-and-copymerge good luck, tell