Did you do yourRdd.coalesce(6).saveAsTextFile()
or
yourRdd.coalesce(6)
yourRdd.saveAsTextFile()
?
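To illustrate the difference: coalesce() is a transformation that returns a new
RDD/DataFrame and leaves the original untouched, so the reduced partition count
only takes effect if the save is done on the returned value. A minimal sketch,
assuming Spark 1.4 with HiveContext for ORC; the paths, app name, and object
name are made up for illustration, not taken from this thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CoalesceBeforeWrite {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("coalesce-example"))
    // The ORC data source needs HiveContext in Spark 1.4.
    val sqlContext = new HiveContext(sc)

    // Hypothetical input path; the thread does not name one.
    val df = sqlContext.read.format("orc").load("/path/in/hdfs/input")

    // Has no effect on df: the coalesced DataFrame is discarded, so this
    // write still emits one part file per existing partition (the ~200
    // small files described below).
    df.coalesce(6)
    df.write.format("orc").save("/path/in/hdfs/out_many_parts")

    // Chain (or reassign) instead: the write now sees 6 partitions and
    // produces 6 part files.
    df.coalesce(6).write.format("orc").save("/path/in/hdfs/out_six_parts")

    // The same applies to RDDs and saveAsTextFile.
    val rdd = sc.textFile("/path/in/hdfs/lines")
    rdd.coalesce(6).saveAsTextFile("/path/in/hdfs/lines_six_parts")

    sc.stop()
  }
}

Note that coalesce(n) merges existing partitions without a full shuffle, while
repartition(n) shuffles; either way, the number of part files written equals
the number of partitions at write time.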
Srikanth

On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

> Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), and
> neither reduces the number of part-xxxxx files. Even after calling the
> methods above I still see around 200 small part files of about 20 MB each,
> which are again ORC files.
>
> On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu
> <vsathishkuma...@gmail.com> wrote:
>
>> Try the coalesce function to limit the number of part files.
>>
>> On Mon, Jul 6, 2015 at 1:23 PM kachau <umesh.ka...@gmail.com> wrote:
>>
>>> Hi, I have a couple of Spark jobs that process thousands of files every
>>> day. File sizes vary from MBs to GBs. After a job finishes I usually
>>> save the output with the following code:
>>>
>>> finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
>>> dataFrame.write.format("orc").save("/path/in/hdfs") // storing as an ORC
>>> file as of Spark 1.4
>>>
>>> The Spark job creates plenty of small part files in the final output
>>> directory. As far as I understand, Spark creates one part file per
>>> partition/task; please correct me if I am wrong. How do we control the
>>> number of part files Spark creates? Finally, I would like to create a
>>> Hive table over these Parquet/ORC directories, and I have heard that
>>> Hive is slow when there is a large number of small files. Please guide
>>> me; I am new to Spark. Thanks in advance.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
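For the Hive-table step in the original question, one option is to point an
external ORC table at the coalesced output directory, so Hive scans a handful
of larger files instead of hundreds of small ones. A sketch via HiveContext;
the table name, column names, and path are hypothetical, since the thread
never shows a schema:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CreateHiveTableOverOrc {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("orc-external-table"))
    val hiveContext = new HiveContext(sc)

    // Hypothetical schema; replace with the real columns of the ORC output.
    hiveContext.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS my_orc_table (id BIGINT, payload STRING)
        |STORED AS ORC
        |LOCATION '/path/in/hdfs/out_six_parts'""".stripMargin)

    sc.stop()
  }
}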