Reducing the number of partitions may have an impact on memory consumption, especially if the keys used in the groupBy are unevenly distributed. It depends on your dataset.
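A minimal sketch of the two knobs discussed in this thread (Scala; the table and column names are placeholders, not code from the thread):

    // Lower the number of shuffle partitions used by Spark SQL aggregations/joins.
    // With heavily skewed keys, fewer partitions means more rows per partition,
    // which can increase memory pressure on the executors handling the hot keys.
    hiveContext.setConf("spark.sql.shuffle.partitions", "10")

    // Hypothetical query: inspect key skew before settling on a partition count.
    hiveContext.sql("SELECT key, count(*) AS cnt FROM my_table GROUP BY key ORDER BY cnt DESC").show()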
On Sat, Jul 11, 2015 at 5:06 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

> Hi Srikanth, thanks much. It worked when I set spark.sql.shuffle.partitions=10.
> I think reducing shuffle partitions will slow down my hiveContext group by
> query, or won't it? Please guide.
>
> On Sat, Jul 11, 2015 at 7:41 AM, Srikanth <srikanth...@gmail.com> wrote:
>
>> Is there a join involved in your SQL?
>> Have a look at spark.sql.shuffle.partitions.
>>
>> Srikanth
>>
>> On Wed, Jul 8, 2015 at 1:29 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>
>>> Hi Srikanth, thanks for the response. I have the following code:
>>>
>>> hiveContext.sql("insert into... ").coalesce(6)
>>>
>>> The above code does not create 6 part files; it creates around 200 small
>>> files.
>>>
>>> Please guide. Thanks.
>>>
>>> On Jul 8, 2015 4:07 AM, "Srikanth" <srikanth...@gmail.com> wrote:
>>>
>>>> Did you do
>>>>
>>>> yourRdd.coalesce(6).saveAsTextFile()
>>>>
>>>> or
>>>>
>>>> yourRdd.coalesce(6)
>>>> yourRdd.saveAsTextFile()
>>>> ?
>>>>
>>>> Srikanth
>>>>
>>>> On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>>>
>>>>> Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), and
>>>>> neither reduces the number of part-xxxxx files. Even after calling the
>>>>> above methods I still see around 200 small part files of about 20 MB
>>>>> each, which are again ORC files.
>>>>>
>>>>> On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu <
>>>>> vsathishkuma...@gmail.com> wrote:
>>>>>
>>>>>> Try the coalesce function to limit the number of part files.
>>>>>>
>>>>>> On Mon, Jul 6, 2015 at 1:23 PM kachau <umesh.ka...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, I have a couple of Spark jobs which process thousands of files
>>>>>>> every day. File sizes may vary from MBs to GBs. After finishing a job
>>>>>>> I usually save using the following code:
>>>>>>>
>>>>>>> finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
>>>>>>> dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC
>>>>>>> files as of Spark 1.4
>>>>>>>
>>>>>>> The Spark job creates plenty of small part files in the final output
>>>>>>> directory. As far as I understand, Spark creates a part file for each
>>>>>>> partition/task; please correct me if I am wrong. How do we control the
>>>>>>> number of part files Spark creates? Finally, I would like to create a
>>>>>>> Hive table on top of these parquet/orc directories, and I have heard
>>>>>>> Hive is slow when there is a large number of small files.
>>>>>>> Please guide; I am new to Spark. Thanks in advance.
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
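Coming back to the coalesce question quoted above: as I understand it, coalesce returns a new DataFrame rather than modifying the one it is called on, and for an "insert into ..." statement the files are written by the INSERT itself, so coalescing the DataFrame it returns cannot reduce them. A rough sketch, with a placeholder query and paths (not the exact code from this thread):

    // Placeholder query; replace with your own.
    val df = hiveContext.sql("SELECT * FROM my_table")

    // Has no effect on the output files: the coalesced DataFrame is discarded.
    df.coalesce(6)
    df.write.format("orc").save("/path/in/hdfs/out_many_files")

    // Writes at most 6 part files: the write happens on the coalesced DataFrame.
    df.coalesce(6).write.format("orc").save("/path/in/hdfs/out_six_files")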