Reducing the number of partitions may have an impact on memory consumption,
especially if there is an uneven distribution of the keys used in the groupBy.
It depends on your dataset.
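
For reference, a rough sketch of setting the shuffle partitions before running
such a group-by (the table, column, and context names are placeholders, not
taken from your job):

        // assuming a HiveContext like the one in your earlier snippet
        val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
        // fewer shuffle partitions -> fewer part files, but each reduce task handles more rows
        hiveContext.setConf("spark.sql.shuffle.partitions", "10")
        // with a skewed key, one of those few partitions can end up holding most of the data
        val grouped = hiveContext.sql("SELECT key, count(*) FROM my_table GROUP BY key")

With a heavily skewed key, the task that gets the hot key ends up processing
most of the data, which is where the memory impact tends to show up.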

On Sat, Jul 11, 2015 at 5:06 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

> Hi Srikanth, thanks much, it worked when I set spark.sql.shuffle.partitions=10.
> Will reducing the shuffle partitions slow down my group by query in
> hiveContext, or won't it? Please guide.
>
> On Sat, Jul 11, 2015 at 7:41 AM, Srikanth <srikanth...@gmail.com> wrote:
>
>> Is there a join involved in your SQL?
>> Have a look at spark.sql.shuffle.partitions.
>>
>> Srikanth
>>
>> On Wed, Jul 8, 2015 at 1:29 AM, Umesh Kacha <umesh.ka...@gmail.com>
>> wrote:
>>
>>> Hi Srikanth, thanks for the response. I have the following code:
>>>
>>> hiveContext.sql("insert into... ").coalesce(6)
>>>
>>> The above code does not create 6 part files; it creates around 200 small
>>> files.
>>>
>>> Please guide. Thanks.
>>> On Jul 8, 2015 4:07 AM, "Srikanth" <srikanth...@gmail.com> wrote:
>>>
>>>> Did you do
>>>>
>>>>         yourRdd.coalesce(6).saveAsTextFile()
>>>>
>>>>                         or
>>>>
>>>>         yourRdd.coalesce(6)
>>>>         yourRdd.saveAsTextFile()
>>>> ?
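>>>>
>>>> (Note: coalesce returns a new RDD rather than modifying yourRdd in place,
>>>> so the second form would not change how many files saveAsTextFile writes.)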
>>>>
>>>> Srikanth
>>>>
>>>> On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha <umesh.ka...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), and
>>>>> neither reduces the number of part-xxxxx files. Even after calling the
>>>>> above methods I still see around 200 small part files of about 20 MB
>>>>> each, which again are ORC files.
>>>>>
>>>>>
>>>>> On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu <
>>>>> vsathishkuma...@gmail.com> wrote:
>>>>>
>>>>>> Try the coalesce function to limit the number of part files.
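>>>>>> For instance, a rough sketch based on the write call in your mail below
>>>>>> (the 6 is just an arbitrary target, not a recommendation):
>>>>>>
>>>>>>         dataFrame.coalesce(6).write.format("orc").save("/path/in/hdfs")
>>>>>>
>>>>>> The writer produces one part file per partition of the DataFrame, so
>>>>>> coalescing before the write caps the number of output files.
>>>>>>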
>>>>>> On Mon, Jul 6, 2015 at 1:23 PM kachau <umesh.ka...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, I have a couple of Spark jobs which process thousands of files every
>>>>>>> day. File sizes may vary from MBs to GBs. After finishing a job I
>>>>>>> usually save the output using the following code:
>>>>>>>
>>>>>>> finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
>>>>>>> dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file as of Spark 1.4
>>>>>>>
>>>>>>> The Spark job creates plenty of small part files in the final output
>>>>>>> directory. As far as I understand, Spark creates a part file for each
>>>>>>> partition/task; please correct me if I am wrong. How do we control the
>>>>>>> number of part files Spark creates? Finally, I would like to create a
>>>>>>> Hive table on top of these parquet/orc directories, and I have heard
>>>>>>> Hive is slow when we have a large number of small files.
>>>>>>> Please guide, I am new to Spark. Thanks in advance.
>>>>>>>
>>>>>
>>>>
>>
>
