Reducing the number of partitions may have an impact on memory consumption, especially if the keys used in the groupBy are unevenly distributed. It depends on your dataset.
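A minimal sketch of the two knobs discussed in this thread (Scala; the table and column names are placeholders, not code from the thread):

    // Lower the number of shuffle partitions used by Spark SQL aggregations/joins.
    // With heavily skewed keys, fewer partitions means more rows per partition,
    // which can increase memory pressure on the executors handling the hot keys.
    hiveContext.setConf("spark.sql.shuffle.partitions", "10")

    // Hypothetical query: inspect key skew before settling on a partition count.
    hiveContext.sql("SELECT key, count(*) AS cnt FROM my_table GROUP BY key ORDER BY cnt DESC").show()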
On Sat, Jul 11, 2015 at 5:06 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

> Hi Srikanth, thanks much. It worked when I set spark.sql.shuffle.partitions=10.
> I think reducing shuffle partitions will slow down my hiveContext group by
> query, or won't it? Please guide.
>
> On Sat, Jul 11, 2015 at 7:41 AM, Srikanth <srikanth...@gmail.com> wrote:
>
>> Is there a join involved in your SQL?
>> Have a look at spark.sql.shuffle.partitions.
>>
>> Srikanth
>>
>> On Wed, Jul 8, 2015 at 1:29 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>
>>> Hi Srikanth, thanks for the response. I have the following code:
>>>
>>> hiveContext.sql("insert into... ").coalesce(6)
>>>
>>> The above code does not create 6 part files; it creates around 200 small
>>> files.
>>>
>>> Please guide. Thanks.
>>>
>>> On Jul 8, 2015 4:07 AM, "Srikanth" <srikanth...@gmail.com> wrote:
>>>
>>>> Did you do
>>>>
>>>> yourRdd.coalesce(6).saveAsTextFile()
>>>>
>>>> or
>>>>
>>>> yourRdd.coalesce(6)
>>>> yourRdd.saveAsTextFile()
>>>> ?
>>>>
>>>> Srikanth
>>>>
>>>> On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>>>
>>>>> Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), and
>>>>> neither reduces the number of part-xxxxx files. Even after calling the
>>>>> above methods I still see around 200 small part files of about 20 MB
>>>>> each, which are again ORC files.
>>>>>
>>>>> On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu <
>>>>> vsathishkuma...@gmail.com> wrote:
>>>>>
>>>>>> Try the coalesce function to limit the number of part files.
>>>>>>
>>>>>> On Mon, Jul 6, 2015 at 1:23 PM kachau <umesh.ka...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, I have a couple of Spark jobs which process thousands of files
>>>>>>> every day. File sizes may vary from MBs to GBs. After finishing a job
>>>>>>> I usually save using the following code:
>>>>>>>
>>>>>>> finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
>>>>>>> dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC
>>>>>>> files as of Spark 1.4
>>>>>>>
>>>>>>> The Spark job creates plenty of small part files in the final output
>>>>>>> directory. As far as I understand, Spark creates a part file for each
>>>>>>> partition/task; please correct me if I am wrong. How do we control the
>>>>>>> number of part files Spark creates? Finally, I would like to create a
>>>>>>> Hive table on top of these parquet/orc directories, and I have heard
>>>>>>> Hive is slow when there is a large number of small files.
>>>>>>> Please guide; I am new to Spark. Thanks in advance.
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
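Coming back to the coalesce question quoted above: as I understand it, coalesce returns a new DataFrame rather than modifying the one it is called on, and for an "insert into ..." statement the files are written by the INSERT itself, so coalescing the DataFrame it returns cannot reduce them. A rough sketch, with a placeholder query and paths (not the exact code from this thread):

    // Placeholder query; replace with your own.
    val df = hiveContext.sql("SELECT * FROM my_table")

    // Has no effect on the output files: the coalesced DataFrame is discarded.
    df.coalesce(6)
    df.write.format("orc").save("/path/in/hdfs/out_many_files")

    // Writes at most 6 part files: the write happens on the coalesced DataFrame.
    df.coalesce(6).write.format("orc").save("/path/in/hdfs/out_six_files")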