Hi Srikanth, thanks very much. It worked when I set spark.sql.shuffle.partitions=10. Will reducing the shuffle partitions slow down my hiveContext GROUP BY query, or will it not slow it down? Please guide.
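For reference, a minimal sketch of how this setting can be applied on a HiveContext (the table, column, and output path below are only placeholders, not your actual job). Lowering spark.sql.shuffle.partitions means fewer but larger shuffle tasks: the write produces fewer part files, but each GROUP BY task processes more data, so the query can run longer or spill to disk if the shuffled data is large.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("shuffle-partitions-sketch"))
    val hiveContext = new HiveContext(sc)

    // Default is 200: every shuffle (GROUP BY, join) produces ~200 partitions,
    // and writing that result produces ~200 part files.
    hiveContext.setConf("spark.sql.shuffle.partitions", "10")

    // This aggregation now shuffles into 10 partitions,
    // so saving its result gives roughly 10 part files.
    val grouped = hiveContext.sql(
      "SELECT some_key, count(*) AS cnt FROM some_table GROUP BY some_key")
    grouped.write.format("orc").save("/path/in/hdfs/grouped")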
On Sat, Jul 11, 2015 at 7:41 AM, Srikanth <srikanth...@gmail.com> wrote:

> Is there a join involved in your SQL?
> Have a look at spark.sql.shuffle.partitions.
>
> Srikanth
>
> On Wed, Jul 8, 2015 at 1:29 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>
>> Hi Srikanth, thanks for the response. I have the following code:
>>
>> hiveContext.sql("insert into... ").coalesce(6)
>>
>> The above code does not create 6 part files; it creates around 200 small
>> files.
>>
>> Please guide. Thanks.
>>
>> On Jul 8, 2015 4:07 AM, "Srikanth" <srikanth...@gmail.com> wrote:
>>
>>> Did you do
>>>
>>> yourRdd.coalesce(6).saveAsTextFile()
>>>
>>> or
>>>
>>> yourRdd.coalesce(6)
>>> yourRdd.saveAsTextFile()
>>>
>>> ?
>>>
>>> Srikanth
>>>
>>> On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>>
>>>> Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), and
>>>> neither reduces the number of part-xxxxx files. Even after calling the
>>>> above methods I still see around 200 small part files of about 20 MB
>>>> each, which are again ORC files.
>>>>
>>>> On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu
>>>> <vsathishkuma...@gmail.com> wrote:
>>>>
>>>>> Try the coalesce function to limit the number of part files.
>>>>>
>>>>> On Mon, Jul 6, 2015 at 1:23 PM kachau <umesh.ka...@gmail.com> wrote:
>>>>>
>>>>>> Hi, I have a couple of Spark jobs that process thousands of files
>>>>>> every day. File sizes may vary from MBs to GBs. After the job
>>>>>> finishes I usually save the output using code like the following:
>>>>>>
>>>>>> finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); // OR
>>>>>> dataFrame.write.format("orc").save("/path/in/hdfs") // storing as an
>>>>>> ORC file as of Spark 1.4
>>>>>>
>>>>>> The Spark job creates plenty of small part files in the final output
>>>>>> directory. As far as I understand, Spark creates one part file per
>>>>>> partition/task; please correct me if I am wrong. How do we control
>>>>>> the number of part files Spark creates? Finally, I would like to
>>>>>> create a Hive table over these Parquet/ORC directories, and I have
>>>>>> heard Hive is slow when there is a large number of small files.
>>>>>> Please guide; I am new to Spark. Thanks in advance.
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
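To illustrate the coalesce placement Srikanth was asking about, here is a minimal sketch (the table and path names are placeholders; the ORC format matches the thread and needs Spark 1.4+). coalesce has to be applied to the DataFrame or RDD that is actually written, before the write runs; calling it on the result of an "insert into ..." statement does nothing, because that insert has already written its files by the time sql() returns.

    // Placeholder source table, for illustration only.
    val df = hiveContext.table("source_table")

    // Reduces the data to ~6 partitions, so the save produces ~6 part files.
    df.coalesce(6).write.format("orc").save("/path/in/hdfs/output")

    // Has no effect on the files the insert already produced:
    // hiveContext.sql("insert into target_table select * from source_table").coalesce(6)

    // RDD form, as in Srikanth's earlier example:
    // yourRdd.coalesce(6).saveAsTextFile("/path/in/hdfs/text")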