Did you do yourRdd.coalesce(6).saveAsTextFile()
or
yourRdd.coalesce(6)
yourRdd.saveAsTextFile()
?
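To illustrate the difference: coalesce() is a transformation that returns a new
RDD/DataFrame and leaves the original untouched, so the reduced partition count
only takes effect if the save is done on the returned value. A minimal sketch,
assuming Spark 1.4 with HiveContext for ORC; the paths, app name, and object
name are made up for illustration, not taken from this thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CoalesceBeforeWrite {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("coalesce-example"))
    // The ORC data source needs HiveContext in Spark 1.4.
    val sqlContext = new HiveContext(sc)

    // Hypothetical input path; the thread does not name one.
    val df = sqlContext.read.format("orc").load("/path/in/hdfs/input")

    // Has no effect on df: the coalesced DataFrame is discarded, so this
    // write still emits one part file per existing partition (the ~200
    // small files described below).
    df.coalesce(6)
    df.write.format("orc").save("/path/in/hdfs/out_many_parts")

    // Chain (or reassign) instead: the write now sees 6 partitions and
    // produces 6 part files.
    df.coalesce(6).write.format("orc").save("/path/in/hdfs/out_six_parts")

    // The same applies to RDDs and saveAsTextFile.
    val rdd = sc.textFile("/path/in/hdfs/lines")
    rdd.coalesce(6).saveAsTextFile("/path/in/hdfs/lines_six_parts")

    sc.stop()
  }
}

Note that coalesce(n) merges existing partitions without a full shuffle, while
repartition(n) shuffles; either way, the number of part files written equals
the number of partitions at write time.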
Srikanth

On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

> Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), and
> neither reduces the number of part-xxxxx files. Even after calling the
> methods above I still see around 200 small part files of about 20 MB each,
> which are again ORC files.
>
> On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu
> <vsathishkuma...@gmail.com> wrote:
>
>> Try the coalesce function to limit the number of part files.
>>
>> On Mon, Jul 6, 2015 at 1:23 PM kachau <umesh.ka...@gmail.com> wrote:
>>
>>> Hi, I have a couple of Spark jobs that process thousands of files every
>>> day. File sizes vary from MBs to GBs. After a job finishes I usually
>>> save the output with the following code:
>>>
>>> finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
>>> dataFrame.write.format("orc").save("/path/in/hdfs") // storing as an ORC
>>> file as of Spark 1.4
>>>
>>> The Spark job creates plenty of small part files in the final output
>>> directory. As far as I understand, Spark creates one part file per
>>> partition/task; please correct me if I am wrong. How do we control the
>>> number of part files Spark creates? Finally, I would like to create a
>>> Hive table over these Parquet/ORC directories, and I have heard that
>>> Hive is slow when there is a large number of small files. Please guide
>>> me; I am new to Spark. Thanks in advance.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
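For the Hive-table step in the original question, one option is to point an
external ORC table at the coalesced output directory, so Hive scans a handful
of larger files instead of hundreds of small ones. A sketch via HiveContext;
the table name, column names, and path are hypothetical, since the thread
never shows a schema:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CreateHiveTableOverOrc {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("orc-external-table"))
    val hiveContext = new HiveContext(sc)

    // Hypothetical schema; replace with the real columns of the ORC output.
    hiveContext.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS my_orc_table (id BIGINT, payload STRING)
        |STORED AS ORC
        |LOCATION '/path/in/hdfs/out_six_parts'""".stripMargin)

    sc.stop()
  }
}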