Hi Srikanth, thanks for the response. I have the following code:
hiveContext.sql("insert into... ").coalesce(6)
The above code does not create 6 part files; it creates around 200 small files.
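For reference, a sketch of the DataFrame-side alternative I am considering
(the source table name below is a placeholder, and the ~200 files I see
happen to match the default spark.sql.shuffle.partitions of 200):

// select into a DataFrame first, then coalesce before the write, so the
// save itself runs with 6 partitions and therefore produces 6 part files
val df = hiveContext.sql("select * from source_table")
df.coalesce(6).write.format("orc").save("/path/in/hdfs")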
Please guide. Thanks.
On Jul 8, 2015 4:07 AM, "Srikanth" <[email protected]> wrote:
> Did you do
>
> yourRdd.coalesce(6).saveAsTextFile()
>
> or
>
> yourRdd.coalesce(6)
> yourRdd.saveAsTextFile()
> ?
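> (Note that coalesce returns a new RDD rather than modifying yourRdd in
> place, so the second form saves the original, un-coalesced RDD and the
> part file count is unchanged.)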
>
> Srikanth
>
> On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha <[email protected]>
> wrote:
>
>> Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), and
>> neither reduces the number of part-xxxxx files. Even after calling the
>> above methods I still see around 200 small part files of about 20 MB
>> each, again as ORC files.
>>
>>
>> On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu <
>> [email protected]> wrote:
>>
>>> Try the coalesce function to limit the number of part files.
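>>> For example, a minimal sketch (the path and partition count are
>>> illustrative):
>>>
>>> // coalesce to 6 partitions before the write so at most 6 part
>>> // files land in the output directory
>>> df.coalesce(6).write.parquet("/path/in/hdfs")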
>>> On Mon, Jul 6, 2015 at 1:23 PM kachau <[email protected]> wrote:
>>>
>>>> Hi, I have a couple of Spark jobs which process thousands of files
>>>> every day. File sizes may vary from MBs to GBs. After a job finishes
>>>> I usually save the output using the following code:
>>>>
>>>> finalJavaRDD.saveAsParquetFile("/path/in/hdfs");
>>>> // or, storing as an ORC file as of Spark 1.4:
>>>> dataFrame.write.format("orc").save("/path/in/hdfs")
>>>>
>>>> The Spark job creates plenty of small part files in the final output
>>>> directory. As far as I understand, Spark creates one part file per
>>>> partition/task; please correct me if I am wrong. How do we control the
>>>> number of part files Spark creates? Finally, I would like to create a
>>>> Hive table over these parquet/orc directories, and I have heard Hive
>>>> is slow when there is a large number of small files.
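>>>> For example, is something like this sketch the intended way (the
>>>> partition count is illustrative)?
>>>>
>>>> // if one part file is written per partition, fixing the partition
>>>> // count before the write should fix the number of part files
>>>> dataFrame.repartition(6).write.format("orc").save("/path/in/hdfs")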
>>>> Please guide me; I am new to Spark. Thanks in advance.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>>
>>
>