Hi Ivan,


If the problem you are referring to is that the output data ends up out of
order, it is likely because a round-robin "repartition" does not preserve
ordering. You can try "repartitionByRange" together with
"sortWithinPartitions" instead:

scala> import org.apache.spark.sql.functions.col

scala> val df = sc.parallelize(1 to 1000, 10).toDF("v")

scala> df.repartitionByRange(5, col("v")).sortWithinPartitions("v").
         write.parquet(outputPath)
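
For what it's worth, the reason the original snippet writes dozens of files
is that "sort" itself triggers a shuffle into spark.sql.shuffle.partitions
partitions (200 by default), which discards the earlier repartition(5). A
quick way to see the difference in the shell, reusing the df above (the
partition counts in the comments assume the default configuration):

scala> df.repartition(5).sort("v").rdd.getNumPartitions  // up to 200: one output file per non-empty partition

scala> df.repartitionByRange(5, col("v")).sortWithinPartitions("v").rdd.getNumPartitions  // 5, and still globally ordered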



Best Regards,

m li
Ivan Petrov wrote
> Ah... makes sense, thank you. I tried sortWithinPartitions before and
> replaced it with sort. It was a mistake.
> 
> Thu, 25 Feb 2021 at 15:25, Pietro Gentile <pietro.gentile89.developer@>:
> 
>> Hi,
>>
>> It is because of the *repartition* before the *sort* invocation. If
>> you reverse them, you'll see 5 output files.
>>
>> Regards,
>> Pietro
>>
>> On Wed, 24 Feb 2021 at 16:43, Ivan Petrov <capacytron@> wrote:
>>
>>> Hi, I'm trying to control the size and/or count of Spark output.
>>>
>>> Here is my code. I expect to get 5 files but I get dozens of small
>>> files. Why?
>>>
>>> dataset
>>>   .repartition(5)
>>>   .sort("long_repeated_string_in_this_column") // should compress better with snappy
>>>   .write
>>>   .parquet(outputPath)
>>>
>>




