Re: dataset sort

Till Rohrmann Fri, 09 Feb 2018 01:37:02 -0800

Hi David,

Flink only supports sorting within partitions. Thus, if you want to write
out a globally sorted dataset you should set the parallelism to 1 which
effectively results in a single partition. Decreasing the parallelism of an
operator will cause the individual partitions to lose its sort order
because the individual partitions are read in a non deterministic order.


Cheers,
Till


On Thu, Feb 8, 2018 at 8:07 PM, david westwood <david.d.westw...@gmail.com>
wrote:

> Hi:
>
> I would like to sort historical data using the dataset api.
>
> env.setParallelism(10)
>
> val dataset = [(Long, String)] ..
> .paritionByRange(_._1)
> .sortPartition(_._1, Order.ASCEDING)
> .writeAsCsv("mydata.csv").setParallelism(1)
>
> the data is out of order (in local order)
> but
> .print()
> prints the data in to correct order. I have run a small toy sample
> multiple times.
>
> Is there a way to sort the entire dataset with parallelism > 1 and write
> it to a single file in ascending order?
>

Re: dataset sort

Reply via email to