Well, usually you sort on only a certain column and not on all columns, so most
of the columns will always be unsorted. Spark may then still need to sort if,
for example, you join (for some join types) on an unsorted column.
That being said, depending on the data you may not want to sort it, but cluster
Sorry, s/ordered distributed/ordered distribution/g
On Mon, Dec 4, 2017 at 10:37 AM, Li Jin wrote:
> Just to give another data point: most of the data we use with Spark are
> sorted on disk, so having a way to allow a data source to pass ordered
> distributed to DataFrames is really useful for us.
>
Just to give another data point: most of the data we use with Spark are
sorted on disk, so having a way to allow a data source to pass ordered
distributed to DataFrames is really useful for us.
On Mon, Dec 4, 2017 at 9:12 AM, Николай Ижиков
wrote:
> Hello, guys.
>
> Thank you for the answers!
>
> > I thi
Data Source V2 is still under development. Ordering reporting is one of the
planned features, but it's not done yet. We are still thinking about what
the API should be; e.g., we need to include sort order, nulls first/last, and
other sorting-related properties.
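
To make the properties listed above concrete, here is a minimal sketch in plain Python of the kind of information an ordering report might carry. It is not the Data Source V2 API, which had not been designed at this point; `SortKey`, `Direction`, `NullOrdering`, and `satisfies` are hypothetical names chosen for illustration, and the prefix rule is an assumption about how a reported ordering could be matched against a required one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Direction(Enum):
    ASCENDING = "asc"
    DESCENDING = "desc"


class NullOrdering(Enum):
    NULLS_FIRST = "nulls_first"
    NULLS_LAST = "nulls_last"


@dataclass(frozen=True)
class SortKey:
    """Hypothetical description of how data is sorted on one column."""
    column: str
    direction: Direction = Direction.ASCENDING
    null_ordering: NullOrdering = NullOrdering.NULLS_FIRST


def satisfies(reported: List[SortKey], required: List[SortKey]) -> bool:
    """Return True if `required` is a prefix of `reported`.

    Assumption: data sorted on (a, b) is also sorted on (a), so a consumer
    that needs ordering on (a) could skip its own sort.
    """
    return len(required) <= len(reported) and reported[:len(required)] == required


# A source that reports: sorted by ts ascending, then by id descending.
reported = [SortKey("ts"), SortKey("id", Direction.DESCENDING)]

print(satisfies(reported, [SortKey("ts")]))
print(satisfies(reported, [SortKey("id", Direction.DESCENDING)]))
```

Under this sketch, a mismatch in direction or null placement counts as "not sorted" for the consumer, which is why those properties would need to be part of whatever the source reports.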
On Mon, Dec 4, 2017 at 10:12 PM, Никола
Hello, guys.
Thank you for the answers!
> I think pushing down a sort could make a big difference.
> You can, however, propose that it be included in the Data Source API v2.
Jörn, are you talking about this JIRA issue?
https://issues.apache.org/jira/browse/SPARK-15689
Is there any additional docum
I think pushing down a sort (or, more to the point, taking advantage of data
that is already naturally returned in sorted order on some column) could make
a big difference. Probably the simplest argument for a lot of time being spent
sorting (in some use cases) is the fact that it's still one of the standard
bench
I do not think that the data source API exposes such a thing. You can, however,
propose that it be included in the Data Source API v2.
However, there are some caveats, because "sorted" can mean two different things
(weak vs. strict order).
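
The message does not define the two terms, so the following is only one possible reading, sketched in plain Python: a weak order allows ties on the sort key (rows with equal keys may appear in any relative order), while a strict order does not. The function names and sample rows are hypothetical.

```python
from typing import Any, Callable, Sequence


def is_weakly_sorted(rows: Sequence[Any], key: Callable[[Any], Any]) -> bool:
    """Keys never decrease; equal neighbouring keys (ties) are allowed."""
    keys = [key(r) for r in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def is_strictly_sorted(rows: Sequence[Any], key: Callable[[Any], Any]) -> bool:
    """Keys always increase; no two neighbouring rows share a key."""
    keys = [key(r) for r in rows]
    return all(a < b for a, b in zip(keys, keys[1:]))


# Two rows share the key 2, so their relative order is not fixed by the key.
rows_with_ties = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
rows_unique = [(1, "a"), (2, "b"), (3, "c")]


def first(row):
    return row[0]


print(is_weakly_sorted(rows_with_ties, first), is_strictly_sorted(rows_with_ties, first))
print(is_weakly_sorted(rows_unique, first), is_strictly_sorted(rows_unique, first))
```

Under this reading, a source that only guarantees a weak order could return tied rows in a different relative order on each scan, which a consumer relying on a fully deterministic order would need to account for.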
Then, is a lot of time really lost because of sorting? The best t