Just to give another data point: most of the data we use with Spark is sorted on disk, so a way for a data source to pass ordering and distribution information on to DataFrames would be really useful for us.
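
To make the benefit concrete, here is a minimal sketch in plain Python. It is not Spark code and not a proposal for the actual API; the function name and the sample partitions are made up for illustration. The point is only that when every partition is already sorted, a global order needs a k-way merge, O(n log k) for n rows in k partitions, rather than a full O(n log n) sort.

```python
import heapq
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def merge_sorted_partitions(partitions: Iterable[Iterable[T]]) -> Iterator[T]:
    """Lazily merge partitions that are each already sorted ascending.

    Assumption: every partition is sorted on the same key, which is the
    guarantee a data source would have to report for this to be valid.
    """
    # heapq.merge keeps one head element per partition in a heap, so it
    # never has to re-sort the rows inside a partition.
    return heapq.merge(*partitions)


# Hypothetical partitions, each already sorted by the data source.
partitions = [[1, 4, 9], [2, 3, 10], [5, 6, 7, 8]]
merged = list(merge_sorted_partitions(partitions))
print(merged)
```

If the source could not guarantee the per-partition order, the merge would silently produce wrong results, which is why the ordering has to be reported explicitly rather than assumed.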

On Mon, Dec 4, 2017 at 9:12 AM, Николай Ижиков <nizhikov....@gmail.com> wrote:

> Hello, guys.
>
> Thank you for the answers!
>
> > I think pushing down a sort .... could make a big difference.
>
> > You can, however, propose that it be included in data source API 2.
>
> Jörn, are you talking about this JIRA issue?
> https://issues.apache.org/jira/browse/SPARK-15689
> Is there any additional documentation I have to read before making a
> proposal?
>
> 04.12.2017 14:05, Holden Karau wrote:
>
>> I think pushing down a sort (or really more in the case where the data is
>> already naturally returned in sorted order on some column) could make a big
>> difference. Probably the simplest argument for a lot of time being spent
>> sorting (in some use cases) is the fact it's still one of the standard
>> benchmarks.
>>
>> On Mon, Dec 4, 2017 at 1:55 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> I do not think that the data source API exposes such a thing. You can,
>> however, propose that it be included in data source API 2.
>>
>> However, there are some caveats, because sorted can mean two
>> different things (weak vs. strict order).
>>
>> Then, is a lot of time really lost because of sorting? The best thing
>> is to not read data that is not needed at all (see min/max indexes in
>> ORC/Parquet or bloom filters in ORC). What is not read
>> does not need to be sorted. See also predicate pushdown.
>>
>> > On 4. Dec 2017, at 07:50, Николай Ижиков <nizhikov....@gmail.com> wrote:
>> >
>> > Cross-posting from @user.
>> >
>> > Hello, guys!
>> >
>> > I am working on an implementation of a custom DataSource for the Spark
>> > DataFrame API and have a question:
>> >
>> > If I have a `SELECT * FROM table1 ORDER BY some_column` query, I
>> > can sort the data inside a partition in my data source.
>> >
>> > Is there a built-in option to tell Spark that the data in each
>> > partition is already sorted?
>> >
>> > It seems that Spark could benefit from already sorted partitions,
>> > for example by using a distributed merge sort algorithm.
>> >
>> > Does this make sense to you?
>> >
>> > 28.11.2017 18:42, Michael Artz wrote:
>> >> I'm not sure, other than retrieving from a Hive table that is
>> >> already sorted. This sounds cool though; I would be interested to
>> >> know this as well.
>>
>> --
>> Twitter: https://twitter.com/holdenkarau