Cross-posting from @user. Hello, guys!
I am working on an implementation of a custom DataSource for the Spark DataFrame API and have a question:

If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can sort the data inside each partition in my data source.

Is there a built-in option to tell Spark that the data in each partition is already sorted?

It seems that Spark could benefit from partitions that are already sorted, for example by using a distributed merge sort algorithm.

Does that make sense to you?

28.11.2017 18:42, Michael Artz wrote:
I'm not sure, other than retrieving from a Hive table that is already sorted. This sounds cool though; I would be interested to know this as well.

On Nov 28, 2017 10:40 AM, "Николай Ижиков" <nizhikov....@gmail.com> wrote:
> [...]
--------------------------------------------------------------------- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org