Just to give another data point: most of the data we use with Spark is
sorted on disk, so having a way for a data source to pass its ordering and
distribution through to DataFrames would be really useful for us.
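The benefit Николай describes further down the thread can be sketched outside Spark: partitions that are already sorted can be combined with a k-way merge in O(n log k) instead of a full O(n log n) re-sort. This is a minimal illustration in plain Python with no Spark dependency; `merge_sorted_partitions` is an invented name for this sketch, not a Spark API.

```python
import heapq

def merge_sorted_partitions(partitions):
    """k-way merge of partitions that are each already sorted.

    heapq.merge keeps a heap of the current head element of each
    partition, so the combined output is produced in O(n log k) time
    without re-sorting the n rows from scratch.
    """
    return list(heapq.merge(*partitions))

# Three "partitions", each already sorted on disk by the data source.
parts = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(merge_sorted_partitions(parts))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```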

On Mon, Dec 4, 2017 at 9:12 AM, Николай Ижиков <nizhikov....@gmail.com>
wrote:

> Hello, guys.
>
> Thank you for answers!
>
> > I think pushing down a sort .... could make a big difference.
> > You could however propose it for inclusion in Data Source API v2.
>
> Jörn, are you talking about this JIRA issue? -
> https://issues.apache.org/jira/browse/SPARK-15689
> Is there any additional documentation I should read before making a
> proposal?
>
>
>
> 04.12.2017 14:05, Holden Karau wrote:
>
>> I think pushing down a sort (or, really, the case where the data is
>> already naturally returned in sorted order on some column) could make a
>> big difference. Probably the simplest argument that a lot of time is
>> spent sorting (in some use cases) is that sorting is still one of the
>> standard benchmarks.
>>
>> On Mon, Dec 4, 2017 at 1:55 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>     I do not think that the Data Source API exposes such a thing. You
>> could however propose it for inclusion in Data Source API v2.
>>
>>     However, there are some caveats, because "sorted" can mean two
>> different things (weak vs. strict order).
>>
>>     Then, is a lot of time really lost because of sorting? The best
>> thing is to not read data that is not needed at all (see min/max indexes
>> in ORC/Parquet, or bloom filters in ORC). What is not read does not need
>> to be sorted. See also predicate pushdown.
>>
>>      > On 4. Dec 2017, at 07:50, Николай Ижиков <nizhikov....@gmail.com> wrote:
>>      >
>>      > Cross-posting from @user.
>>      >
>>      > Hello, guys!
>>      >
>>      > I am working on a custom DataSource implementation for the Spark
>> DataFrame API and have a question:
>>      >
>>      > For a `SELECT * FROM table1 ORDER BY some_column` query, I can
>> sort the data inside each partition in my data source.
>>      >
>>      > Is there a built-in way to tell Spark that the data in each
>> partition is already sorted?
>>      >
>>      > It seems that Spark could benefit from already-sorted partitions,
>> for example by using a distributed merge-sort algorithm.
>>      >
>>      > Does it make sense to you?
>>      >
>>      >
>>      > 28.11.2017 18:42, Michael Artz wrote:
>>      >> I'm not sure, other than retrieving from a Hive table that is
>> already sorted. This sounds cool, though; I would be interested to know
>> this as well.
>>      >> On Nov 28, 2017 10:40 AM, "Николай Ижиков" <nizhikov....@gmail.com> wrote:
>>      >>    Hello, guys!
>>      >>    I am working on a custom DataSource implementation for the
>> Spark DataFrame API and have a question:
>>      >>    For a `SELECT * FROM table1 ORDER BY some_column` query, I
>> can sort the data inside each partition in my data source.
>>      >>    Is there a built-in way to tell Spark that the data in each
>> partition is already sorted?
>>      >>    It seems that Spark could benefit from already-sorted
>> partitions, for example by using a distributed merge-sort algorithm.
>>      >>    Does it make sense to you?
>>      >>    ---------------------------------------------------------------------
>>      >>    To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>      >
>>      > ---------------------------------------------------------------------
>>      > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>      >
>>
>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
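Jörn's min/max point above can also be sketched without Spark: a row group whose footer statistics rule out the predicate is never read, so it never needs to be sorted either. The `RowGroup` shape below is hypothetical, loosely modeled on ORC/Parquet column statistics; it is not a real Spark or ORC API.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    # Hypothetical stand-in for per-row-group footer statistics.
    min_val: int
    max_val: int
    rows: list

def scan(groups, low, high):
    """Read only row groups whose [min, max] range overlaps [low, high]."""
    out = []
    for g in groups:
        if g.max_val < low or g.min_val > high:
            continue  # skipped entirely: never read, never sorted
        out.extend(r for r in g.rows if low <= r <= high)
    return out

groups = [RowGroup(1, 10, [3, 7, 9]),
          RowGroup(20, 30, [22, 25]),
          RowGroup(40, 50, [41])]
print(scan(groups, 18, 35))  # [22, 25] -- first and third groups skipped
```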
