Just to give another data point: most of the data we use with Spark is
sorted on disk, so having a way for a data source to pass its ordering and
distribution through to DataFrames would be really useful for us.
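The benefit Николай describes further down the thread can be sketched outside Spark: partitions that are already sorted can be combined with a k-way merge in O(n log k) instead of a full O(n log n) re-sort. This is a minimal illustration in plain Python with no Spark dependency; `merge_sorted_partitions` is an invented name for this sketch, not a Spark API.

```python
import heapq

def merge_sorted_partitions(partitions):
    """k-way merge of partitions that are each already sorted.

    heapq.merge keeps a heap of the current head element of each
    partition, so the combined output is produced in O(n log k) time
    without re-sorting the n rows from scratch.
    """
    return list(heapq.merge(*partitions))

# Three "partitions", each already sorted on disk by the data source.
parts = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(merge_sorted_partitions(parts))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```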

On Mon, Dec 4, 2017 at 9:12 AM, Николай Ижиков <nizhikov....@gmail.com>
wrote:

> Hello, guys.
>
> Thank you for answers!
>
> > I think pushing down a sort .... could make a big difference.
> > You could however propose it for inclusion in Data Source API v2.
>
> Jörn, are you talking about this JIRA issue? -
> https://issues.apache.org/jira/browse/SPARK-15689
> Is there any additional documentation I should read before making a
> proposal?
>
>
>
> 04.12.2017 14:05, Holden Karau wrote:
>
>> I think pushing down a sort (or, really, the case where the data is
>> already naturally returned in sorted order on some column) could make a
>> big difference. Probably the simplest argument that a lot of time is
>> spent sorting (in some use cases) is that sorting is still one of the
>> standard benchmarks.
>>
>> On Mon, Dec 4, 2017 at 1:55 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>     I do not think that the Data Source API exposes such a thing. You
>> could however propose it for inclusion in Data Source API v2.
>>
>>     However, there are some caveats, because "sorted" can mean two
>> different things (weak vs. strict order).
>>
>>     Then, is a lot of time really lost because of sorting? The best
>> thing is to not read data that is not needed at all (see min/max indexes
>> in ORC/Parquet, or bloom filters in ORC). What is not read does not need
>> to be sorted. See also predicate pushdown.
>>
>>      > On 4. Dec 2017, at 07:50, Николай Ижиков <nizhikov....@gmail.com> wrote:
>>      >
>>      > Cross-posting from @user.
>>      >
>>      > Hello, guys!
>>      >
>>      > I am working on a custom DataSource implementation for the Spark
>> DataFrame API and have a question:
>>      >
>>      > For a `SELECT * FROM table1 ORDER BY some_column` query, I can
>> sort the data inside each partition in my data source.
>>      >
>>      > Is there a built-in way to tell Spark that the data in each
>> partition is already sorted?
>>      >
>>      > It seems that Spark could benefit from already-sorted partitions,
>> for example by using a distributed merge-sort algorithm.
>>      >
>>      > Does it make sense to you?
>>      >
>>      >
>>      > 28.11.2017 18:42, Michael Artz wrote:
>>      >> I'm not sure, other than retrieving from a Hive table that is
>> already sorted. This sounds cool, though; I would be interested to know
>> this as well.
>>      >> On Nov 28, 2017 10:40 AM, "Николай Ижиков" <nizhikov....@gmail.com> wrote:
>>      >>    Hello, guys!
>>      >>    I am working on a custom DataSource implementation for the
>> Spark DataFrame API and have a question:
>>      >>    For a `SELECT * FROM table1 ORDER BY some_column` query, I
>> can sort the data inside each partition in my data source.
>>      >>    Is there a built-in way to tell Spark that the data in each
>> partition is already sorted?
>>      >>    It seems that Spark could benefit from already-sorted
>> partitions, for example by using a distributed merge-sort algorithm.
>>      >>    Does it make sense to you?
>>      >>    ---------------------------------------------------------------------
>>      >>    To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>      >
>>      > ---------------------------------------------------------------------
>>      > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>      >
>>
>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
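Jörn's min/max point above can also be sketched without Spark: a row group whose footer statistics rule out the predicate is never read, so it never needs to be sorted either. The `RowGroup` shape below is hypothetical, loosely modeled on ORC/Parquet column statistics; it is not a real Spark or ORC API.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    # Hypothetical stand-in for per-row-group footer statistics.
    min_val: int
    max_val: int
    rows: list

def scan(groups, low, high):
    """Read only row groups whose [min, max] range overlaps [low, high]."""
    out = []
    for g in groups:
        if g.max_val < low or g.min_val > high:
            continue  # skipped entirely: never read, never sorted
        out.extend(r for r in g.rows if low <= r <= high)
    return out

groups = [RowGroup(1, 10, [3, 7, 9]),
          RowGroup(20, 30, [22, 25]),
          RowGroup(40, 50, [41])]
print(scan(groups, 18, 35))  # [22, 25] -- first and third groups skipped
```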
