[jira] [Commented] (CALCITE-4522) Sort cost should account for the number of columns in collation

Vladimir Sitnikov (Jira) Thu, 11 Mar 2021 22:40:11 -0800


    [ 
https://issues.apache.org/jira/browse/CALCITE-4522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300077#comment-17300077
 ]


Vladimir Sitnikov commented on CALCITE-4522:
--------------------------------------------

{quote}There is a significant per-row overhead in a sort algorithm even if 
comparisons are very cheap{quote}

Could you please clarify what is the overhead?
We discuss CPU cost here, aren't we?
N Log N comes from https://en.m.wikipedia.org/wiki/Comparison_sort which reads
{quote}Wikipedia: A comparison sort must have an average-case lower bound of 
Ω(n log n) comparison operations,[1] which is known as linearithmic time{quote}
For instance, arrayList.sort(...) is inplace, and its average runtime is 
dominated by comparison.

{quote}The only significant difference is sorting on zero fields (i.e. not 
sorting at all) and sorting on 1 or more fields{quote}
Julian, I do not understand why do you think that a single ideal example proves 
me wrong. Apparently the runtime of comparison (e.g. average) does depend on 
the complexity of the comparator function. The complexity of the comparator 
does have relation with the number of columns compared and its types (e.g. ints 
are easier to compare than strings)

Here's a recent case from PostgreSQL mailing list that shows "bad plan being 
selected" issue caused by the fact that PG does not account for the number of 
columns in the sort key: 
https://www.postgresql.org/message-id/CAKU4AWqx2TeXyhmr1TNBt0u+WNCr=dpTTpNcwudE=knjbfv...@mail.gmail.com

The current inmemory costing model in PostgreSQL agrees with me with an 
exception that miss "number of keys in sort" feature.


> Sort cost should account for the number of columns in collation
> ---------------------------------------------------------------
>
>                 Key: CALCITE-4522
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4522
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: hqx
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> The old method to compute the cost of sort has some problem.
>  # When the RelCollation is empty, there is no need to sort, but it still 
> compute the cpu cost of sort.
>  # use n * log\(n) * row_byte to estimate the cpu cost may be inaccurate, 
> where n means the output row count of the sort operator, and row_byte means 
> the average bytes of one row .
> Instead, I give follow suggestion.
>  # the cpu cost is zero if the RelCollation is empty.
>  # let heap_size be min\(offset + output_count, input_count), and use 
> input_count * log\(heap_size)* row_byte to compute the cpu cost.
> When fetch is zero, I found the output_count is 1 not 0. This conveniently 
> ensure the log\(heap_size) no less than zero



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (CALCITE-4522) Sort cost should account for the number of columns in collation

Reply via email to