[ 
https://issues.apache.org/jira/browse/CALCITE-4522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300970#comment-17300970
 ] 

Julian Hyde commented on CALCITE-4522:
--------------------------------------

The PR is not in good shape. It was in good shape a few days ago, because I 
have been actively involved in this case for 10 days, and then you jumped in 
with suggestions that made it worse.

The bug now states that a limit/offset operator has zero cpu cost. Untrue.

It also makes “order by a, b, c, d” twice as expensive as “order by a, b”. 
Untrue. 

The code refers to “heap size”. A general-purpose sort is unlikely to use a 
heap. This cost function is for a general-purpose sort, not an in-memory sort.

You continually cite comparison count. Comparisons are not the only CPU cost. 
In sort algorithms, moving data - reading and writing from memory - is a major 
source of CPU cost (bus wait shows up as CPU cost). 

Your example of deciding between HashAgg and Sort in Postgres is not 
applicable. It is deciding between HashAgg and Sort, not between two sorts. One 
of the known limitations of hash-based algorithms is that they have to process 
the entire key. If anything, it proves my point - that a sort over N keys is 
cheap because usually comparisons stop after the first key. 
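
A minimal sketch of the point about comparisons stopping early, in Python. 
The data is hypothetical (uniformly random integer columns), and the helper 
name count_comparisons is made up for illustration. It counts comparisons 
only; it does not measure the cost of moving data, which is the other 
major cost mentioned above.

```python
import functools
import random


def count_comparisons(rows, num_keys):
    """Sort rows on their first num_keys columns.

    Returns (row_comparisons, column_comparisons).
    """
    counts = {"rows": 0, "columns": 0}

    def compare(left, right):
        counts["rows"] += 1
        for i in range(num_keys):
            counts["columns"] += 1
            if left[i] != right[i]:
                # The first differing column decides; later columns are never read.
                return -1 if left[i] < right[i] else 1
        return 0

    sorted(rows, key=functools.cmp_to_key(compare))
    return counts["rows"], counts["columns"]


# Hypothetical data: 10,000 rows of four integer columns, each uniform in [0, 1000).
rng = random.Random(42)
rows = [tuple(rng.randrange(1000) for _ in range(4)) for _ in range(10_000)]

two_key = count_comparisons(rows, 2)   # order by a, b
four_key = count_comparisons(rows, 4)  # order by a, b, c, d

print("order by a, b:       rows=%d columns=%d" % two_key)
print("order by a, b, c, d: rows=%d columns=%d" % four_key)
```

On data like this, the column-comparison count for four keys should stay 
close to the count for two keys, because most row comparisons are decided 
by the first column. Data with many duplicate leading keys would behave 
differently.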

If your past behavior is any guide, you will now attempt to refute each of my 
above points, and filibuster this whole discussion. Please don’t do this. 
Please just back off, and let my advice - as an expert who has been building 
commercial database systems for 30 years - stand.

If you doubt this, start a thread and ask other RDBMS experts to vote between 
my cost model and yours. There are several people with a PhD in databases who 
will back me up. 

> Sort cost should account for the number of columns in collation
> ---------------------------------------------------------------
>
>                 Key: CALCITE-4522
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4522
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: hqx
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> The old method of computing the cost of a sort has some problems.
>  # When the RelCollation is empty, there is no need to sort, but it still 
> computes the cpu cost of a sort.
>  # Using n * log(n) * row_byte to estimate the cpu cost may be inaccurate, 
> where n is the output row count of the sort operator and row_byte is 
> the average number of bytes in one row.
> Instead, I suggest the following.
>  # The cpu cost is zero if the RelCollation is empty.
>  # Let heap_size be min(offset + output_count, input_count), and use 
> input_count * log(heap_size) * row_byte to compute the cpu cost.
> When fetch is zero, I found that output_count is 1, not 0. This conveniently 
> ensures that log(heap_size) is no less than zero.
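
For reference, the two formulas in the description above can be written out 
as a short Python sketch. This transcribes the proposal as stated in the 
issue, which the comment above disputes; it is not Calcite code. The function 
and parameter names are made up for illustration, and the natural logarithm 
is an assumption, since the description does not name a base.

```python
import math


def old_sort_cpu_cost(output_count, row_bytes):
    """Old estimate: n * log(n) * row_byte, with n the output row count."""
    return output_count * math.log(output_count) * row_bytes


def proposed_sort_cpu_cost(input_count, output_count, offset, row_bytes,
                           collation_is_empty):
    """Proposed estimate from the issue description.

    Assumes input_count >= 1 and output_count >= 1 (the description notes
    that output_count is 1, not 0, when fetch is zero), so the logarithm
    is never negative.
    """
    if collation_is_empty:
        # Proposal: no collation means no sorting, so zero cpu cost.
        return 0.0
    heap_size = min(offset + output_count, input_count)
    return input_count * math.log(heap_size) * row_bytes


# Hypothetical case: top 10 rows out of 1,000,000, 100 bytes per row.
print(old_sort_cpu_cost(10, 100))
print(proposed_sort_cpu_cost(1_000_000, 10, 0, 100, False))

# With no limit or offset, output_count equals input_count and the two agree.
print(proposed_sort_cpu_cost(1_000, 1_000, 0, 100, False)
      == old_sort_cpu_cost(1_000, 100))
```

Note that the proposed formula contains no term for the number of collation 
columns, and that the empty-collation branch is what gives a pure 
limit/offset a cpu cost of zero.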



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
