Joe McDonnell created IMPALA-13944:
--------------------------------------

             Summary: Top-N should have a mode to handle duplicates deterministically
                 Key: IMPALA-13944
                 URL: https://issues.apache.org/jira/browse/IMPALA-13944
             Project: IMPALA
          Issue Type: Task
          Components: Frontend
    Affects Versions: Impala 5.0.0
            Reporter: Joe McDonnell

Top-N is not deterministic when there are duplicate values in the ordering columns:
{noformat}
> select id from functional.alltypes order by int_col limit 5;
+------+
| id   |
+------+
| 1880 |
| 1890 |
| 1870 |
| 1840 |
| 1850 |
+------+
Fetched 5 row(s) in 0.12s

> select id from functional.alltypes order by int_col limit 5;
+-----+
| id  |
+-----+
| 970 |
| 980 |
| 960 |
| 930 |
| 940 |
+-----+
Fetched 5 row(s) in 0.12s
{noformat}
This is expected, but the non-determinism can create problems if a query contains multiple identical Top-Ns whose results are expected to match. It also causes problems for tuple caching.

The Top-N can be made deterministic by ordering over additional columns until any rows that still tie are literally identical. Having a mode that adds all the additional columns automatically would avoid the need for customers to do this themselves.

Adding the additional columns would have a very small impact on performance when there are few duplicates, but it would definitely add a performance penalty when there are many duplicates.
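To illustrate the idea of extending the ordering with every remaining column, here is a minimal sketch in Python. It is not Impala's planner or executor code: the function name `top_n`, the `deterministic` flag, and the sample rows are all hypothetical, and it assumes every column value is comparable (no NULL handling).

```python
def top_n(rows, order_by, n, deterministic=False):
    """Return the first n rows ordered by the column indexes in order_by.

    Hypothetical illustration only. With deterministic=True, every column
    not already in order_by is appended to the sort key, so the only rows
    that can still tie are identical in every column.
    """
    def sort_key(row):
        key = [row[i] for i in order_by]
        if deterministic:
            # Tie-breakers: all remaining columns, in column order.
            key += [value for i, value in enumerate(row) if i not in order_by]
        return tuple(key)

    # sorted() is stable, so rows with equal keys keep their arrival order.
    # That arrival order stands in for the non-determinism in the report.
    return sorted(rows, key=sort_key)[:n]


# Made-up rows shaped like (id, int_col); each int_col value repeats 10 times.
rows = [(i, i % 10) for i in range(100)]
arrival_a = rows        # one possible arrival order
arrival_b = rows[::-1]  # the same rows arriving in reverse

plain_a = top_n(arrival_a, [1], 5)
plain_b = top_n(arrival_b, [1], 5)
stable_a = top_n(arrival_a, [1], 5, deterministic=True)
stable_b = top_n(arrival_b, [1], 5, deterministic=True)

print(plain_a == plain_b)    # False: ties resolved by arrival order
print(stable_a == stable_b)  # True: ties resolved by the extra column
```

The wider sort key is where the extra cost comes from: the additional columns are only compared when the leading columns tie, which matches the expectation of a small penalty with few duplicates and a larger one with many. In SQL, the manual workaround would look like appending columns to the ORDER BY (for example `order by int_col, id`, assuming `id` is unique in that table).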