suremarc commented on issue #10336:
URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2780379669

   > I'll open a follow-up PR to make it default
   
   One of the asks in the original post was additional tests. Some of those 
cases are already covered in the sqllogictest 
([parquet_sorted_statistics.slt](https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt)) 
and some are not, so I'll try to summarize here:
   
   # Case 1: Flexible file schemas
   
   > The schema of the files is different but compatible (e.g. one file has 
(time, date, symbol) but the other file has (date, symbol, time)) 
([source](https://github.com/apache/datafusion/pull/9593#discussion_r1586804068))
   
   > Create files: file1.parquet, file2.parquet both sorted on a but file 1 has 
the columns in the order a, b, c and file 2 has the columns in the order c, b, a. 
The key ranges of values of a should be non-overlapping 
([source](https://github.com/apache/datafusion/issues/10336#issuecomment-2094127979))
   
   As far as I know, this isn't covered in any tests. Based on my understanding 
it shouldn't break anything, but obviously we'd love to have that verified in a 
test 😄 
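
To make the scenario concrete, here is a minimal std-only Rust sketch of what such a test would exercise: the sort column's statistics have to be resolved by name, because the same column sits at a different position in each file. `FileMeta`, `range_of` and `non_overlapping` are hypothetical names invented for this illustration; they are not DataFusion types or APIs.

```rust
/// Hypothetical per-file metadata: column names in file order, plus
/// (min, max) statistics stored in that same order.
#[derive(Debug, Clone)]
struct FileMeta {
    columns: Vec<&'static str>,
    min_max: Vec<(i64, i64)>,
}

/// Look up the (min, max) range of `column` by name, not by position.
/// Returns None if the file does not contain the column.
fn range_of(file: &FileMeta, column: &str) -> Option<(i64, i64)> {
    let idx = file.columns.iter().position(|c| *c == column)?;
    file.min_max.get(idx).copied()
}

/// Some(true) if the two files' key ranges on `column` do not overlap,
/// Some(false) if they do, None if either file lacks the column.
fn non_overlapping(a: &FileMeta, b: &FileMeta, column: &str) -> Option<bool> {
    let (a_min, a_max) = range_of(a, column)?;
    let (b_min, b_max) = range_of(b, column)?;
    Some(a_max < b_min || b_max < a_min)
}
```

With file 1 laid out as (a, b, c) and file 2 as (c, b, a), a positional lookup would compare file 1's `a` against file 2's `c`, which is the kind of mix-up the test should rule out.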
   
   # Case 2: Order by subset of columns
   
   > The query orders by a subset of the columns (e.g. ORDER BY time) 
([source](https://github.com/apache/datafusion/pull/9593#discussion_r1586804068))
   
   This is covered in basically every single query in the sqllogictest, so I 
think this is fine. 
   
   # Case 3: Order by non-ORDER BY columns
   
   > The query orders by a subset of the columns that is not the sort order 
(ORDER BY date) 
([source](https://github.com/apache/datafusion/pull/9593#discussion_r1586804068))
   
   I believe this is missing. If I understand correctly, the expected behavior 
here is failure. 
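
One possible reading of Cases 2 and 3, sketched in std-only Rust: a query ordering can reuse the files' sort order only when it is a leading prefix of that order. The function name and the example sort order are assumptions made for illustration, and this is not how DataFusion itself decides the question.

```rust
/// Hypothetical check: `requested` (the query's ORDER BY columns) can
/// reuse the files' declared sort order only if it is a non-empty
/// leading prefix of `file_order`.
fn ordering_satisfied(file_order: &[&str], requested: &[&str]) -> bool {
    !requested.is_empty() && file_order.starts_with(requested)
}
```

Under this reading, with files sorted on (time, date), `ORDER BY time` is accepted (Case 2) while `ORDER BY date` is rejected (Case 3).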
   
   # Case 4: Files start out of order
   
   > I think all these tests also always have the first file with the minimum 
statistics value -- can you possibly also test what happens when it is not 
(aka add a test that runs this test with file ids 2, 1, 0)? 
([source](https://github.com/apache/datafusion/pull/9593#discussion_r1585524176))
   
   I think this is probably covered by the sqllogictests, specifically the ones 
doing descending ordering. However it should be pretty easy to add a single new 
test case to the unit tests for `FileScanConfig::split_groups_by_statistics`, 
which are located 
[here](https://github.com/apache/datafusion/blob/1a3917545c34e162272af9200da12860744c1abd/datafusion/datasource/src/file_scan_config.rs#L1847)
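
As a rough illustration of what such a test case would check, here is a std-only Rust sketch of grouping files by their min/max statistics. It is a simplified stand-in written for this comment, not the code behind `FileScanConfig::split_groups_by_statistics`; `FileRange` and `split_by_ranges` are hypothetical names.

```rust
/// Hypothetical file descriptor: an id plus the (min, max) statistics
/// of the sort column.
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    id: u32,
    min: i64,
    max: i64,
}

/// Sort files by their minimum value, then place each file into the
/// first group whose last file ends strictly before this one starts.
/// Files inside a group therefore never overlap and stay in order.
fn split_by_ranges(mut files: Vec<FileRange>) -> Vec<Vec<FileRange>> {
    files.sort_by_key(|f| f.min);
    let mut groups: Vec<Vec<FileRange>> = Vec::new();
    for file in files {
        let slot = groups
            .iter()
            .position(|g| g.last().map_or(true, |last| last.max < file.min));
        match slot {
            Some(i) => groups[i].push(file),
            None => groups.push(vec![file]),
        }
    }
    groups
}
```

Feeding this sketch files in the order 2, 1, 0 with non-overlapping ranges should give a single group ordered 0, 1, 2, which is the property the requested unit test would pin down for the real function.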
   
   --
   
   I realize we're eager to get this feature out, but I think this is one of 
the first optimizations that rely on statistics for correctness, so it's 
important we get this right and ensure a healthy amount of tests are in place. 
   
   cc @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

