adriangb opened a new issue, #15037:
URL: https://github.com/apache/datafusion/issues/15037

   ### Is your feature request related to a problem or challenge?
   
   From discussion with @alamb yesterday the idea came up of optimizing queries 
like `select * from data order by timestamp desc limit 10` for the case where 
the data is not perfectly sorted by `timestamp` but mostly follows a sorted 
pattern.
   
   You can imagine this data gets created if multiple sources with clock skews, 
network delays, etc. are writing data and you don't do anything fancy to 
guarantee perfect sorting by `timestamp` (i.e. you naively write out the data 
to Parquet, maybe do some compaction, etc.). The point is that 99% of 
yesterday's files have a `timestamp` smaller than 99% of today's files but 
there may be a couple seconds of overlap between files. To be concrete, let's 
say this is our data:
   
   | file | min | max |
   |------|-----|-----|
   | 1    | 1   | 10  |
   | 2    | 9   | 19  |
   | 3    | 20  | 31  |
   | 4    | 30  | 35  |
   
   Currently DataFusion will exhaustively open each file, read the `timestamp` 
column and feed it into a `TopK`.
   I think we can do a lot better if we:
   - Use file stats to decide which files to work on first. In this case it 
makes sense to start with file 4 and 3 (assuming we have parallelism of 2).
   - Let's say that between those two we have 10 rows, so we've already filled 
up our `TopK`. The only way more things would get added to our `TopK` is if 
they are greater than the smallest item already seen (let's say that's `20`, 
the smallest value in file 3).
   - Now we know just from statistics that we can skip files 2 and 1 because 
neither of them can have any `timestamp > 20`.
   
   Extrapolating this to scenarios where you have years worth / TBs of data and 
want a `limit 5` would yield orders of magnitude improvement I think.
   
   @alamb mentioned this sounds similar to [Dynamic 
Filters](https://docs.starburst.io/latest/admin/dynamic-filtering.html), I 
assume this must be a known technique (or my analysis may be completely wrong 😆 
) but I don't know what it would be called.
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to