acking-you commented on issue #15512:
URL: https://github.com/apache/datafusion/issues/15512#issuecomment-2813577217

   > Thank you [@acking-you](https://github.com/acking-you) for the idea. Is it similar to parquet filter pushdown? We are already trying to make it default. [#3463](https://github.com/apache/datafusion/issues/3463)
   > 
   > With parquet filter pushdown, we will only scan the filtered pages, but the topk filter pushdown is still in progress.
   
   This optimization targets the `select * from xxx order by ... limit ...` scenario, where it can bring significant improvements. Pushing down the filter alone still materializes every column for every scanned row; only the `limit` winning rows actually need all of their columns. Therefore, for the `order by ... limit` scenario, we need to implement lazy materialization of columns.
   
   The approach is easy to understand: complete the sort by reading only the sort columns, then go back and fetch the full rows for just the required `limit` row ids.
   
   ## Take `select * from data order by timestamp desc limit 10` as an example
   The most straightforward execution process for `order by ... limit` might look like the following:
   ```text
   ///  Initial RecordBatch:
   ///    +----------------+---------------------+
   ///    |     Columns     |      Rows           |
   ///    +----------------+---------------------+
   ///    | *All columns*   | row1, row2, ..., rowN |
   ///    +----------------+---------------------+
   ///          ↓
   ///  Full Table Scan:
   ///    (Read all columns and rows into memory)
   ///          ↓
   ///  Scanned RecordBatch:
   ///    +----------------+---------------------+
   ///    | timestamp      | ... (other columns) |
   ///    +----------------+---------------------+
   ///    | 2023-09-01     | ...                 |
   ///    | 2023-09-02     | ...                 |
   ///    | ...            | ...                 |
   ///    +----------------+---------------------+
   ///          ↓
   ///  Sort by `timestamp DESC`:
   ///    (Re-order rows via stable sort on timestamp)
   ///          ↓
   ///  Sorted RecordBatch:
   ///    +----------------+---------------------+
   ///    | timestamp      | ... (other columns) |
   ///    +----------------+---------------------+
   ///    | 2023-09-15     | ...                 |  
   ///    | 2023-09-14     | ...                 |
   ///    | ...            | ...                 |
   ///    +----------------+---------------------+
   ///          ↓
   ///  Apply `LIMIT 10`:
   ///    (Select top 10 rows from sorted batch)
   ///          ↓
   ///  Final RecordBatch:
   ///    +----------------+---------------------+
   ///    | timestamp      | ... (other columns) |
   ///    +----------------+---------------------+
   ///    | 2023-09-15     | ...                 | 
   ///    | 2023-09-14     | ...                 |
   ///    | ... (8 more)   | ...                 |
   ///    +----------------+---------------------+
   ```
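The full-scan path above can be sketched with plain Python lists standing in for Arrow arrays (the column names and data here are made up for illustration, not DataFusion's actual internals):

```python
# Naive ORDER BY ... DESC LIMIT: materialize every column for every row,
# sort the fully materialized rows, then keep only the first `limit`.
rows = [
    {"timestamp": f"2023-09-{d:02d}", "payload": f"data{d}"}
    for d in range(1, 16)  # 15 rows, all columns read up front
]

limit = 10
top = sorted(rows, key=lambda r: r["timestamp"], reverse=True)[:limit]

print(top[0]["timestamp"])   # 2023-09-15
print(top[-1]["timestamp"])  # 2023-09-06
```

Note that the `payload` column is fully read for all 15 rows even though only 10 rows survive the limit; this waste is what the optimization targets.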
   
   With lazy materialization of the non-sort columns, the process instead looks like this:
   ```text
   ///  Initial Data Source:
   ///    +----------------+---------------------+
   ///    |     Columns     |      Rows           |
   ///    +----------------+---------------------+
   ///    | timestamp      | 2023-09-01, ...     |
   ///    | other_col1     | data1, ...          |
   ///    | other_col2     | data2, ...          |
   ///    | ...            | ...                 |
   ///    +----------------+---------------------+
   ///          ↓
   ///  Projection Scan:
   ///    (Only read `timestamp` + generate row IDs)
   ///          ↓
   ///  Scanned RecordBatch:
   ///    +----------------+---------------------+
   ///    | row_id         | timestamp           |
   ///    +----------------+---------------------+
   ///    | 0              | 2023-09-01          |
   ///    | 1              | 2023-09-02          |
   ///    | ...            | ...                 |
   ///    | N-1            | 2023-09-15          |
   ///    +----------------+---------------------+
   ///          ↓
   ///  Sort by `timestamp DESC`:
   ///    (Sort only row_id and timestamp)
   ///          ↓
   ///  Sorted Indexes: [15, 14, 8, ...]  -- List of original row numbers sorted by timestamp
   ///  Sorted Timestamps:
   ///    +----------------+
   ///    | 2023-09-15     |
   ///    | 2023-09-14     |
   ///    | ...            |
   ///    +----------------+
   ///          ↓
   ///  Apply `LIMIT 10`:
   ///    (Select top 10 row_ids)
   ///          ↓
   ///  Final Indexes: [15, 14, 8, 3, 7, 2, 5, 11, 9, 6]
   ///          ↓
   ///  Fetch Other Columns by row_id:
   ///    (Random access to original data via indexes)
   ///          ↓
   ///  Final RecordBatch:
   ///    +----------------+------------+---------+
   ///    | timestamp      | other_col1 | ...     |
   ///    +----------------+------------+---------+
   ///    | 2023-09-15     | data15     | ...     |  -- The original row corresponding to row_id=15
   ///    | 2023-09-14     | data14     | ...     |  -- The original row corresponding to row_id=14
   ///    | ... (8 more)   | ...        | ...     |
   ///    +----------------+------------+---------+
   ```
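A minimal sketch of the lazy variant, again with plain Python lists as stand-ins for Arrow columns (the names are hypothetical; in Arrow/DataFusion terms, phase 1 would be a projected scan plus sort, and phase 2 a `take`-style gather by index):

```python
# Lazy materialization: phase 1 sorts only (sort key, row_id) pairs;
# phase 2 gathers the remaining columns for just the `limit` winners.
timestamps = [f"2023-09-{d:02d}" for d in range(1, 16)]  # sort column only
other_col = [f"data{d}" for d in range(1, 16)]           # NOT read in phase 1

limit = 10

# Phase 1: sort row ids by the key column, keep only the top `limit` ids.
top_ids = sorted(range(len(timestamps)),
                 key=lambda i: timestamps[i],
                 reverse=True)[:limit]

# Phase 2: random-access gather of the other columns by row id.
result = [{"timestamp": timestamps[i], "other_col": other_col[i]}
          for i in top_ids]

print(result[0])  # {'timestamp': '2023-09-15', 'other_col': 'data15'}
```

Only `limit` values of `other_col` are ever touched, which is where the savings come from when the table is wide and the limit is small.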
   
   ## My opinion
   In my previous attempt ([link](https://github.com/apache/datafusion/issues/15512#issuecomment-2813004760)), I found that in order to read just 10 rows, DataFusion ended up scanning an additional 104 columns, which is significant overhead. I believe this approach would be very helpful for that scenario.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

