github-actions[bot] commented on code in PR #60897:
URL: https://github.com/apache/doris/pull/60897#discussion_r2884565191
##########
be/src/vec/exec/format/parquet/vparquet_group_reader.cpp:
##########
@@ -505,13 +608,37 @@ Status RowGroupReader::_do_lazy_read(Block* block, size_t batch_size, size_t* re
         pre_read_rows = 0;
         pre_eof = false;
         FilterMap filter_map;
+        int64_t batch_base_row = _total_read_rows;
         RETURN_IF_ERROR(_read_column_data(block, _lazy_read_ctx.predicate_columns.first,
                                           batch_size, &pre_read_rows, &pre_eof, filter_map));
         if (pre_read_rows == 0) {
             DCHECK_EQ(pre_eof, true);
             break;
         }
         pre_raw_read_rows += pre_read_rows;
+
+        // Condition cache HIT: check if all rows in this batch are in false granules
+        bool cache_skip_batch = false;
+        if (_condition_cache_ctx && _condition_cache_ctx->is_hit) {
+            auto& cache = *_condition_cache_ctx->filter_result;
+            int64_t first_rg_pos = _read_ranges.get_row_index_by_pos(batch_base_row);
+            int64_t last_rg_pos =
+                    _read_ranges.get_row_index_by_pos(batch_base_row + pre_read_rows - 1);
Review Comment:
**[Dead code]** This lazy-read HIT check (`if (_condition_cache_ctx && _condition_cache_ctx->is_hit)`) will never evaluate to true for Parquet.
In `_filter_read_ranges_by_condition_cache()` (called during row group init), `_condition_cache_ctx` is set to `nullptr` after filtering `_read_ranges` on HIT, so by the time execution reaches this lazy-read path, `_condition_cache_ctx` is already null.
The actual HIT optimization for Parquet is entirely handled by pre-filtering
`_read_ranges` in the non-lazy path, which is correct and effective. This block
(lines 616-641 approximately) is unreachable dead code.
Consider either:
1. Removing this dead code block, or
2. If there's an intent to do per-batch cache checking in the lazy path
(e.g., for future use), don't null out `_condition_cache_ctx` in
`_filter_read_ranges_by_condition_cache()` and document the design.
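If option 2 is pursued, the per-batch check amounts to mapping the batch's row span onto granule indices and skipping the batch only when every covering granule is cached as false. A minimal sketch of that idea, with illustrative names (`filter_result`, `granule_size`) that are not the actual Doris API:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: a batch covering rows [first_row, last_row] can be
// skipped only if every granule overlapping that span is false in the
// cached per-granule filter result.
bool batch_all_false(const std::vector<bool>& filter_result, int64_t granule_size,
                     int64_t first_row, int64_t last_row) {
    int64_t first_granule = first_row / granule_size;
    int64_t last_granule = last_row / granule_size;
    for (int64_t g = first_granule; g <= last_granule; ++g) {
        if (filter_result[g]) {
            return false; // a covering granule may contain matching rows
        }
    }
    return true; // safe to skip the whole batch
}
```

Note that a batch straddling a true and a false granule cannot be skipped, which is why the span must be checked granule by granule rather than at its endpoints only.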
##########
be/src/olap/rowset/segment_v2/row_ranges.h:
##########
@@ -275,6 +321,23 @@ class RowRanges {
         _count += range_to_add.count();
     }
+    // Returns the row index (within the original row space) of the pos-th element
+    // across all ranges. For example, if ranges are [0,3000) and [8000,11000),
+    // pos=0 returns 0, pos=2999 returns 2999, pos=3000 returns 8000.
+    int64_t get_row_index_by_pos(int64_t pos) const {
+        size_t remaining = pos;
+        for (const auto& range : _ranges) {
+            size_t range_len = range.count();
+            if (remaining < range_len) {
+                return range.from() + remaining;
+            }
+            remaining -= range_len;
+        }
+        // pos is out of bounds; return -1 to indicate invalid
+        DCHECK(false) << "pos " << pos << " is out of bounds for RowRanges with count " << _count;
+        return -1;
+    }
+
     uint64_t get_digest(uint64_t seed) const {
Review Comment:
**[Bug risk]** `DCHECK(false)` vanishes in RELEASE builds, causing silent
return of -1 for out-of-bounds positions. Per AGENTS.md coding standards,
invariant violations should crash rather than silently continue.
If `pos >= _count` is truly an invariant that should never be violated, this
should use `DORIS_CHECK(false)` (or equivalently, `DORIS_CHECK(pos < _count)`
at the function entry) to catch violations in all build types. A silent return
of -1 in production could cause incorrect granule marking and data correctness
issues.
```cpp
// Suggested: replace DCHECK with DORIS_CHECK
DORIS_CHECK(false) << "pos " << pos << " is out of bounds for RowRanges with count " << _count;
```
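For reference, the mapping the docstring promises can be checked against a minimal standalone model of the function, here over plain `{from, to}` half-open pairs instead of the real `RowRange` type, and with a throw standing in for the crash-on-violation behavior the review suggests:

```cpp
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Standalone model of get_row_index_by_pos: walk the ranges, consuming
// `pos` positions, and translate the remainder into the current range.
int64_t get_row_index_by_pos(const std::vector<std::pair<int64_t, int64_t>>& ranges,
                             int64_t pos) {
    int64_t remaining = pos;
    for (const auto& [from, to] : ranges) {
        int64_t range_len = to - from; // `to` is exclusive
        if (remaining < range_len) {
            return from + remaining;
        }
        remaining -= range_len;
    }
    // Unlike DCHECK(false), this also fires in release builds.
    throw std::out_of_range("pos out of bounds for RowRanges");
}
```

With ranges `[0,3000)` and `[8000,11000)`, pos 3000 maps to 8000 and pos 5999 (the last valid position) maps to 10999; pos 6000 is rejected rather than silently returning -1.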
##########
be/src/vec/exec/format/orc/vorc_reader.h:
##########
@@ -700,6 +717,9 @@ class OrcReader : public GenericReader {
std::unique_ptr<orc::ColumnVectorBatch> _batch;
std::unique_ptr<orc::Reader> _reader = nullptr;
std::unique_ptr<orc::RowReader> _row_reader;
+
+ uint64_t _last_read_row_number = -1;
+ std::shared_ptr<ConditionCacheContext> _condition_cache_ctx;
std::unique_ptr<ORCFilterImpl> _orc_filter;
orc::RowReaderOptions _row_reader_options;
Review Comment:
**[Style]** Initializing `uint64_t` to `-1` gives `UINT64_MAX`
(18446744073709551615). While it's overwritten before first use, this is a code
smell and may trigger `-Wconversion` warnings in files that enable
`compile_check_begin.h`.
Consider using `0` as initial value, or a named sentinel constant if a
specific "uninitialized" value is intended.
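If an "uninitialized" value is in fact intended, a named sentinel makes that explicit without assigning a signed literal to an unsigned member. The constant name below is illustrative, not an existing Doris identifier:

```cpp
#include <cstdint>
#include <limits>

// Named sentinel for "no row read yet"; equals the value that
// `uint64_t x = -1;` produces, but states the intent in the name.
static constexpr uint64_t kInvalidRowNumber = std::numeric_limits<uint64_t>::max();

// In the header, the member would then read:
//     uint64_t _last_read_row_number = kInvalidRowNumber;
```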
##########
be/src/vec/exec/format/parquet/vparquet_group_reader.cpp:
##########
@@ -505,13 +608,37 @@ Status RowGroupReader::_do_lazy_read(Block* block, size_t batch_size, size_t* re
         pre_read_rows = 0;
         pre_eof = false;
         FilterMap filter_map;
+        int64_t batch_base_row = _total_read_rows;
         RETURN_IF_ERROR(_read_column_data(block, _lazy_read_ctx.predicate_columns.first,
                                           batch_size, &pre_read_rows, &pre_eof, filter_map));
         if (pre_read_rows == 0) {
             DCHECK_EQ(pre_eof, true);
Review Comment:
**[Performance observation]** In this lazy-read HIT path, predicate columns
are already read from storage (`_read_column_data` above) before the cache
check occurs. This means the cache HIT in the lazy-read path only saves the
cost of:
- Predicate evaluation
- Lazy (non-predicate) column reads
But the I/O cost for predicate columns is already spent. This contrasts with
the non-lazy path which pre-filters `_read_ranges` to avoid reading any columns
for false granules.
This is noted as a design observation. Since this code is actually dead (see
other comment about `_condition_cache_ctx` being null at this point), it's not
a practical issue currently. But if this code path is ever activated, consider
restructuring to check the cache before reading predicate columns.
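The suggested restructuring can be illustrated with a toy read loop that consults a cached per-granule result *before* issuing any predicate-column read, so fully-false batches cost no I/O. The reader is mocked by an I/O counter; all names here are illustrative, not the Doris API:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Mock reader: counts how many predicate-column reads are actually issued.
struct MockReader {
    int io_calls = 0;
    void read_predicate_columns(int64_t /*first_row*/, int64_t /*rows*/) { ++io_calls; }
};

// Sketch of the cache-first loop: skip a batch before reading anything
// when every granule covering it is cached as false.
int64_t read_with_cache(MockReader& reader, const std::vector<bool>& granule_hit,
                        int64_t granule_size, int64_t total_rows, int64_t batch_size) {
    int64_t rows_kept = 0;
    for (int64_t base = 0; base < total_rows; base += batch_size) {
        int64_t rows = std::min(batch_size, total_rows - base);
        bool all_false = true;
        for (int64_t g = base / granule_size; g <= (base + rows - 1) / granule_size; ++g) {
            if (granule_hit[g]) {
                all_false = false;
                break;
            }
        }
        if (all_false) continue; // skipped with zero I/O
        reader.read_predicate_columns(base, rows);
        rows_kept += rows;
    }
    return rows_kept;
}
```

With granule-aligned batches, only batches overlapping a true granule ever touch storage, which matches what the non-lazy path already achieves by pre-filtering `_read_ranges`.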
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]