github-actions[bot] commented on code in PR #60897:
URL: https://github.com/apache/doris/pull/60897#discussion_r2884565191
##########
be/src/vec/exec/format/parquet/vparquet_group_reader.cpp:
##########
@@ -505,13 +608,37 @@ Status RowGroupReader::_do_lazy_read(Block* block, size_t batch_size, size_t* re
         pre_read_rows = 0;
         pre_eof = false;
         FilterMap filter_map;
+        int64_t batch_base_row = _total_read_rows;
         RETURN_IF_ERROR(_read_column_data(block, _lazy_read_ctx.predicate_columns.first,
                                           batch_size, &pre_read_rows, &pre_eof, filter_map));
         if (pre_read_rows == 0) {
             DCHECK_EQ(pre_eof, true);
             break;
         }
         pre_raw_read_rows += pre_read_rows;
+
+        // Condition cache HIT: check if all rows in this batch are in false granules
+        bool cache_skip_batch = false;
+        if (_condition_cache_ctx && _condition_cache_ctx->is_hit) {
+            auto& cache = *_condition_cache_ctx->filter_result;
+            int64_t first_rg_pos = _read_ranges.get_row_index_by_pos(batch_base_row);
+            int64_t last_rg_pos =
+                    _read_ranges.get_row_index_by_pos(batch_base_row + pre_read_rows - 1);
Review Comment:
**[Dead code]** This lazy-read HIT check (`if (_condition_cache_ctx && _condition_cache_ctx->is_hit)`) will never evaluate to true for Parquet.
In `_filter_read_ranges_by_condition_cache()` (called during row group init), `_condition_cache_ctx` is set to `nullptr` after filtering `_read_ranges` on HIT, so by the time execution reaches this lazy-read path, `_condition_cache_ctx` is already null.
The actual HIT optimization for Parquet is entirely handled by pre-filtering
`_read_ranges` in the non-lazy path, which is correct and effective. This block
(lines 616-641 approximately) is unreachable dead code.
Consider either:
1. Removing this dead code block, or
2. If there's an intent to do per-batch cache checking in the lazy path
(e.g., for future use), don't null out `_condition_cache_ctx` in
`_filter_read_ranges_by_condition_cache()` and document the design.
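If option 2 is pursued, the per-batch check amounts to mapping the batch's row span onto granule indices and skipping the batch only when every covering granule is cached as false. A minimal sketch of that idea, with illustrative names (`filter_result`, `granule_size`) that are not the actual Doris API:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: a batch covering rows [first_row, last_row] can be
// skipped only if every granule overlapping that span is false in the
// cached per-granule filter result.
bool batch_all_false(const std::vector<bool>& filter_result, int64_t granule_size,
                     int64_t first_row, int64_t last_row) {
    int64_t first_granule = first_row / granule_size;
    int64_t last_granule = last_row / granule_size;
    for (int64_t g = first_granule; g <= last_granule; ++g) {
        if (filter_result[g]) {
            return false; // a covering granule may contain matching rows
        }
    }
    return true; // safe to skip the whole batch
}
```

Note that a batch straddling a true and a false granule cannot be skipped, which is why the span must be checked granule by granule rather than at its endpoints only.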
##########
be/src/olap/rowset/segment_v2/row_ranges.h:
##########
@@ -275,6 +321,23 @@ class RowRanges {
         _count += range_to_add.count();
     }
+    // Returns the row index (within the original row space) of the pos-th element
+    // across all ranges. For example, if ranges are [0,3000) and [8000,11000),
+    // pos=0 returns 0, pos=2999 returns 2999, pos=3000 returns 8000.
+    int64_t get_row_index_by_pos(int64_t pos) const {
+        size_t remaining = pos;
+        for (const auto& range : _ranges) {
+            size_t range_len = range.count();
+            if (remaining < range_len) {
+                return range.from() + remaining;
+            }
+            remaining -= range_len;
+        }
+        // pos is out of bounds; return -1 to indicate invalid
+        DCHECK(false) << "pos " << pos << " is out of bounds for RowRanges with count " << _count;
+        return -1;
+    }
+
     uint64_t get_digest(uint64_t seed) const {
Review Comment:
**[Bug risk]** `DCHECK(false)` vanishes in RELEASE builds, causing silent
return of -1 for out-of-bounds positions. Per AGENTS.md coding standards,
invariant violations should crash rather than silently continue.
If `pos >= _count` is truly an invariant that should never be violated, this
should use `DORIS_CHECK(false)` (or equivalently, `DORIS_CHECK(pos < _count)`
at the function entry) to catch violations in all build types. A silent return
of -1 in production could cause incorrect granule marking and data correctness
issues.
```cpp
// Suggested: replace DCHECK with DORIS_CHECK
DORIS_CHECK(false) << "pos " << pos << " is out of bounds for RowRanges with count " << _count;
```
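For reference, the mapping the docstring promises can be checked against a minimal standalone model of the function, here over plain `{from, to}` half-open pairs instead of the real `RowRange` type, and with a throw standing in for the crash-on-violation behavior the review suggests:

```cpp
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Standalone model of get_row_index_by_pos: walk the ranges, consuming
// `pos` positions, and translate the remainder into the current range.
int64_t get_row_index_by_pos(const std::vector<std::pair<int64_t, int64_t>>& ranges,
                             int64_t pos) {
    int64_t remaining = pos;
    for (const auto& [from, to] : ranges) {
        int64_t range_len = to - from; // `to` is exclusive
        if (remaining < range_len) {
            return from + remaining;
        }
        remaining -= range_len;
    }
    // Unlike DCHECK(false), this also fires in release builds.
    throw std::out_of_range("pos out of bounds for RowRanges");
}
```

With ranges `[0,3000)` and `[8000,11000)`, pos 3000 maps to 8000 and pos 5999 (the last valid position) maps to 10999; pos 6000 is rejected rather than silently returning -1.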
##########
be/src/vec/exec/format/orc/vorc_reader.h:
##########
@@ -700,6 +717,9 @@ class OrcReader : public GenericReader {
std::unique_ptr<orc::ColumnVectorBatch> _batch;
std::unique_ptr<orc::Reader> _reader = nullptr;
std::unique_ptr<orc::RowReader> _row_reader;
+
+ uint64_t _last_read_row_number = -1;
+ std::shared_ptr<ConditionCacheContext> _condition_cache_ctx;
std::unique_ptr<ORCFilterImpl> _orc_filter;
orc::RowReaderOptions _row_reader_options;
Review Comment:
**[Style]** Initializing `uint64_t` to `-1` gives `UINT64_MAX`
(18446744073709551615). While it's overwritten before first use, this is a code
smell and may trigger `-Wconversion` warnings in files that enable
`compile_check_begin.h`.
Consider using `0` as initial value, or a named sentinel constant if a
specific "uninitialized" value is intended.
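If an "uninitialized" value is in fact intended, a named sentinel makes that explicit without assigning a signed literal to an unsigned member. The constant name below is illustrative, not an existing Doris identifier:

```cpp
#include <cstdint>
#include <limits>

// Named sentinel for "no row read yet"; equals the value that
// `uint64_t x = -1;` produces, but states the intent in the name.
static constexpr uint64_t kInvalidRowNumber = std::numeric_limits<uint64_t>::max();

// In the header, the member would then read:
//     uint64_t _last_read_row_number = kInvalidRowNumber;
```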
##########
be/src/vec/exec/format/parquet/vparquet_group_reader.cpp:
##########
@@ -505,13 +608,37 @@ Status RowGroupReader::_do_lazy_read(Block* block, size_t batch_size, size_t* re
         pre_read_rows = 0;
         pre_eof = false;
         FilterMap filter_map;
+        int64_t batch_base_row = _total_read_rows;
         RETURN_IF_ERROR(_read_column_data(block, _lazy_read_ctx.predicate_columns.first,
                                           batch_size, &pre_read_rows, &pre_eof, filter_map));
         if (pre_read_rows == 0) {
             DCHECK_EQ(pre_eof, true);
Review Comment:
**[Performance observation]** In this lazy-read HIT path, predicate columns
are already read from storage (`_read_column_data` above) before the cache
check occurs. This means the cache HIT in the lazy-read path only saves the
cost of:
- Predicate evaluation
- Lazy (non-predicate) column reads
But the I/O cost for predicate columns is already spent. This contrasts with
the non-lazy path which pre-filters `_read_ranges` to avoid reading any columns
for false granules.
This is noted as a design observation. Since this code is actually dead (see
other comment about `_condition_cache_ctx` being null at this point), it's not
a practical issue currently. But if this code path is ever activated, consider
restructuring to check the cache before reading predicate columns.
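The suggested restructuring can be illustrated with a toy read loop that consults a cached per-granule result *before* issuing any predicate-column read, so fully-false batches cost no I/O. The reader is mocked by an I/O counter; all names here are illustrative, not the Doris API:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Mock reader: counts how many predicate-column reads are actually issued.
struct MockReader {
    int io_calls = 0;
    void read_predicate_columns(int64_t /*first_row*/, int64_t /*rows*/) { ++io_calls; }
};

// Sketch of the cache-first loop: skip a batch before reading anything
// when every granule covering it is cached as false.
int64_t read_with_cache(MockReader& reader, const std::vector<bool>& granule_hit,
                        int64_t granule_size, int64_t total_rows, int64_t batch_size) {
    int64_t rows_kept = 0;
    for (int64_t base = 0; base < total_rows; base += batch_size) {
        int64_t rows = std::min(batch_size, total_rows - base);
        bool all_false = true;
        for (int64_t g = base / granule_size; g <= (base + rows - 1) / granule_size; ++g) {
            if (granule_hit[g]) {
                all_false = false;
                break;
            }
        }
        if (all_false) continue; // skipped with zero I/O
        reader.read_predicate_columns(base, rows);
        rows_kept += rows;
    }
    return rows_kept;
}
```

With granule-aligned batches, only batches overlapping a true granule ever touch storage, which matches what the non-lazy path already achieves by pre-filtering `_read_ranges`.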
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]