(doris) branch refact_reader_branch updated: [parquet] Update new reader design docs

suxiaogang223 Wed, 27 May 2026 00:18:06 -0700

This is an automated email from the ASF dual-hosted git repository.

suxiaogang223 pushed a commit to branch refact_reader_branch
in repository https://gitbox.apache.org/repos/asf/doris.git



The following commit(s) were added to refs/heads/refact_reader_branch by this 
push:
     new e1ed7f204d4 [parquet] Update new reader design docs
e1ed7f204d4 is described below

commit e1ed7f204d4658035da0378d37ea9ae64b0ea737
Author: Socrates <[email protected]>
AuthorDate: Wed May 27 15:17:38 2026 +0800

    [parquet] Update new reader design docs
---
 docs/doris-arrow-parquet-reader-implementation.md | 427 +++++++++++-----------
 docs/doris-new-parquet-dictionary-pushdown.md     | 359 ++++++++++++++++++
 2 files changed, 582 insertions(+), 204 deletions(-)

diff --git a/docs/doris-arrow-parquet-reader-implementation.md b/docs/doris-arrow-parquet-reader-implementation.md
index b3bfe5c69e0..d191229e445 100644
--- a/docs/doris-arrow-parquet-reader-implementation.md
+++ b/docs/doris-arrow-parquet-reader-implementation.md
@@ -1,291 +1,310 @@
-# Doris New Parquet Reader Design And Status
+# Doris Arrow Parquet Reader: Implementation Plan and Current Status
 
-This document describes the design and current implementation status of the new
-Parquet reader under `be/src/format/new_parquet/`.
+This document describes the design, current implementation status, and remaining gaps of the new Parquet reader under `be/src/format/new_parquet/`.
 
-The goal of this PR is to build a file-local Parquet reader based on Arrow C++
-Parquet core APIs while keeping Doris-owned `Block` and `Column` as the scan
-output. It does not replace the old `vparquet` path yet.
+The current goal is not to replace the old `vparquet` path, but to first implement a file-local Parquet reader under the new reader API:
 
-## Design Goals
+- The lower layer reuses the Arrow C++ Parquet core API to parse files, row groups, and column chunks.
+- The output is still Doris's own `Block` and `Column`.
+- `parquet::arrow::FileReader`, `arrow::RecordBatch`, and `arrow::Table` are not used as the scan output path.
+- `ParquetReader` understands only the Parquet file-local schema, not the Iceberg/global schema.
+- Table-level semantics such as schema change, filter localization, and default/generated/partition columns live in `TableReader` and `TableColumnMapper`.
 
-- Use Arrow C++ Parquet core APIs for Parquet file metadata, row group and column
-  decoding.
-- Keep `doris::parquet::ParquetReader` as a file-local reader.
-- Do not use `parquet::arrow::FileReader`, `arrow::RecordBatch`,
-  `arrow::Table` or `arrow::Array` as the scan output path.
-- Do not put Iceberg table schema, schema evolution, default columns, generated
-  columns or partition columns into `ParquetReader`.
-- Keep table schema mapping and filter localization in the reader/table layer,
-  especially `TableColumnMapper`.
-- Let the new implementation live in `be/src/format/new_parquet/` so it can
-  evolve independently from the old `be/src/format/parquet/` implementation.
+## Layer Boundaries
 
-## Layering
+The current layering is:
 
 ```text
-TableReader / IcebergTableReader
+FileScanner / TableReader / IcebergTableReader
     -> TableColumnMapper
     -> reader::FileScanRequest
     -> doris::parquet::ParquetReader
     -> DorisRandomAccessFile
     -> parquet::ParquetFileReader
     -> parquet::RowGroupReader
-    -> parquet::ColumnReader / parquet::internal::RecordReader
+    -> parquet::internal::RecordReader
     -> Doris Block / Column
 ```
 
-`ParquetReader` only consumes file-local information:
+Key boundaries:
 
-- file-local schema fields;
-- file-local projection columns;
-- file-local predicate columns;
-- file-local `ColumnPredicate` and `VExprContext` filters.
+- `TableReader` outputs blocks in the table/global schema.
+- `ParquetReader` outputs file-local blocks.
+- `TableColumnMapper` converts table projections/filters into file-local projections/filters.
+- `ParquetReader` does not fill in default columns, materialize partition columns, handle generated columns, or perform Iceberg schema evolution.
+- No table-level cast/finalize/delete/virtual column logic may be pushed back into `ParquetReader`.
 
-Any table-level cast, default value, generated column, partition value, Iceberg
-field id mapping or schema evolution rule must be handled before or after the
-file reader layer.
+## FileReader Lifecycle
 
-## Code Layout
+`ParquetReader` inherits from `reader::FileReader`. Its current lifecycle is:
+
+```text
+init(RuntimeState*)
+    -> get_schema(std::vector<SchemaField>*)
+    -> open(std::unique_ptr<FileScanRequest>&)
+    -> get_block(Block* file_block, size_t* rows, bool* eof)
+    -> close()
+```
+
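+Semantic constraints:
+
+- `init()` opens the physical file and parses the Parquet footer metadata.
+- `get_schema()` can be called once `init()` succeeds; it does not require `open()`.
+- `open()` receives an already localized `FileScanRequest` and performs row group pruning and reader cursor initialization.
+- `get_block()` may only be called after `open()` succeeds, and outputs a file-local block.
+- `rows` is the number of rows output in this file-local block; `eof` indicates whether the current physical file has been fully read.
+
+## Code Layout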
 
 ```text
 be/src/format/new_parquet/parquet_reader.h
 be/src/format/new_parquet/parquet_reader.cpp
 be/src/format/new_parquet/column_reader.h
 be/src/format/new_parquet/column_reader.cpp
+be/src/format/new_parquet/parquet_column_schema.h
+be/src/format/new_parquet/parquet_column_schema.cpp
+be/src/format/new_parquet/parquet_type.h
+be/src/format/new_parquet/parquet_type.cpp
 be/src/format/new_parquet/parquet_statistics.h
 be/src/format/new_parquet/parquet_statistics.cpp
+be/src/format/new_parquet/selection_vector.h
 ```
 
-`parquet_reader.*` owns file open, schema export, scan state, row group
-scheduling, predicate-first reading and output block assembly.
-
-`column_reader.*` owns Doris column assembly for one projected Parquet field. It
-wraps Arrow Parquet column-level APIs and converts decoded values into Doris
-columns.
+Responsibilities:
 
-`parquet_statistics.*` owns row group statistics pruning. Future page index,
-bloom filter and dictionary pruning should also live there rather than being
-mixed into the main scan loop.
+- `parquet_reader.*`: file open, schema export, scan state, row group scheduling, predicate-column-first reading, and file-local block assembly.
+- `column_reader.*`: reads a single Parquet field into a Doris column; wraps the Arrow internal `RecordReader`.
+- `parquet_column_schema.*`: builds the file-local schema tree from the Parquet schema descriptor.
+- `parquet_type.*`: parses Parquet physical/logical/converted types and produces the Doris file-local type plus additional type information.
+- `parquet_statistics.*`: conservative statistics pruning based on row group metadata.
+- `selection_vector.h`: represents the selected row offsets within a batch, used for lazy materialization.
 
-## Main Components
+## Main Components
 
 ### DorisRandomAccessFile
 
-`DorisRandomAccessFile` adapts Doris `io::FileReader` to
-`arrow::io::RandomAccessFile`.
+`DorisRandomAccessFile` adapts Doris `io::FileReader` to `arrow::io::RandomAccessFile`.
 
-It only handles random IO and file size lookup. It does not parse Parquet schema,
-does not evaluate filters, and does not carry table-level semantics.
+It only handles random reads and file size lookup. It does not parse the Parquet schema, carry the table schema, or evaluate filters.
 
 ### ParquetReaderScanState
 
-`ParquetReaderScanState` is an internal scan state stored in
-`parquet_reader.cpp`. It tracks:
+`ParquetReaderScanState` is internal state of `parquet_reader.cpp`. It records:
 
-- Arrow random access file;
-- Arrow Parquet file reader and metadata;
-- Parquet schema descriptor;
-- selected row groups;
-- current row group reader;
-- current row group row offset;
-- projected file columns;
-- predicate columns and non-predicate output columns;
-- current row group column readers.
+- the Arrow random access file;
+- the Arrow Parquet file reader;
+- the Parquet footer metadata;
+- the Parquet schema descriptor;
+- the file-local schema tree;
+- the row groups selected by row group statistics;
+- the current row group reader;
+- the number of rows already read in the current row group;
+- predicate column readers;
+- non-predicate column readers.
 
-This state is intentionally private to the Parquet reader implementation.
+This state is not exposed to the table reader.
+
+### ParquetColumnSchema and ParquetTypeDescriptor
+
+`ParquetColumnSchema` describes the file-local schema tree, including:
+
+- Parquet node name;
+- Parquet field id;
+- top-level field id;
+- leaf column id;
+- Doris file-local type;
+- child column schemas;
+- the `ParquetTypeDescriptor` of a primitive column.
+
+`ParquetTypeDescriptor` stores the result of parsing Parquet annotations, including:
+
+- physical type;
+- the Doris type derived from the logical type / converted type;
+- decimal precision/scale;
+- time/timestamp unit;
+- whether the column is string-like;
+- whether the current RecordReader read path supports it.
+
+Type parsing has moved out of `column_reader.cpp` and forward into `parquet_type.*`; the `ColumnReader` hot path only consumes the parsed result.
 
 ### ParquetColumnReader
 
-`ParquetColumnReader` is Doris's file-local column reader abstraction. It is not
-the same as Arrow's `parquet::ColumnReader`.
+`ParquetColumnReader` is Doris's own file-local column reader abstraction, not Arrow's `parquet::ColumnReader`.
 
-Current implementations:
+The interface has currently converged to:
 
-- `PrimitiveColumnReader`
-- `StructColumnReader`
+```text
+read(rows, column, rows_read)
+skip(rows)
+select(selection, selected_rows, batch_rows, column)
+```
+
+Current implementations:
 
-`PrimitiveColumnReader` supports both the existing Arrow
-`parquet::TypedColumnReader` path and the new
-`parquet::internal::RecordReader` path for selected primitive reads.
+- `ScalarColumnReader`: reads flat primitive/string/decimal/time/timestamp columns using the Arrow internal `RecordReader`.
+- `StructColumnReader`: reads children recursively and supports very basic struct assembly.
 
-`StructColumnReader` currently supports basic required struct assembly by
-recursively reading child readers. Complex nested selective materialization is
-not complete.
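+`select()` is implemented once in the base class: it merges the `SelectionVector` into contiguous row ranges, then alternates calls to `skip()` and `read()`. There is currently no fallback that reads the whole batch and filters afterwards.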
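
A minimal sketch of the range-merging idea behind `select()`, assuming the selection is a sorted list of unique `uint16_t` row offsets. `ReadStep`, `merge_selection_to_ranges` and `plan_skip_read` are hypothetical names used only for illustration, not Doris APIs; a real reader would call `skip()` and `read()` directly instead of returning a plan, and the trailing skip to the batch boundary is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: merges sorted, unique row offsets into half-open
// [begin, end) ranges of consecutive rows.
inline std::vector<std::pair<size_t, size_t>> merge_selection_to_ranges(
        const std::vector<uint16_t>& selection) {
    std::vector<std::pair<size_t, size_t>> ranges;
    for (uint16_t row : selection) {
        if (!ranges.empty() && ranges.back().second == row) {
            ++ranges.back().second; // extends the current contiguous range
        } else {
            ranges.emplace_back(static_cast<size_t>(row), static_cast<size_t>(row) + 1);
        }
    }
    return ranges;
}

// Hypothetical plan step: either skip `rows` rows or read `rows` rows.
struct ReadStep {
    bool skip;
    size_t rows;
};

// Turns a selection into alternating skip/read steps covering the whole batch,
// so the reader cursor ends exactly at the batch boundary.
inline std::vector<ReadStep> plan_skip_read(const std::vector<uint16_t>& selection,
                                            size_t batch_rows) {
    std::vector<ReadStep> steps;
    size_t cursor = 0;
    const std::vector<std::pair<size_t, size_t>> ranges = merge_selection_to_ranges(selection);
    for (const std::pair<size_t, size_t>& range : ranges) {
        if (range.first > cursor) {
            steps.push_back({true, range.first - cursor});
        }
        steps.push_back({false, range.second - range.first});
        cursor = range.second;
    }
    if (cursor < batch_rows) {
        steps.push_back({true, batch_rows - cursor});
    }
    return steps;
}
```

For a selection of `{1, 2, 5}` in a batch of 8 rows, the plan skips 1 row, reads 2, skips 2, reads 1, then skips the remaining 2.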
 
 ### ParquetColumnReaderFactory
 
-`ParquetColumnReaderFactory` creates Doris column readers from the current row
-group's Arrow Parquet readers and the file-local `ParquetColumnSchema`.
+`ParquetColumnReaderFactory` creates column readers from the current row group and `ParquetColumnSchema`.
+
+It centralizes creation and caching of the Arrow internal `RecordReader`, so that Arrow internal APIs do not leak into the main `ParquetReader` flow.
+
+### DataTypeSerDe Interface for Reading Decoded Values
+
+`ScalarColumnReader` does not switch on Parquet values directly into Doris columns. Instead it builds a `DecodedColumnView` and then calls:
+
+```text
+DataTypeSerDe::read_column_from_decoded_values(...)
+```
+
+SerDes wired in so far include number, string, decimal, date/time/datetime, and nullable types. This separates "Parquet decoding" from "writing Doris types" and reduces the Doris type dispatch logic inside `ColumnReader`.
+
+## Scan Request Semantics
+
+The new reader consumes `reader::FileScanRequest`.
+
+Important fields:
+
+- `predicate_columns`: file-local columns that must be read first to compute the selection.
+- `non_predicate_columns`: file-local columns read after the selection is determined.
+- `column_positions`: mapping from file column id to position in the file-local output block.
+- `local_filters`: filters already localized to the file schema.
+- `reader_expression_map`: fallback expressions for table filters that cannot be safely converted into file-local predicates.
 
-The factory centralizes reader construction so later work can add reader
-options, Dremel assemblers, selected-read policies and cache state without
-passing those details through free functions.
+The column order and types of the output block follow `column_positions`, not the table/global schema.
 
-### ParquetStatisticsUtils
+## Predicate Pushdown
 
-`ParquetStatisticsUtils` compiles file-local predicates into Parquet column
-predicate plans and evaluates row group metadata conservatively.
+Implemented so far:
 
-It only understands Parquet file-local schema and Doris `ColumnPredicate`. It
-does not know Iceberg schema, slot descriptors or table schema mapping.
+- row-group-level min/max statistics pruning;
+- `IS NULL` / `IS NOT NULL` pruning driven by null count;
+- conservatively keeping the row group when statistics are unsupported or missing, or when the comparison is unsafe.
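
A minimal sketch of the conservative pruning rule for a single `int64` column chunk: a row group is dropped only when the statistics prove that no row can match, and kept whenever statistics are missing. `Int64ChunkStats` and the `may_match_*` helpers are hypothetical names for illustration; they are not the `ParquetStatisticsUtils` API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of one column chunk's statistics.
struct Int64ChunkStats {
    bool has_min_max = false;
    int64_t min = 0;
    int64_t max = 0;
    bool has_null_count = false;
    int64_t null_count = 0;
    int64_t num_values = 0; // total values in the chunk, NULLs included (flat column assumed)
};

// `c = literal`: keep unless min/max prove the literal is out of range.
inline bool may_match_eq(const Int64ChunkStats& stats, int64_t literal) {
    if (!stats.has_min_max) {
        return true; // missing statistics: keep the row group
    }
    return literal >= stats.min && literal <= stats.max;
}

// `c IS NULL`: keep unless the null count proves there are no NULLs.
inline bool may_match_is_null(const Int64ChunkStats& stats) {
    if (!stats.has_null_count) {
        return true;
    }
    return stats.null_count > 0;
}

// `c IS NOT NULL`: keep unless the null count proves every value is NULL.
inline bool may_match_is_not_null(const Int64ChunkStats& stats) {
    if (!stats.has_null_count) {
        return true;
    }
    return stats.null_count < stats.num_values;
}
```

Each helper answers "may this row group contain a match?", so `true` means keep and only a proven `false` allows skipping.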
 
-## Scan Request Semantics
+Not yet implemented:
 
-The new reader consumes `reader::FileScanRequest`.
+- page index pruning;
+- bloom filter pruning;
+- dictionary pruning;
+- direct in-batch execution of structured `ColumnPredicate`;
+- execution of `reader_expression_map` fallback expressions.
 
-Important fields:
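+Note: `local_filters.predicates` already feed the row group statistics path, but in the in-batch filtering stage `ParquetReader::_read_filter_columns()` mainly handles `local_filter.conjunct`. So if a predicate exists only as a `ColumnPredicate`, the in-batch second-pass filter for it is still missing.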
 
-- `predicate_columns`: file-local columns that must be read first to evaluate
-  filters.
-- `non_predicate_columns`: file-local projection columns that are only read
-  after selection is known.
-- `projected_columns`: file-local columns that should appear in the output block.
-- `local_filters`: file-local filters produced by the table layer.
-- `reader_expression_map`: fallback expressions for filters that cannot be
-  represented as direct file-local predicates.
+## Current Status of Lazy Materialization
 
-The output block is still file-local. It is not a table/global schema block.
+The current scan loop follows a predicate-first model:
 
-## Predicate Pushdown
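+1. Read `predicate_columns`.
+2. Evaluate the expression filters and build a `SelectionVector`.
+3. If a predicate column is also in the output block, reuse the already decoded predicate column and filter it by the selection.
+4. Call `ColumnReader::select()` on `non_predicate_columns` to read only the selected rows.
+5. Return the file-local block.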
 
-Doris new reader uses two existing filter representations:
+Existing capabilities:
 
-- `ColumnPredicate`: structured single-column predicates, used for row group
-  statistics pruning and decoded value filtering.
-- `VExprContext`: expression filters, used for fallback expression evaluation
-  and residual filters.
+- basic selected read for flat primitive/string/decimal/time/timestamp columns;
+- skipping the whole batch of non-predicate columns when the selection is empty;
+- merging a sparse selection into multiple contiguous ranges;
+- not re-reading a predicate column that is also projected.
 
-Current implementation status:
+Main gaps:
 
-- row group min/max pruning is wired through `parquet_statistics.*`;
-- supported stats types include boolean, int32, int64, float, double and
-  string/binary;
-- unsupported stats, missing stats or unsafe cases keep the row group;
-- `IS NULL` and `IS NOT NULL` pruning use null count when available;
-- page index, bloom filter and dictionary pruning are not implemented yet.
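+- in-batch `ColumnPredicate` execution is not wired into the selection;
+- `reader_expression_map` is still a TODO;
+- the selection index is currently `uint16_t`, so the batch size needs an explicit bound;
+- selected read depends on the Arrow internal `RecordReader::SkipRecords` and `ReadRecords`, which must stay isolated in `column_reader.*`;
+- there is no page-level row range selection;
+- lazy materialization for complex columns is not implemented.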
 
-Correctness rule: pruning must be conservative. If the reader cannot prove that
-a row group cannot match, it must keep the row group.
+## Current Status of Schema Change
 
-## Lazy Materialization
+The current principle: `ParquetReader` does not understand schema change; schema change is handled by `TableColumnMapper` and `TableReader`.
 
-The scan loop follows a predicate-first model:
+Existing capabilities:
 
-1. Read predicate columns.
-2. Evaluate `ColumnPredicate` and build a selection vector.
-3. If a predicate column is also projected, reuse the decoded predicate column.
-4. Read non-predicate output columns using the selection.
-5. Assemble the file-local output block in projected column order.
+- `TableReader` uses `TableColumnMappingMode::BY_FIELD_ID` by default at initialization.
+- `TableColumnMapper` can build a `ColumnMapping` from the table columns and the file schema.
+- A missing partition column can get a constant mapping generated from its partition value.
+- A missing regular column can use `default_expr`.
+- When the file type differs from the table type, a finalize cast projection can be generated.
+- Virtual columns have mapping markers for `ROW_ID` and `LAST_UPDATED_SEQUENCE_NUMBER`.
 
-The current selected-read implementation uses Arrow Parquet
-`parquet::internal::RecordReader` for supported primitive columns.
+Main gaps:
 
-Why this is needed:
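+- `SchemaField::id` currently serves as both the file-local column id and the mapping id, so the boundary is not clear enough. In particular, a top-level primitive currently uses the leaf column id, and the Iceberg field id mapping needs to be reworked.
+- `_is_same_type()` is only a `DataTypePtr` pointer comparison and cannot reliably express type equivalence.
+- Filter localization is still a stub; trivial mapping, safe cast, reader expression fallback, and finalize-only filters are not fully implemented.
+- `reader_filter_expr` is neither generated nor executed.
+- Schema change for complex columns has no child-level mapping.
+- Equality delete, position delete, virtual columns, and finalize in `IcebergTableReader` are still framework stubs.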
 
-- `parquet::TypedColumnReader::Skip` skips physical values, not SQL rows.
-- For nullable columns, row count and physical value count differ.
-- For repeated/nested columns, a row can contain multiple physical values.
+## Current Status of Complex Columns
 
-`RecordReader::SkipRecords` and `RecordReader::ReadRecords` provide row-level
-movement. Doris compresses the selection vector into row ranges and alternates
-skip/read operations.
+Existing capabilities:
 
-Current support:
+- The schema builder recognizes `STRUCT`, `LIST`, and `MAP`.
+- Complex Parquet schemas can be composed into Doris `DataTypeStruct`, `DataTypeArray`, and `DataTypeMap`.
+- `StructColumnReader` can read children recursively and supports very basic non-nullable structs.
 
-- selected read for primitive boolean, int32, int64, float and double when the
-  RecordReader path is available;
-- fallback path reads the whole batch and filters it when selected read is not
-  supported;
-- output columns are skipped when the selection is empty;
-- predicate columns are reused when they are also projected.
+Main gaps:
 
-Limitations:
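+- nullable struct is not implemented.
+- list reader is not implemented.
+- map reader is not implemented.
+- repeated / nested definition level assembler is not implemented.
+- the primitive reader currently supports only the RecordReader path with `max_repetition_level == 0 && max_definition_level <= 1`.
+- complex column pruning is not implemented.
+- complex column lazy materialization is not implemented.
+- complex column schema evolution / child remap is not implemented.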
 
-- `parquet::internal::RecordReader` is an Arrow internal/experimental API, so it
-  must remain hidden behind Doris `ParquetColumnReader`;
-- string, decimal and timestamp selected reads still need broader validation;
-- nested selected materialization needs a dedicated Dremel assembler.
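+Conclusion: complex columns are currently visible in the schema, but their read support is incomplete. Making them truly usable requires a Dremel assembler or an equivalent nested column assembler.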
 
-## Type Coverage
+## Summary of Current Capabilities
 
-Currently implemented:
+The new reader can currently:
 
-- flat required and nullable boolean;
-- flat required and nullable int32 / int64;
-- flat required and nullable float / double;
-- BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY string/binary with Doris-owned memory;
-- decimal with precision up to 38 for int32, int64, byte array and fixed-length
-  byte array physical encodings;
-- INT64 timestamp millis and micros into Doris `DateTimeV2`;
-- basic required struct assembly.
+- open a Parquet file and parse the footer;
+- export the file-local schema;
+- prune conservatively based on row group statistics;
+- read flat required/nullable primitives;
+- read string/binary;
+- read decimals with precision <= 38 in the common physical encodings;
+- read some encodings of date/time/datetime;
+- write Doris columns through `DataTypeSerDe::read_column_from_decoded_values()`;
+- run a basic predicate-first scan;
+- perform selected reads on flat columns;
+- provide an initial read framework for non-nullable structs.
 
-Not implemented or incomplete:
+It is not yet fully production-ready. In particular it lacks:
 
-- INT96 timestamp;
-- nanosecond timestamp;
-- TIMESTAMPTZ semantics;
-- DECIMAL256;
-- nullable struct;
-- list and map;
-- complex column pruning;
-- complex column lazy materialization.
+- complete field id semantics for schema change;
+- a complete implementation of filter localization;
+- in-batch `ColumnPredicate` execution;
+- `reader_expression_map`;
+- page index / bloom filter / dictionary pruning;
+- list/map/nullable struct;
+- nested column pruning;
+- nested lazy materialization;
+- adequate unit test coverage.
 
-## Current Implementation Status
+## Next-Step Priorities
 
-Implemented in this PR:
+Suggested order of work:
 
-- new `new_parquet` module;
-- Arrow-backed Parquet file open and metadata read;
-- file-local schema export;
-- row group scheduling;
-- projected leaf reader creation;
-- primitive column decoding into Doris columns;
-- string, decimal and INT64 timestamp decoding;
-- basic struct reader;
-- row group statistics pruning skeleton and initial implementation;
-- predicate-first scan flow;
-- primitive RecordReader-backed selected materialization;
-- Debug BE build fixes.
+1. Converge the id semantics of `SchemaField` and `ColumnMapping`, distinguishing the Iceberg field id, the Parquet leaf column id, and the file-local output position.
+2. Complete in-batch `ColumnPredicate` execution so that a correct residual filter still runs after row group pruning.
+3. Implement `reader_expression_map` to support fallback for filters that cannot be safely pushed down under schema change.
+4. Add selected-read unit tests for flat primitive/string/decimal/timestamp columns.
+5. Implement nullable struct, then the list/map assemblers.
+6. Once the complex column assemblers are stable, add nested pruning and nested lazy materialization.
+7. Hook up page index, bloom filter, and dictionary pruning afterwards.
 
-Validated:
-
-- `git diff --check`;
-- `BUILD_TYPE=DEBUG ./build.sh --be` on
-  `fedora:/home/socrates/code/doris`.
-
-## Future Work
+## Core Rule
 
-Near term:
-
-- add unit tests for primitive required/nullable selected reads;
-- validate selection edge cases: empty selection, full selection, sparse
-  selection and highly fragmented ranges;
-- add a selection-rate policy so dense selections can fall back to whole-batch
-  read plus filter;
-- stabilize string, decimal and timestamp selected reads;
-- keep Arrow internal API usage isolated in `column_reader.*`.
-
-Mid term:
+`ParquetReader` must remain a file-local reader.
 
-- implement page index pruning in `parquet_statistics.*`;
-- implement bloom filter pruning for equality predicates;
-- add dictionary-aware filtering where Arrow exposes enough metadata safely;
-- expand complex type assembly for nullable struct, list and map;
-- add tests for row group pruning correctness and unsupported-type fallback.
-
-Long term:
-
-- support nested column pruning;
-- support nested lazy materialization;
-- support page-level row range selection;
-- integrate the new file-local reader with table readers after the API boundary
-  is stable;
-- keep old `vparquet` compatibility until the new path is functionally complete.
-
-## Key Rule
-
-`ParquetReader` must remain a file-local reader. If a feature requires table
-schema, Iceberg schema evolution, partition values, default/generated columns or
-final table output semantics, it belongs in `TableColumnMapper` or
-`TableReader`, not in `be/src/format/new_parquet/`.
+Whenever a feature needs the table schema, Iceberg schema evolution, partition values, default/generated columns, delete files, or final table block semantics, it belongs in `TableColumnMapper`, `TableReader`, or a concrete table reader, not in `be/src/format/new_parquet/`.
diff --git a/docs/doris-new-parquet-dictionary-pushdown.md b/docs/doris-new-parquet-dictionary-pushdown.md
new file mode 100644
index 00000000000..7ce6b1a12c3
--- /dev/null
+++ b/docs/doris-new-parquet-dictionary-pushdown.md
@@ -0,0 +1,359 @@
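+# Doris New Parquet Reader: Dictionary Predicate Pushdown Design
+
+## Background
+
+The new parquet reader lives in `be/src/format/new_parquet/`. Its read path is built on the Arrow
+Parquet core API and outputs Doris `Block` / `Column`.
+
+Two kinds of predicate-related capability are implemented so far:
+
+- row-group-level min/max/null statistics pruning;
+- after reading the predicate columns, building a `SelectionVector` with Doris `ColumnPredicate`, then lazily materializing the non-predicate columns.
+
+Dictionary predicate pushdown is not implemented yet, mainly because
+`ParquetColumnReaderFactory` creates the Arrow `RecordReader` with: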
+```cpp
+_row_group->RecordReader(leaf_column_id, /*read_dictionary=*/false);
+```
+
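+As a result, the lower layer decodes dictionary-encoded columns straight into plain values. By the time `ParquetReader` runs
+`ColumnPredicate::evaluate()`, neither the dictionary page nor the dictionary ids are visible.
+
+This document describes the design for implementing dictionary column predicate pushdown in the new parquet reader.
+
+## Goals
+
+The goal of dictionary predicate pushdown is not to replace the existing statistics pruning, but to add a stronger class of filtering: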
+
+```sql
+where c = 'abc'
+where c in ('a', 'b', 'c')
+where c != 'x'
+```
+
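+If a Parquet column chunk is fully dictionary encoded, it is enough to check the dictionary values or dictionary
+ids, without first decoding the whole column into a string column.
+
+Expected benefits:
+
+- skip row groups that cannot match, early, at the row group level;
+- avoid string materialization of predicate columns at the batch level;
+- combine with the existing `SelectionVector` / lazy materialization path to reduce how much non-predicate column data is read.
+
+## Current Implementation Status
+
+### Available
+
+- `ParquetStatisticsUtils` already has the file-local `ParquetColumnPredicate` plan structure.
+- `ParquetReader` already has the predicate-column-first read flow.
+- `SelectionVector` can already represent the selected row offsets within a batch.
+- `ParquetColumnReader::select()` can already perform a selected read of non-predicate columns by selection.
+
+### Missing
+
+- No check for whether a column chunk is fully dictionary encoded.
+- No interface for reading the dictionary page and converting it into a Doris Column.
+- No dictionary id reader.
+- No predicate rewrite from dictionary values to dict ids.
+- Dictionary id selection is not wired into the current `SelectionVector`.
+
+The current implementation therefore cannot use dictionary column predicate pushdown.
+
+## Layering Principles
+
+Dictionary predicate pushdown must keep file-local semantics:
+
+- `TableColumnMapper` converts table filters into file-local `ColumnPredicate`.
+- `ParquetReader` only consumes the file-local `FileScanRequest`.
+- Dictionary pages, encodings, and dictionary ids belong to the Parquet file format layer and must not leak into
+  the Iceberg/table schema layer.
+
+Suggested placement:
+
+```text
+be/src/format/new_parquet/parquet_statistics.*
+    row-group-level dictionary pruning
+
+be/src/format/new_parquet/column_reader.*
+    reading dictionary values / dictionary ids
+
+be/src/format/new_parquet/parquet_reader.cpp
+    wiring dictionary selection into the existing predicate-first scan loop
+```
+
+## Approach 1: Row-Group-Level Dictionary Pruning
+
+### Idea
+
+For a fully dictionary-encoded column chunk, the dictionary page contains every non-NULL value that can appear in that row group.
+If no dictionary value can satisfy the predicate, the whole row group can be skipped.
+
+Example:
+
+```text
+predicate: name = 'Bob'
+dictionary values: ['Alice', 'Cindy']
+
+=> no value in the dictionary satisfies name = 'Bob'
+=> the row group can be skipped
+```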
+
+### Flow
+
+```text
+FileScanRequest.local_filters
+    -> Build ParquetColumnPredicate
+    -> for each row group / column chunk:
+       1. check whether the column chunk is fully dictionary encoded
+       2. read the dictionary page
+       3. materialize the dictionary values into a Doris Column
+       4. evaluate the ColumnPredicate on the dictionary values
+       5. if no dictionary value matches, skip the row group
+```
+
+### Detecting Full Dictionary Encoding
+
+Parquet allows a column chunk to start with dictionary encoding and later fall back to plain encoding.
+Such mixed encoding cannot be used for row-group-level dictionary pruning, because values in the plain pages would be missed.
+
+The check can follow the old `vparquet`:
+
+- Prefer `encoding_stats`:
+  - every `DATA_PAGE` must be `PLAIN_DICTIONARY` or `RLE_DICTIONARY`;
+  - there must be no non-dictionary data page with count > 0.
+- If `encoding_stats` is absent, fall back to checking `encodings`:
+  - it must contain a dictionary encoding;
+  - it must not contain any data encoding other than the dictionary encoding, `RLE`, and `BIT_PACKED`.
+
+Note that `RLE` / `BIT_PACKED` may be used for definition/repetition levels, so their presence does not mean the values
+are not dictionary encoded.
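
A minimal sketch of the two-step check above. The `Encoding`, `PageType` and `PageEncodingStat` types are hypothetical local mirrors of the Parquet metadata enums, trimmed for the example, and `is_fully_dictionary_encoded` is an illustrative name rather than an existing Doris or Arrow function. Treating `DATA_PAGE_V2` as a data page, and requiring at least one dictionary-encoded data page when `encoding_stats` is present, are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical local mirrors of the Parquet metadata enums.
enum class Encoding { PLAIN, PLAIN_DICTIONARY, RLE, BIT_PACKED, RLE_DICTIONARY, DELTA_BINARY_PACKED };
enum class PageType { DATA_PAGE, DICTIONARY_PAGE, DATA_PAGE_V2 };

struct PageEncodingStat {
    PageType page_type;
    Encoding encoding;
    int32_t count;
};

inline bool is_dictionary_encoding(Encoding e) {
    return e == Encoding::PLAIN_DICTIONARY || e == Encoding::RLE_DICTIONARY;
}

// Returns true only when every data page is provably dictionary encoded.
inline bool is_fully_dictionary_encoded(const std::vector<PageEncodingStat>& encoding_stats,
                                        const std::vector<Encoding>& encodings) {
    if (!encoding_stats.empty()) {
        bool saw_dictionary_data_page = false;
        for (const PageEncodingStat& stat : encoding_stats) {
            const bool is_data_page = stat.page_type == PageType::DATA_PAGE ||
                                      stat.page_type == PageType::DATA_PAGE_V2;
            if (!is_data_page || stat.count <= 0) {
                continue; // dictionary pages and empty entries do not matter
            }
            if (!is_dictionary_encoding(stat.encoding)) {
                return false; // a plain (or other) data page exists: mixed encoding
            }
            saw_dictionary_data_page = true;
        }
        return saw_dictionary_data_page;
    }
    // Fallback: only dictionary encodings plus level encodings are allowed.
    bool has_dictionary = false;
    for (Encoding e : encodings) {
        if (is_dictionary_encoding(e)) {
            has_dictionary = true;
        } else if (e != Encoding::RLE && e != Encoding::BIT_PACKED) {
            return false;
        }
    }
    return has_dictionary;
}
```

Under the fallback rule, a writer that lists `PLAIN` in `encodings` only because the dictionary page itself is plain encoded is reported as not fully dictionary encoded. That loses some pruning opportunities but stays on the conservative side.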
+
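+### Supported Predicates
+
+The first stage should support only structured `ColumnPredicate`:
+
+- `EQ`
+- `IN`
+- `NE`
+- `NOT IN`
+- `IS NULL`
+- `IS NOT NULL`
+
+NULL semantics need care:
+
+- the dictionary page does not contain NULL;
+- `IS NULL` / `IS NOT NULL` still need the column chunk null count;
+- NULL predicates cannot be decided from the dictionary values alone.
+
+More complex expression filters, such as `lower(name) = 'abc'`, are out of scope for the first stage.
+
+### Correctness Rules
+
+Row-group-level pruning must be conservative:
+
+- if full dictionary encoding cannot be confirmed, keep the row group;
+- if the dictionary page cannot be read, keep the row group;
+- if the predicate type is not supported, keep the row group;
+- if the type conversion is unsafe, keep the row group;
+- if NULL semantics cannot be confirmed, keep the row group.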
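
The correctness rules above can be sketched as a single decision function in which every uncertain case falls through to "keep". `DictionaryProbe`, `RowGroupDecision` and `prune_by_dictionary` are hypothetical names for illustration; the predicate is modelled as a plain callable over string values rather than a Doris `ColumnPredicate`, and NULL predicates are assumed to be routed to the null-count path instead.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

enum class RowGroupDecision { KEEP, SKIP };

// Hypothetical summary of what could be established for one column chunk.
struct DictionaryProbe {
    bool fully_dictionary_encoded = false;
    bool dictionary_readable = false;
    bool predicate_supported = false;   // e.g. EQ / IN / NE / NOT IN
    bool type_conversion_safe = false;
    bool is_null_predicate = false;     // IS NULL / IS NOT NULL: not decidable here
    std::vector<std::string> dictionary_values;
};

// Skips the row group only when every precondition holds and no dictionary
// value satisfies the predicate. NULL rows never satisfy a value predicate,
// so an all-miss dictionary is enough to skip.
inline RowGroupDecision prune_by_dictionary(
        const DictionaryProbe& probe,
        const std::function<bool(const std::string&)>& value_predicate) {
    if (!probe.fully_dictionary_encoded || !probe.dictionary_readable ||
        !probe.predicate_supported || !probe.type_conversion_safe ||
        probe.is_null_predicate) {
        return RowGroupDecision::KEEP;
    }
    for (const std::string& value : probe.dictionary_values) {
        if (value_predicate(value)) {
            return RowGroupDecision::KEEP;
        }
    }
    return RowGroupDecision::SKIP;
}
```

With the dictionary `['Alice', 'Cindy']` and the predicate `name = 'Bob'`, the function returns `SKIP` only if all four preconditions were confirmed; clearing any one of them turns the result back into `KEEP`.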
+
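+## Approach 2: Batch-Level Dict Id Selection
+
+### Idea
+
+When a row group cannot be skipped as a whole, it is still possible to avoid fully decoding the predicate column into a string column.
+
+Example:
+
+```text
+dictionary values:
+  id 0 -> 'Alice'
+  id 1 -> 'Bob'
+  id 2 -> 'Cindy'
+
+predicate:
+  name = 'Bob'
+
+matched dict ids:
+  {1}
+
+data page ids:
+  [0, 1, 1, 2, 0]
+
+selection:
+  [1, 2]
+```
+
+Here the predicate column only needs a scan of the dictionary ids; it does not need to be materialized into a `ColumnString`.
+Non-predicate columns keep using the current `SelectionVector` for lazy materialization.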
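
The example above can be sketched in two steps: evaluate the predicate once per dictionary entry, then scan the batch's dictionary ids against that result. `match_dictionary` and `select_by_dictionary_ids` are hypothetical free functions for illustration, not the reader interface proposed in this document. Representing a NULL row as id `-1`, and a batch of at most 65536 rows so that offsets fit in `uint16_t`, are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Evaluates the predicate once per dictionary entry; matched[id] is true when
// the value with that dictionary id satisfies the predicate.
inline std::vector<bool> match_dictionary(
        const std::vector<std::string>& dictionary_values,
        const std::function<bool(const std::string&)>& value_predicate) {
    std::vector<bool> matched(dictionary_values.size(), false);
    for (size_t id = 0; id < dictionary_values.size(); ++id) {
        matched[id] = value_predicate(dictionary_values[id]);
    }
    return matched;
}

// Builds the selected row offsets for one batch from its dictionary ids.
// Negative ids stand for NULL rows here and never match.
inline std::vector<uint16_t> select_by_dictionary_ids(const std::vector<int32_t>& batch_ids,
                                                      const std::vector<bool>& matched) {
    std::vector<uint16_t> selection;
    for (size_t row = 0; row < batch_ids.size(); ++row) {
        const int32_t id = batch_ids[row];
        if (id >= 0 && static_cast<size_t>(id) < matched.size() && matched[id]) {
            selection.push_back(static_cast<uint16_t>(row));
        }
    }
    return selection;
}
```

With the dictionary `['Alice', 'Bob', 'Cindy']`, the predicate `name = 'Bob'` and the ids `[0, 1, 1, 2, 0]`, the resulting selection is `[1, 2]`, matching the example. No string comparison happens per row, only one lookup per dictionary id.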
+
+### Flow
+
+```text
+open row group
+    -> read the dictionary values of the dictionary predicate column
+    -> evaluate the ColumnPredicate on the dictionary values
+    -> obtain the matched dict id set
+
+read batch
+    -> read the dictionary ids of this batch
+    -> build a SelectionVector from the matched dict id set
+    -> selected read of non-predicate columns by SelectionVector
+    -> if the dictionary predicate column is also projected, convert it to a real value column on demand
+```
+
+### Reader Abstraction
+
+It is better to add a separate reader branch in `column_reader.*` than to push the logic into
+`PrimitiveColumnReader::read()`:
+
+```text
+ParquetColumnReader
+    PrimitiveColumnReader
+    DictionaryColumnReader
+```
+
+Alternatively, skip the new class at first and express it as an internal strategy:
+
+```text
+PrimitiveColumnReader
+    decoded reader path
+    dictionary reader path
+```
+
+Capabilities that need to be exposed:
+
+```text
+read_dictionary_values(MutableColumnPtr* values)
+read_dictionary_ids(int64_t rows, MutableColumnPtr* ids, int64_t* rows_read)
+select_by_dictionary_ids(...)
+materialize_dictionary_ids(...)
+```
+
+The exact names can be settled during implementation, but the boundaries should hold:
+
+- reading dictionary values / ids belongs in `column_reader.*`;
+- building matched dict ids from a predicate belongs in `parquet_statistics.*` or a new filter helper;
+- wiring the selection into the scan loop belongs in `parquet_reader.cpp`.
+
+### Limitations of Arrow RecordReader
+
+The Arrow Parquet `RecordReader` has a `read_dictionary` parameter and a `ReadDictionary()` API,
+but the current code uses `read_dictionary=false`.
+
+A later step can try:
+
+```cpp
+_row_group->RecordReader(leaf_column_id, /*read_dictionary=*/true)
+```
+
+To be verified:
+
+- whether dictionary ids are exposed only for fully dictionary-encoded column chunks;
+- whether mixed encoding automatically falls back to decoded values;
+- whether `RecordReader::read_dictionary()` reliably indicates that the reader is really reading ids;
+- support for types other than `BYTE_ARRAY` / `FIXED_LEN_BYTE_ARRAY`;
+- how ids and def levels align by row in nullable columns.
+
+Judging from the comments in the Arrow headers, dictionary exposure is mainly an experimental API and is more reliable for fully
+dictionary-encoded byte array column chunks. The first implementation should therefore target only string-like
+columns and must have a fallback.
+
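+## Relationship to the Old vparquet
+
+The old `vparquet` already implements a dictionary filtering approach:
+
+1. check whether the column chunk is fully dictionary encoded;
+2. read the dictionary values into a temporary string column;
+3. evaluate the original predicate;
+4. rewrite the indices of the matching dictionary values into an int dict code predicate;
+5. output a dict id column when reading data pages;
+6. convert dict ids back to strings only when the column finally has to be output.
+
+The new parquet reader can reuse this design idea, but reusing the old implementation code directly is not recommended:
+
+- the old implementation is built on Doris's own page decoder;
+- the new parquet reader is currently built on the Arrow Parquet core API;
+- the new reader already has `SelectionVector`, so it can build the selection directly from dict ids without necessarily rewriting into
+  a `VExprContext`.
+
+A better fit for the new reader is:
+
+```text
+dictionary values -> ColumnPredicate -> matched dict id set -> SelectionVector
+```
+
+rather than:
+
+```text
+dictionary values -> VExprContext -> rewrite predicate expression
+```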
+
+## Recommended Implementation Order
+
+### Stage 1: Metadata Check and Row-Group-Level Dictionary Pruning
+
+New capabilities:
+
+- check whether a column chunk is fully dictionary encoded;
+- read dictionary values for string-like primitive columns;
+- evaluate `ColumnPredicate` on the dictionary values;
+- run dictionary pruning as an extra step in `ParquetStatisticsUtils::SelectRowGroups()`.
+
+Constraints:
+
+- support only `BYTE_ARRAY` / `FIXED_LEN_BYTE_ARRAY` string-like columns;
+- support only structured `ColumnPredicate`;
+- no expression fallback;
+- no mixed encoding;
+- when in doubt, conservatively keep the row group.
+
+### Stage 2: Batch-Level Dict Id Selection
+
+New capabilities:
+
+- build a dictionary-aware predicate column reader;
+- read the dictionary ids of a batch;
+- build a `SelectionVector` from the matched dict id set;
+- merge with the existing lazy materialization path.
+
+Constraints:
+
+- if the predicate column is also projected, it must be materialized into a real Doris column according to the selection;
+- the dict id column must not leak into the `ParquetReader` output block;
+- falling back to the decoded value path must remain correct.
+
+### Stage 3: More Types and Complex Predicates
+
+To consider later:
+
+- numeric dictionary;
+- decimal dictionary;
+- timestamp/date dictionary;
+- `LIKE` / prefix filter;
+- expression fallback;
+- combined page index + dictionary pruning.
+
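+## Can the Current Implementation Do This Directly?
+
+No.
+
+The current implementation lacks the following key pieces:
+
+- `RecordReader` is created with `read_dictionary=false`;
+- there is no dictionary metadata check;
+- there is no interface for reading the dictionary page;
+- there is no dict id column or dict id selection;
+- predicate filtering runs on Doris Columns that are already materialized.
+
+So at present only decoded value filtering is possible, not dictionary predicate pushdown.
+
+## Key Design Conclusions
+
+- Dictionary optimization belongs in the Parquet file-local layer and does not enter the table schema / Iceberg layer.
+- Stage 1 prioritizes row-group-level dictionary pruning, which has clear benefit and low risk.
+- Stage 2 adds batch-level dict id selection, combined with the existing `SelectionVector` and lazy materialization.
+- When building on the Arrow Parquet API, the fallback strategy must be explicit; do not assume every dictionary-encoded column can expose
+  dictionary ids.
+- The output block must always consist of normal Doris Columns; a dict id column must never be exposed to upper layers.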

