AshinGau commented on code in PR #32785: URL: https://github.com/apache/doris/pull/32785#discussion_r1538491700
########## be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp: ########## @@ -88,24 +91,22 @@ Status ColumnChunkReader::next_page() { if (UNLIKELY(_remaining_num_values != 0)) { return Status::Corruption("Should skip current page"); } - RETURN_IF_ERROR(_page_reader->next_page_header()); - if (_page_reader->get_page_header()->type == tparquet::PageType::DICTIONARY_PAGE) { - // the first page maybe directory page even if _metadata.__isset.dictionary_page_offset == false, - // so we should parse the directory page in next_page() - RETURN_IF_ERROR(_decode_dict_page()); - // parse the real first data page - return next_page(); - } else if (_page_reader->get_page_header()->type == tparquet::PageType::DATA_PAGE_V2) { - _remaining_num_values = _page_reader->get_page_header()->data_page_header_v2.num_values; - _chunk_parsed_values += _remaining_num_values; - _state = HEADER_PARSED; - return Status::OK(); + + if (!_has_dict_parsed) { + _has_dict_parsed = true; + RETURN_IF_ERROR(_page_reader->load_page_header()); + if (_page_reader->get_page_header()->type == tparquet::PageType::DICTIONARY_PAGE) { + // the first page maybe directory page even if _metadata.__isset.dictionary_page_offset == false, + // so we should parse the directory page in next_page() + RETURN_IF_ERROR(_decode_dict_page()); + // parse the real first data page + return next_page(); + } } else { - _remaining_num_values = _page_reader->get_page_header()->data_page_header.num_values; - _chunk_parsed_values += _remaining_num_values; Review Comment: You'd better follow https://github.com/apache/doris/pull/28891 to understand why there should be a variable `_chunk_parsed_values` to record the parsed values. ########## be/src/vec/exec/format/parquet/vparquet_reader.cpp: ########## @@ -788,7 +788,7 @@ Status ParquetReader::_process_page_index(const tparquet::RowGroup& row_group, // use the union row range skipped_row_ranges.emplace_back(skipped_row_range); } - _col_offsets.emplace(parquet_col_id, offset_index); + _col_offsets.emplace(read_col, offset_index); Review Comment: Only columns in `ColumnValueRanges` can reach here, and should understand what `parquet_col_id` stands for. ########## be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp: ########## @@ -88,24 +91,22 @@ Status ColumnChunkReader::next_page() { if (UNLIKELY(_remaining_num_values != 0)) { return Status::Corruption("Should skip current page"); } - RETURN_IF_ERROR(_page_reader->next_page_header()); - if (_page_reader->get_page_header()->type == tparquet::PageType::DICTIONARY_PAGE) { - // the first page maybe directory page even if _metadata.__isset.dictionary_page_offset == false, - // so we should parse the directory page in next_page() - RETURN_IF_ERROR(_decode_dict_page()); - // parse the real first data page - return next_page(); - } else if (_page_reader->get_page_header()->type == tparquet::PageType::DATA_PAGE_V2) { - _remaining_num_values = _page_reader->get_page_header()->data_page_header_v2.num_values; - _chunk_parsed_values += _remaining_num_values; - _state = HEADER_PARSED; - return Status::OK(); + + if (!_has_dict_parsed) { Review Comment: Not all column chunk have dictionary page ########## be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp: ########## @@ -47,21 +47,24 @@ namespace doris::vectorized { ColumnChunkReader::ColumnChunkReader(io::BufferedStreamReader* reader, tparquet::ColumnChunk* column_chunk, FieldSchema* field_schema, - cctz::time_zone* ctz, io::IOContext* io_ctx) + cctz::time_zone* ctz, io::IOContext* io_ctx, + const tparquet::OffsetIndex* offset_index) Review Comment: Better to move `cctz::time_zone* ctz, io::IOContext* io_ctx` as the last two parameters. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org