Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21559 )

Change subject: IMPALA-13161: Fix column index overflow in DelimitedTextParser
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/21559/2/be/src/exec/delimited-text-parser.inline.h
File be/src/exec/delimited-text-parser.inline.h:

http://gerrit.cloudera.org:8080/#/c/21559/2/be/src/exec/delimited-text-parser.inline.h@76
PS2, Line 76:   if (column_idx_ < num_cols_) ++column_idx_;
> That file would be ambiguous if it had more than 1 column? How does the beh
The file won't be ambiguous. How to read and materialize the data is defined by 
the table schema (i.e. column types) and the tuple descriptor (i.e. what the 
query wants, mainly a bool array of 'is_materialized_col_').

How many columns in a text line is determined by the field delimiters. If one 
line has N field delimiters, then N+1 columns can be read out from that line. 
However, it depends on the table schema and tuple descriptor to select the 
columns. As this comment mentions, we use a two pass parsing approach:
https://github.com/apache/impala/blob/e2e45401e2bead4090fd5c562709db521cbc6d38/be/src/exec/hdfs-scanner.h#L60-L77

The first pass fills out the FieldLocation structs which contain where all the 
fields start and their lengths in the byte buffer.
The second pass uses the FieldLocation to materialize slots in the tuples.

AddColumn() is used in the first pass. If a text line has more columns than the 
table schema, they won't be added into the FieldLocations. This is judged by 
ReturnCurrentColumn():

  bool ReturnCurrentColumn() const {
    return column_idx_ < num_cols_ && is_materialized_col_[column_idx_];
  }

'column_idx_' is the index of the column in the text line. It starts from the 
number of partitions (not starts from 0). It's only used to compare with 
num_cols_ and num_partition_keys_ so we don't need to bump it if it's already 
larger than num_cols_.
The original code keeps bumping it even if it's overflowed. Then column_idx_ < 
num_cols_ passes and accessing is_materialized_col_[column_idx_] crashes. With 
this fix, column_idx_ won't be larger than num_cols_ so won't cause the crash. 
This is the only behavior change and it's not aware by the user or any caller 
of this function.



--
To view, visit http://gerrit.cloudera.org:8080/21559
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I527a8971e92e270d5576c2155e4622dd6d43d745
Gerrit-Change-Number: 21559
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Fri, 12 Jul 2024 06:27:42 +0000
Gerrit-HasComments: Yes

Reply via email to