[Impala-ASF-CR] WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files

Mihaly Szjatinya (Code Review) Wed, 26 Mar 2025 06:37:29 -0700

Mihaly Szjatinya has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22049 )


Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence 
files
......................................................................


Patch Set 15:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/22049/14//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22049/14//COMMIT_MSG@22
PS14, Line 22: confusion of allowing both serde and tbl properties. (See related
> Can you mention IMPALA-13748?
Ack


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/runtime/descriptors.h@452
PS14, Line 452:   const std::string& avro_schema() const { return avro_schema_; 
}
> nit: extra line
Ack


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h
File be/src/util/char-codec.h:

http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@40
PS14, Line 40:
> nit: extra ,
Ack


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@47
PS14, Line 47: elimiter fou
> nit: progressively
Ack


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@57
PS14, Line 57:  EncodeBuffer(co
> this could be mover to mem_tracker.h, e.g. as MemTracker::ScopedAllocation
Ack


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@80
PS14, Line 80:
             :   /// Buffer to hold transformed data.
             :   uint8_t* out_buffer_ = nullptr;
             :
             :   /// Length of the output buffer.
             :   int64_t buffer_length_ = 0;
             : };
             : }
             :
             :
             :
             :
             :
             :
             :
             :
             :
> These are only used by DecodeBuffer - can you add a comment about that, e.g
Ack


http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc
File be/src/util/char-codec.cc:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@77
PS8, Line 77:     memory_pool_->FreeAll();
            :   } else if (eosr) {
> I think that the current estimates are good enough.
Ack


http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@113
PS8, Line 113:             (char*)prefix.data() + prefix.size(), encoding_, 
boost::locale::conv::stop);
> I think that we are still suboptimal here - for example if I understand cor
You're right. In case of decompression the decompressed buffer can be released. 
This is also where std::string for decoded data comes handy, allowing to free 
up the pool prior to allocating memory for decoded data and thus to get away 
without an additional pool.


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc
File be/src/util/char-codec.cc:

http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@25
PS14, Line 25: #include "runtime/mem-pool.h"
> nit: usually includes to Impala headers are sorted alphabetically
Ack


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@50
PS14, Line 50:     if (pool != nullptr) {
> Both branches mean that out_buffer_ is no longer usable, right? I think tha
Agree.


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@57
PS14, Line 57:
> Can you add a comment about reason for this size?
Ack


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@71
PS14, Line 71:     return Status(TErrorCode::CHARSET_CONVERSION_ERROR,
             :         "End of stream reached with partial symbol.");
> Isn't it possible for out_buffer_ to be set but not large enough? For examp
You are right, logic here doesn't make sense, I've probably just written smth 
to start with and forgot to get back to it later. My bad.

Approach you describe seems reasonable.


http://gerrit.cloudera.org:8080/#/c/22049/14/tests/query_test/test_charcodec.py
File tests/query_test/test_charcodec.py:

http://gerrit.cloudera.org:8080/#/c/22049/14/tests/query_test/test_charcodec.py@77
PS14, Line 77: lambda v: v.get_value('table_format').file_format == 'text'
> Is there any test for SequenceFile?
IIRC we have previously agreed to move SequenceFiles decoding to a separate 
jira, once this one is finalized. Then we'll also expand tests.



--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 15
Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Comment-Date: Wed, 26 Mar 2025 13:37:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files

Reply via email to