Mihaly Szjatinya has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files ...................................................................... Patch Set 15: (13 comments) http://gerrit.cloudera.org:8080/#/c/22049/14//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/22049/14//COMMIT_MSG@22 PS14, Line 22: confusion of allowing both serde and tbl properties. (See related > Can you mention IMPALA-13748? Ack http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/runtime/descriptors.h@452 PS14, Line 452: const std::string& avro_schema() const { return avro_schema_; } > nit: extra line Ack http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h File be/src/util/char-codec.h: http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@40 PS14, Line 40: > nit: extra , Ack http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@47 PS14, Line 47: elimiter fou > nit: progressively Ack http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@57 PS14, Line 57: EncodeBuffer(co > this could be mover to mem_tracker.h, e.g. as MemTracker::ScopedAllocation Ack http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@80 PS14, Line 80: : /// Buffer to hold transformed data. : uint8_t* out_buffer_ = nullptr; : : /// Length of the output buffer. : int64_t buffer_length_ = 0; : }; : } : : : : : : : : : > These are only used by DecodeBuffer - can you add a comment about that, e.g Ack http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc File be/src/util/char-codec.cc: http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@77 PS8, Line 77: memory_pool_->FreeAll(); : } else if (eosr) { > I think that the current estimates are good enough. Ack http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@113 PS8, Line 113: (char*)prefix.data() + prefix.size(), encoding_, boost::locale::conv::stop); > I think that we are still suboptimal here - for example if I understand cor You're right. In case of decompression the decompressed buffer can be released. This is also where std::string for decoded data comes handy, allowing to free up the pool prior to allocating memory for decoded data and thus to get away without an additional pool. http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc File be/src/util/char-codec.cc: http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@25 PS14, Line 25: #include "runtime/mem-pool.h" > nit: usually includes to Impala headers are sorted alphabetically Ack http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@50 PS14, Line 50: if (pool != nullptr) { > Both branches mean that out_buffer_ is no longer usable, right? I think tha Agree. http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@57 PS14, Line 57: > Can you add a comment about reason for this size? Ack http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@71 PS14, Line 71: return Status(TErrorCode::CHARSET_CONVERSION_ERROR, : "End of stream reached with partial symbol."); > Isn't it possible for out_buffer_ to be set but not large enough? For examp You are right, logic here doesn't make sense, I've probably just written smth to start with and forgot to get back to it later. My bad. Approach you describe seems reasonable. http://gerrit.cloudera.org:8080/#/c/22049/14/tests/query_test/test_charcodec.py File tests/query_test/test_charcodec.py: http://gerrit.cloudera.org:8080/#/c/22049/14/tests/query_test/test_charcodec.py@77 PS14, Line 77: lambda v: v.get_value('table_format').file_format == 'text' > Is there any test for SequenceFile? IIRC we have previously agreed to move SequenceFiles decoding to a separate jira, once this one is finalized. Then we'll also expand tests. -- To view, visit http://gerrit.cloudera.org:8080/22049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65 Gerrit-Change-Number: 22049 Gerrit-PatchSet: 15 Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Comment-Date: Wed, 26 Mar 2025 13:37:17 +0000 Gerrit-HasComments: Yes