[ 
https://issues.apache.org/jira/browse/IMPALA-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955584#comment-17955584
 ] 

ASF subversion and git services commented on IMPALA-14100:
----------------------------------------------------------

Commit 4837cedc795017aa1e7b69ef0914020a3022ca88 in impala's branch 
refs/heads/master from Mihaly Szjatinya
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=4837cedc7 ]

IMPALA-10319: Support arbitrary encodings on Text files

As proposed in Jira, this implements decoding and encoding of text
buffers for Impala/Hive text tables. Given a table with the
'serialization.encoding' property set, Impala should, similarly to Hive,
encode inserted data into the specified charset before saving it to a
text file. The opposite decoding operation should be performed when
reading data buffers from text files. Both operations use the
boost::locale::conv library.

Since Hive doesn't encode line delimiters, charsets that would have
delimiters stored differently from ASCII are not allowed.
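
The delimiter restriction above can be illustrated with a small Python
sketch. This is not Impala's actual check (the patch itself relies on
boost::locale::conv in C++); the function name and the delimiter list are
hypothetical and chosen only for illustration. It tests whether a charset
stores a few common text-file delimiters as the same single bytes as ASCII.

```python
def delimiters_match_ascii(charset, delimiters=("\n", "\r", "\t", ",", "\x01")):
    """Return True if every delimiter encodes to the same byte as in ASCII.

    Hypothetical helper: the delimiter list is an assumption, not the set
    that Impala or Hive actually validates.
    """
    for delim in delimiters:
        try:
            if delim.encode(charset) != delim.encode("ascii"):
                return False
        except (LookupError, UnicodeEncodeError):
            # Unknown charset, or the delimiter cannot be represented at all.
            return False
    return True


if __name__ == "__main__":
    for name in ("latin-1", "gbk", "utf-16"):
        print(name, delimiters_match_ascii(name))
```

Under this simplified check, single-byte and ASCII-compatible multi-byte
charsets pass, while UTF-16 fails because it stores a line feed as two
bytes.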

One difference from Hive is that Impala implements
'serialization.encoding' only as a per-partition serdeproperty, to avoid
the confusion of allowing both serde and table properties. (See related
IMPALA-13748)

Note: Due to precreated non-UTF-8 files present in the patch,
'gerrit-code-review-checks' was performed locally. (See IMPALA-14100)

Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Reviewed-on: http://gerrit.cloudera.org:8080/22049
Reviewed-by: Csaba Ringhofer <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> critique-gerrit-review.py crashes with a codec exception when reviewing a 
> diff containing data with non-UTF-8 encoding
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-14100
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14100
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>            Reporter: Laszlo Gaal
>            Priority: Major
>
> The precommit checker script {{bin/jenkins/critique-gerrit-review.py}} can 
> crash with the following Python traceback when the change diff contains data 
> in an encoding other than UTF-8. This can happen when prebuilt data files 
> are supplied with a patch, as happened with 
> https://gerrit.cloudera.org/c/22049/ for example.
> {code}
> 10:34:47.030720 git.c:439               trace: built-in: git diff -U0 
> HEAD^..HEAD
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/gerrit-auto-critic-test/Impala/bin/jenkins/critique-gerrit-review.py",
>  line 491, in <module>
>     merge_comments(comments, get_misc_comments(base_revision, revision, 
> args.dryrun))
>   File 
> "/var/lib/jenkins/workspace/gerrit-auto-critic-test/Impala/bin/jenkins/critique-gerrit-review.py",
>  line 209, in get_misc_comments
>     diff = check_output(["git", "diff", "-U0", 
> "{0}..{1}".format(base_revision, revision)],
>   File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
>     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
>   File "/usr/lib/python3.8/subprocess.py", line 495, in run
>     stdout, stderr = process.communicate(input, timeout=timeout)
>   File "/usr/lib/python3.8/subprocess.py", line 1015, in communicate
>     stdout = self.stdout.read()
>   File "/usr/lib/python3.8/codecs.py", line 322, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 34006: 
> invalid start byte
> {code}
> Excluding the problematic file(s) in 
> https://github.com/apache/impala/blob/f4e75510948bdb72f2d5206161fee12e5b6d0888/bin/jenkins/critique-gerrit-review.py#L68-L77
>  does not help, as the crash happens when processing the output of {{git 
> diff}}, which returns a single output stream containing all the changes in 
> all the files.
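
One possible way to avoid this kind of crash, shown here only as a minimal
sketch and not as the fix adopted in Impala, is to capture the output of
git diff as raw bytes and decode it tolerantly, so that bytes that are not
valid UTF-8 are replaced instead of raising UnicodeDecodeError. The helper
names decode_tolerantly and git_diff_text are hypothetical.

```python
import subprocess


def decode_tolerantly(raw, encoding="utf-8"):
    """Decode bytes, replacing undecodable bytes with U+FFFD."""
    return raw.decode(encoding, errors="replace")


def git_diff_text(base_revision, revision):
    """Return the diff between two revisions as text.

    Hypothetical helper: without text-mode arguments, check_output returns
    bytes, so decoding happens here rather than inside subprocess.
    """
    raw = subprocess.check_output(
        ["git", "diff", "-U0", "{0}..{1}".format(base_revision, revision)])
    return decode_tolerantly(raw)


if __name__ == "__main__":
    # 0xc0 is the invalid start byte reported in the traceback above.
    print(decode_tolerantly(b"abc\xc0def"))
```

The trade-off is that the replaced bytes are lost, which is acceptable only
if the script merely inspects the textual parts of the diff.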



