On Thu, 10 Apr 2025 10:14:40 GMT, Magnus Ihse Bursie <i...@openjdk.org> wrote:
>> I have checked the entire code base for incorrect encodings, but luckily
>> enough these were the only remaining problems I found.
>>
>> BOM (byte-order mark) is a method used for distinguishing big- and
>> little-endian UTF-16 encodings. There is a special UTF-8 BOM, but it is
>> discouraged. In the words of the Unicode Consortium: "Use of a BOM is
>> neither required nor recommended for UTF-8". We have UTF-8 BOMs in a
>> handful of files. These should be removed.
>>
>> Methodology used:
>>
>> I have run four different tools that use different heuristics for
>> determining the encoding of a file:
>> * chardetect (the original, slow-as-molasses Perl program, which also had
>> the worst-performing heuristics of all; I'll rate it 1/5)
>> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
>> * enca (targeted towards obscure code pages)
>> * libmagic / `file --mime-encoding`
>>
>> They all agreed on pure ASCII files (which is easy to check), and these I
>> just ignored/accepted as good. The handling of pure binary files differed
>> between the tools; most detected them as binary, but some suggested arcane
>> encodings for specific (often small) binary files. To keep my sanity, I
>> decided that files ending in any of these extensions were binary, and I
>> did not check them further:
>> * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
>>
>> From the remaining list of non-ASCII, non-known-binary files I selected
>> two overlapping and exhaustive subsets:
>> * All files that at least one tool claimed to be UTF-8
>> * All files that at least one tool claimed to be *not* UTF-8
>>
>> For the first subset, I checked every non-ASCII character (using
>> `LC_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat
>> names-of-files-to-check.txt)` and visually examining the results). At this
>> stage, I found several files where Unicode was unnecessarily used instead
>> of pure ASCII, and I treated those files separately. Other than that, my
>> inspection revealed no obvious encoding errors. This list comprised about
>> 2000 files, so I did not spend too much time on each file. The assumption,
>> after all, was that these files are okay.
>>
>> For the second subset, I checked every non-ASCII character (using the
>> same method). This list was around 300 files. Most of them were okay as
>> far as I can tell; I can confirm encodings for European languages 100%,
>> but CJK encodings could theoretically be wrong; they looked sane, but I
>> cannot read them to confirm fully. Several were in fact pure...
>
> src/hotspot/cpu/x86/macroAssembler_x86_sha.cpp line 497:
>
>> 495: /*
>> 496: The algorithm below is based on Intel publication:
>> 497: "Fast SHA-256 Implementations on Intel(R) Architecture Processors" by
>> Jim Guilford, Kirk Yap and Vinodh Gopal.
>
> Note: There is of course a Unicode `®` symbol, which is what it was
> originally before it was botched here, but I found no reason to keep it,
> and in the spirit of JDK-8354213, I thought it better to use pure ASCII
> here.

I guess the difference at L.1 in the various files is just the BOM?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037161789
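
As background to the BOM question above: the UTF-8 BOM is the byte sequence `EF BB BF` at the very start of a file, so a BOM-only difference appears as an otherwise invisible change on line 1. Below is a minimal sketch of how such a BOM could be detected and stripped; the `BomCheck` class, its name, and its rewrite-in-place behaviour are illustrative assumptions, not the actual change made in this PR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Illustrative sketch only; BomCheck is not part of the PR.
public class BomCheck {
    // The UTF-8 BOM is the byte sequence EF BB BF at the start of a file.
    private static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    // Returns true if the byte content starts with a UTF-8 BOM.
    static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= UTF8_BOM.length
                && bytes[0] == UTF8_BOM[0]
                && bytes[1] == UTF8_BOM[1]
                && bytes[2] == UTF8_BOM[2];
    }

    // Rewrites the file without its leading BOM, if one is present.
    static void stripUtf8Bom(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        if (hasUtf8Bom(bytes)) {
            Files.write(file, Arrays.copyOfRange(bytes, UTF8_BOM.length, bytes.length));
        }
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path p = Path.of(arg);
            boolean bom = hasUtf8Bom(Files.readAllBytes(p));
            System.out.println(p + ": " + (bom ? "UTF-8 BOM present" : "no UTF-8 BOM"));
        }
    }
}
```

For example, `java BomCheck src/foo.cpp` (a hypothetical path) would report whether the file still carries a BOM. Checking raw bytes rather than decoding to a `String` keeps the test unambiguous, since a decoded BOM shows up only as an invisible U+FEFF character.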