I have checked the entire code base for incorrect encodings, but luckily enough 
these were the only remaining problems I found. 

A BOM (byte-order mark) is a marker used to distinguish between big-endian and 
little-endian UTF-16 encodings. There is a special UTF-8 BOM as well, but its 
use is discouraged. In the words of the Unicode Consortium: "Use of a BOM is 
neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of 
files. These should be removed.

Methodology used: 

I ran four different tools, each using different heuristics to determine the 
encoding of a file:
* chardetect (the original, slow-as-molasses Perl program, which also had the 
worst-performing heuristics of all; I'll rate it 1/5)
* uchardet (a modern version by freedesktop, used by e.g. Firefox)
* enca (targeted towards obscure code pages)
* libmagic / `file --mime-encoding`

They all agreed on pure ASCII files (which is easy to check), and these I just 
ignored/accepted as good. The handling of pure binary files differed between 
the tools; most detected them as binary but some suggested arcane encodings for 
specific (often small) binary files. To keep my sanity, I decided that files 
ending in any of these extensions were binary, and I did not check them further:
* `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
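
Just to illustrate the filtering (a sketch, not the actual script I used; the 
class, method and file names are made up), the extension check amounts to 
something like:

```java
import java.util.Set;

public class KnownBinaryFilter {
    // Extensions treated as binary and excluded from further encoding checks.
    private static final Set<String> BINARY_EXTENSIONS = Set.of(
            "gif", "png", "ico", "jpg", "icns", "tiff", "wav", "woff", "woff2",
            "jar", "ttf", "bmp", "class", "crt", "jks", "keystore", "ks", "db");

    static boolean isKnownBinary(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot >= 0
                && BINARY_EXTENSIONS.contains(fileName.substring(dot + 1).toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(isKnownBinary("duke.png"));   // true: skipped as binary
        System.out.println(isKnownBinary("Foo.java"));   // false: checked further
    }
}
```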

From the remaining list of non-ASCII, non-known-binary files I selected two 
overlapping and exhaustive subsets:
* All files where at least one tool claimed it to be UTF-8
* All files where at least one tool claimed it to be *not* UTF-8

For the first subset, I checked every non-ASCII character (using `LC_ALL=C ggrep 
-H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and 
visually examining the results). At this stage, I found several files where 
Unicode characters were used unnecessarily instead of pure ASCII, and I treated 
those files separately. Other than that, my inspection revealed no obvious 
encoding errors. This list comprised about 2000 files, so I did not spend too 
much time on each file. The assumption, after all, was that these files are okay.
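
As a complement to eyeballing grep output, a strict UTF-8 decode can mechanically 
confirm that a file at least parses as well-formed UTF-8 (a sketch only; the file 
argument is hypothetical, and passing the check says nothing about whether the 
characters are the intended ones):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Check {
    public static void main(String[] args) throws IOException {
        Path file = Path.of(args[0]);  // hypothetical file to check
        try {
            // Decode strictly: any byte sequence that is not well-formed UTF-8 throws.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(Files.readAllBytes(file)));
            System.out.println(file + ": well-formed UTF-8");
        } catch (CharacterCodingException e) {
            System.out.println(file + ": not well-formed UTF-8");
        }
    }
}
```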

For the second subset, I checked every non-ASCII character (using the same 
method). This list contained 300+ files. Most of them were okay as far as I can 
tell; I can confirm the encodings for European languages 100%, but CJK encodings 
could theoretically be wrong; they looked sane, but I cannot read them and 
confirm fully. Several were in fact pure binary files, but without any telling 
extensions (most of these are in tests). The BOM files were only pointed out by 
chardetect; I ran an additional search for UTF-8 BOM markers over the code base 
to make sure I did not miss any others (since chardetect otherwise did a 
not-so-perfect job).
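
Any tool that inspects the first bytes of each file will do for such a BOM 
search; a minimal sketch (the repository root argument is hypothetical, and this 
is not necessarily the exact search I ran) might be:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class FindBomFiles {
    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : ".");  // hypothetical repo root
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(FindBomFiles::reportIfBom);
        }
    }

    private static void reportIfBom(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = in.readNBytes(3);
            // A UTF-8 BOM is the byte sequence EF BB BF at the start of the file.
            if (head.length == 3
                    && head[0] == (byte) 0xEF
                    && head[1] == (byte) 0xBB
                    && head[2] == (byte) 0xBF) {
                System.out.println(file);
            }
        } catch (IOException e) {
            // Ignore unreadable files in this sketch.
        }
    }
}
```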

The files included in this PR are those I actually found to have encoding 
errors or issues.

-------------

Commit messages:
 - Remove UTF-8 BOM (byte-order mark) which is discouraged by the Unicode 
Consortium
 - Fix incorrect encoding

Changes: https://git.openjdk.org/jdk/pull/24566/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24566&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8354266
  Stats: 32 lines in 13 files changed: 0 ins; 2 del; 30 mod
  Patch: https://git.openjdk.org/jdk/pull/24566.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/24566/head:pull/24566

PR: https://git.openjdk.org/jdk/pull/24566
