I have checked the entire code base for incorrect encodings, and luckily these were the only remaining problems I found.
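For anyone who wants to re-run a basic sanity check of their own, here is a rough sketch of how files that are not even well-formed UTF-8 could be flagged using the JDK itself (this is *not* what I actually ran — my checking was done with the external tools described below, and the class name `Utf8Check` is just for illustration):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Check {
    // Returns true if the file's bytes form well-formed UTF-8.
    // A freshly created decoder REPORTs malformed input by default,
    // so decode() throws on any byte sequence that is not valid UTF-8.
    static boolean isWellFormedUtf8(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path file = Path.of(arg);
            if (!isWellFormedUtf8(file)) {
                System.out.println("Not well-formed UTF-8: " + file);
            }
        }
    }
}
```

Note that a file can be perfectly well-formed UTF-8 and still be incorrectly encoded (e.g. mojibake from a double conversion), so a check like this only catches outright malformed byte sequences.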
A BOM (byte-order mark) is a mechanism for distinguishing big-endian from little-endian UTF-16 encodings. There is a corresponding UTF-8 BOM, but its use is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed.

Methodology used:

I ran four different tools, each using different heuristics, to determine the encoding of a file:

* chardetect (the original, slow-as-molasses Perl program, which also had the worst-performing heuristics of the four; I'd rate it 1/5)
* uchardet (a modern implementation from freedesktop, used by e.g. Firefox)
* enca (targeted towards obscure code pages)
* libmagic / `file --mime-encoding`

They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good.

The handling of pure binary files differed between the tools; most detected them as binary, but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further:

* `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`

From the remaining list of non-ASCII, non-known-binary files I selected two overlapping and exhaustive subsets:

* All files where at least one tool claimed they were UTF-8
* All files where at least one tool claimed they were *not* UTF-8

For the first subset, I checked every non-ASCII character (using `LC_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)` and visually examining the results). At this stage, I found several files where Unicode characters were used unnecessarily instead of pure ASCII, and I treated those files separately. Apart from that, my inspection revealed no obvious encoding errors. This list comprised about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay.

For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay as far as I can tell; I can confirm the encodings for European languages 100%, but the CJK encodings could theoretically be wrong; they looked sane, but I cannot read them well enough to confirm fully. Several were in fact pure binary files, but without any telltale extensions (most of these are in tests).

The BOM files were only pointed out by chardetect; I ran an additional search for UTF-8 BOM markers over the code base to make sure I did not miss any others (since chardetect otherwise did a not-so-perfect job). See the sketch at the end of this message for how such a scan can be done.

The files included in this PR are the ones I actually found with encoding errors or issues.

-------------

Commit messages:
 - Remove UTF-8 BOM (byte-order mark) which is discouraged by the Unicode Consortium
 - Fix incorrect encoding

Changes: https://git.openjdk.org/jdk/pull/24566/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24566&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8354266
  Stats: 32 lines in 13 files changed: 0 ins; 2 del; 30 mod
  Patch: https://git.openjdk.org/jdk/pull/24566.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/24566/head:pull/24566

PR: https://git.openjdk.org/jdk/pull/24566
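PS: The UTF-8 BOM is the byte sequence `EF BB BF` at the very start of a file, so a scan for it only needs to look at the first three bytes of each file. For anyone who wants to double-check locally, here is a minimal illustrative sketch (not the exact search I used; the class name `BomScan` is just for illustration):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class BomScan {
    // The UTF-8 BOM is EF BB BF; check whether a file starts with it.
    static boolean hasUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = in.readNBytes(3);
            return head.length == 3
                    && head[0] == (byte) 0xEF
                    && head[1] == (byte) 0xBB
                    && head[2] == (byte) 0xBF;
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile).forEach(path -> {
                try {
                    if (hasUtf8Bom(path)) {
                        System.out.println("UTF-8 BOM found: " + path);
                    }
                } catch (IOException e) {
                    System.err.println("Skipping " + path + ": " + e);
                }
            });
        }
    }
}
```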