On Thu, 10 Apr 2025 17:09:27 GMT, Naoto Sato <na...@openjdk.org> wrote:
>> I have checked the entire code base for incorrect encodings, but luckily
>> enough these were the only remaining problems I found.
>>
>> BOM (byte-order mark) is a marker used to distinguish big-endian from
>> little-endian UTF-16 encodings. There is a special UTF-8 BOM, but it is
>> discouraged. In the words of the Unicode Consortium: "Use of a BOM is
>> neither required nor recommended for UTF-8". We have UTF-8 BOMs in a
>> handful of files. These should be removed.
>>
>> Methodology used:
>>
>> I ran four different tools that use different heuristics for determining
>> the encoding of a file:
>> * chardetect (the original, slow-as-molasses Perl program, which also had
>> the worst-performing heuristics of all; I'll rate it 1/5)
>> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
>> * enca (targeted towards obscure code pages)
>> * libmagic / `file --mime-encoding`
>>
>> They all agreed on pure ASCII files (which is easy to check), and these I
>> just ignored/accepted as good. The handling of pure binary files differed
>> between the tools; most detected them as binary, but some suggested arcane
>> encodings for specific (often small) binary files. To keep my sanity, I
>> decided that files ending in any of these extensions were binary, and I did
>> not check them further:
>> * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
>>
>> From the remaining list of non-ASCII, non-known-binary files I selected two
>> overlapping and exhaustive subsets:
>> * All files where at least one tool claimed them to be UTF-8
>> * All files where at least one tool claimed them to be *not* UTF-8
>>
>> For the first subset, I checked every non-ASCII character (using `LC_ALL=C
>> ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat
>> names-of-files-to-check.txt)`, and visually examining the results). At this
>> stage, I found several files where Unicode was unnecessarily used instead
>> of pure ASCII, and I treated those files separately. Other than that, my
>> inspection revealed no obvious encoding errors. This list comprised about
>> 2000 files, so I did not spend too much time on each file. The assumption,
>> after all, was that these files are okay.
>>
>> For the second subset, I checked every non-ASCII character (using the same
>> method). This list was 300+ files. Most of them were okay as far as I can
>> tell; I can confirm the encodings for European languages 100%, but CJK
>> encodings could theoretically be wrong; they looked sane, but I cannot read
>> them to confirm fully. Several were in fact pure...
>
> src/java.desktop/share/legal/lcms.md line 72:
>
>> 70: Mateusz Jurczyk (Google)
>> 71: Paul Miller
>> 72: Sébastien Léon
>
> I cannot comment on capitalization here, but if we wanted to lowercase
> them, should they be e-grave instead of e-acute?

If this is a French name, it's e acute: é (U+00E9), not e grave: è (U+00E8).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037917708
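
For completeness, here is a minimal sketch of how the BOM removal could be
done mechanically. This is not the code used in the PR, and the class name is
hypothetical; it just strips the three-byte UTF-8 BOM (0xEF 0xBB 0xBF) from
each file given as an argument:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class StripUtf8Bom {
    // The UTF-8 BOM is the UTF-8 encoding of U+FEFF: 0xEF 0xBB 0xBF.
    private static final byte[] BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            Path path = Path.of(name);
            byte[] bytes = Files.readAllBytes(path);
            if (bytes.length >= BOM.length
                    && bytes[0] == BOM[0] && bytes[1] == BOM[1] && bytes[2] == BOM[2]) {
                // Rewrite the file without the leading BOM bytes.
                Files.write(path, Arrays.copyOfRange(bytes, BOM.length, bytes.length));
                System.out.println("Stripped UTF-8 BOM from " + name);
            }
        }
    }
}
```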
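Similarly, the "claimed to be UTF-8" subset can be cross-checked by verifying
that each file at least decodes cleanly as UTF-8. Note that this only proves
structural validity, not that UTF-8 was the intended encoding, which is why
the heuristic tools were needed in the first place. A sketch (again
hypothetical, not from the PR):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CheckUtf8 {
    // Returns true if the bytes decode as well-formed UTF-8; with
    // CodingErrorAction.REPORT the decoder throws on malformed input
    // instead of silently substituting U+FFFD.
    static boolean isValidUtf8(Path path) throws IOException {
        var decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(Files.readAllBytes(path)));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + ": "
                    + (isValidUtf8(Path.of(name)) ? "valid UTF-8" : "NOT valid UTF-8"));
        }
    }
}
```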