Hi sebb and all,

Here's a revised proposed addition to ReversedLinesFileReader to support the CCJK Windows code pages:

...
} else if(charset == Charset.forName("Shift_JIS") ||     // Same as for UTF-8
                                                         // http://www.herongyang.com/Unicode/JIS-Shift-JIS-Encoding.html
          charset == Charset.forName("windows-31j") ||   // Windows code page 932 (Japanese)
          charset == Charset.forName("x-windows-949") || // Windows code page 949 (Korean)
          charset == Charset.forName("gbk") ||           // Windows code page 936 (Simplified Chinese)
          charset == Charset.forName("x-windows-950")) { // Windows code page 950 (Traditional Chinese)
    byteDecrement = 1;
}
...

A newline byte never appears as part of a multi-byte character in any of these encodings.

Thanks and regards,
Leandro

On 3/2/15, 4:02 PM, "Leandro Reis" <lr...@adobe.com> wrote:

>On 2 March 2015 at 21:53, sebb wrote:
>
>>>On 2 March 2015 at 20:00, Leandro Reis <lr...@adobe.com> wrote:
>>>Hi all,
>>>
>>>I'm working on a product that uses Commons IO via Jackrabbit Oak. In the
>>>process of testing the launch of such product on Japanese Windows 2012
>>>Server R2, I came across the following exception:
>>>"(java.io.UnsupportedEncodingException: Encoding windows-31j is not
>>>supported yet (feel free to submit a patch))"
>>>
>>>windows-31j is the IANA name for Windows code page 932 (Japanese), and
>>>is returned by Charset.defaultCharset(), used in
>>>org.apache.commons.io.input.ReversedLinesFileReader [0].
>>>
>>>It looks like this issue could be addressed by adding a check for
>>>"windows-31j" to ReversedLinesFileReader(final File file, final int
>>>blockSize, final Charset encoding):
>>>
>>>...
>>>} else if(charset.equals(Charset.forName("windows-31j"))) {
>>>    byteDecrement = 1;
>>>}
>>>...
>>>
>>>Similar changes would be needed in order to support the Chinese
>>>Simplified, Chinese Traditional, and Korean versions of the same OS (I'm
>>>checking what the corresponding encoding names are).
>>>
>>>Can someone familiar with this area of the code confirm this looks like
>>>the proper approach to addressing this?
>
>>Can a newline byte ever appear as part of a multi-byte character in any
>>of those encodings?
>No. Sources:
>- Japanese:
>http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
>- Simplified Chinese:
>http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT
>- Korean:
>http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT
>- Traditional Chinese:
>http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
>
>
>>>Thanks,
>>> Leandro
>>>
>>>[0]
>>>http://svn.apache.org/viewvc/commons/proper/io/trunk/src/main/java/org/apache/commons/io/input/ReversedLinesFileReader.java?view=markup
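
P.S. The claim that a newline byte never appears inside a multi-byte character can also be spot-checked locally instead of reading the mapping tables. What follows is only a minimal sketch and not part of the proposed patch: the class name NewlineByteCheck and the method newlineFreeInMultiByte are made up for illustration. It encodes every BMP character with a given charset and reports whether any result of two or more bytes contains 0x0A (LF) or 0x0D (CR). It covers only the BMP and only what the local JVM's encoders produce, so it is a sanity check rather than a proof, and charsets the JVM does not ship are skipped.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper for illustration; not part of Commons IO.
class NewlineByteCheck {

    /**
     * Returns true if no BMP character that encodes to two or more bytes
     * has 0x0A (LF) or 0x0D (CR) among those bytes.
     */
    static boolean newlineFreeInMultiByte(Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        for (int c = 0; c <= 0xFFFF; c++) {
            if (Character.isSurrogate((char) c)) {
                continue; // a lone surrogate cannot be encoded
            }
            ByteBuffer bytes;
            try {
                // encode(CharBuffer) resets the encoder before each call
                bytes = encoder.encode(CharBuffer.wrap(new char[] {(char) c}));
            } catch (CharacterCodingException e) {
                continue; // character has no mapping in this charset
            }
            if (bytes.remaining() < 2) {
                continue; // single-byte encodings are not of interest here
            }
            while (bytes.hasRemaining()) {
                byte b = bytes.get();
                if (b == 0x0A || b == 0x0D) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] names = {
            "Shift_JIS", "windows-31j", "x-windows-949", "gbk", "x-windows-950"
        };
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available on this JVM");
                continue;
            }
            boolean ok = newlineFreeInMultiByte(Charset.forName(name));
            System.out.println(name + ": newline-free multi-byte sequences = " + ok);
        }
    }
}
```

UTF-16BE makes a handy negative control for the method: U+010A, for example, encodes as the bytes 01 0A, so the check returns false for that charset, which is consistent with it needing different handling from the single-byte-decrement case.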