Re: RFR: 8195686: ISO-8859-8-i charset cannot be decoded, should be mapped to ISO-8859-8

Jeremie Miserez Thu, 03 Oct 2024 01:55:59 -0700

On Fri, 23 Aug 2024 10:38:38 GMT, Pratiksha.Sawant <d...@openjdk.org> wrote:


> Mapping ISO-8859-8-I charset to ISO-8859-8.
> Below mentioned 2 aliases are added as part of this:-
> **ISO-8859-8-I**
> **ISO8859-8-I**
> 
> The bug report for the same:- https://bugs.openjdk.org/browse/JDK-8195686

One more thing: I forgot to explain why the alias ISO-8859-8-i -> ISO-8859-8 
would definitely be correct.

Java strings are stored in logical order, for both LTR and RTL languages. 
This is apparent from the OpenJDK String source code, and it is also 
explained explicitly in official tutorials, for example: 
https://docs.oracle.com/javase/tutorial/2d/text/textlayoutbidirectionaltext.html#ordering_text

ISO-8859-8-i texts are always sent in logical order (by definition). **So 
decoding an ISO-8859-8-i text into a Java string via an alias to ISO-8859-8 
yields the characters in logical order, which is exactly the order a Java 
string is expected to have. The result is therefore correct by definition.**
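
To illustrate the point, here is a minimal sketch. It assumes the runtime 
provides the ISO-8859-8 charset (on some runtimes it lives in the 
jdk.charsets module); the class name `LogicalOrderDecode` and the sample 
word are made up for illustration. The decoder maps each byte to one char 
and does not reorder anything, so logical-order input stays in logical order:

```java
import java.nio.charset.Charset;

class LogicalOrderDecode {
    // Decodes ISO-8859-8 bytes. Each byte becomes one char, in input order.
    static String decode(byte[] bytes) {
        return new String(bytes, Charset.forName("ISO-8859-8"));
    }

    public static void main(String[] args) {
        // The Hebrew word "shalom" in logical order:
        // shin (0xF9), lamed (0xEC), vav (0xE5), final mem (0xED)
        byte[] logical = {(byte) 0xF9, (byte) 0xEC, (byte) 0xE5, (byte) 0xED};
        String s = decode(logical);
        // Print the code points in string order to show nothing was reordered.
        for (char c : s.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```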

Technically speaking, and for completeness' sake, here is the full list of 
cases for regular ISO-8859-8 today:

1. An ISO-8859-8 text may contain LTR language content, in which case the 
text is correctly decoded to a Java string in logical order. -> OK
2. An ISO-8859-8 text may also contain RTL language content in logical order 
(newer applications already do this), in which case the text is also correctly 
decoded to a Java string in logical order. -> OK (this is the case that applies 
to ISO-8859-8-i input once the alias is added)
3. But: If an ISO-8859-8 text contains RTL language content in visual order 
(old applications, historically the case), the text is decoded to a Java 
string in visual order. This is technically incorrect and may be a source of 
bugs (e.g. concatenation won't work correctly). However, this behavior cannot 
be changed in OpenJDK anymore, as (old) applications may rely on it.
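
As a rough sketch of why case 3 breaks concatenation: the class 
`VisualOrderPitfall`, the helper names and the sample words below are 
hypothetical, and plain string reversal is used as a stand-in for 
visual/logical conversion, which only holds for pure-Hebrew runs without 
digits or Latin text. It again assumes the ISO-8859-8 charset is available.

```java
import java.nio.charset.Charset;

class VisualOrderPitfall {
    static final Charset HEBREW = Charset.forName("ISO-8859-8");

    // Convenience helper: decode a list of byte values as ISO-8859-8.
    static String decode(int... values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, HEBREW);
    }

    // Naive visual <-> logical conversion (assumption: pure-Hebrew text only).
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        // Two words as sent by a newer application (logical order).
        String a = decode(0xF9, 0xEC, 0xE5, 0xED);
        String b = decode(0xF2, 0xE5, 0xEC, 0xED);
        // The same words as sent by an old application (visual order).
        String visualA = decode(0xED, 0xE5, 0xEC, 0xF9);
        String visualB = decode(0xED, 0xEC, 0xE5, 0xF2);

        // Concatenating visual-order strings and converting afterwards
        // swaps the words: the result is b + a, not a + b.
        String fromVisual = reverse(visualA + visualB);
        System.out.println(fromVisual.equals(a + b)); // false
        System.out.println(fromVisual.equals(b + a)); // true
    }
}
```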

So: As long as nobody adds an "auto-reverse visual to logical order" heuristic 
for RTL ISO-8859-8 text decoding in OpenJDK (which I'm fairly certain can't / 
mustn't be done), a simple alias ISO-8859-8-i -> ISO-8859-8 will always be 
correct. The alias results in case 2, i.e. texts will always be decoded into 
the correct Java string in logical order.
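
For applications that have to run on a JDK without the alias, the same 
mapping can be approximated in user code. This is only a sketch of a 
possible workaround, not part of the PR; the class `HebrewCharsetResolver` 
and its `resolve` method are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;

class HebrewCharsetResolver {
    // Resolves a charset name, falling back to ISO-8859-8 for the two
    // logical-order names when the running JDK does not know them.
    static Charset resolve(String name) {
        try {
            return Charset.forName(name);
        } catch (UnsupportedCharsetException e) {
            if (name.equalsIgnoreCase("ISO-8859-8-I")
                    || name.equalsIgnoreCase("ISO8859-8-I")) {
                // Same byte-to-char mapping; only the declared order differs.
                return Charset.forName("ISO-8859-8");
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("ISO-8859-8-I").name());
    }
}
```

On a JDK that already has the alias, `Charset.forName` succeeds directly and 
the fallback branch is never taken.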

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20690#issuecomment-2390872037
