Re: Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

2012-02-10 Thread Masayoshi Okutsu
I tend to agree with Sherman that the real problem is the 
OutputStreamWriter API which isn't good enough to handle various 
encodings. My understanding is that the charset API was introduced in 
1.4 to deal with the limitations of the java.io and other encoding 
handling issues.


I still don't think it's correct to change the flush semantics. The 
flush method should just flush out any outstanding data to the given 
output stream as defined in java.io.Writer. What if Writer.write(int) is 
used write UTF-16 data with any stateful encoding? Suppose the stateful 
encoding can support supplementary characters which require other G0 
designations, does the following calling sequence work?


writer.write(highSurrogate);
writer.flush();
writer.write(lowSurrogate);
writer.flush();

Of course, this isn't a problem with iso-2022-jp, though.

I think it's a correct fix, not a workaround, to create a filter stream 
to deal with stateful encodings with the java.io API. If it's OK to 
support only 1.4 and later, the java.nio.charset API should be used.


Thanks,
Masayoshi

On 2/10/2012 4:12 AM, Xueming Shen wrote:

CCed Bill Shannon.

On 02/09/2012 11:10 AM, Xueming Shen wrote:


CharsetEncoder has the "flush()" method as the last step (of a series 
of "encoding" steps) to
flush out any internal state to the output buffer. The issue here is 
the the upper level wrapper
class, OutputStreamWriter in our case, doesn't provide a "explicit" 
mechanism to let the
user to request a "flush" on the underlying encoder. The only 
"guaranteed' mechanism is the
"close()" method, in which it appears it not appropriate to invoke in 
some use scenario, such

as the JavaMail.writeTo() case.

It appears we are lacking of a "close this stream, but not the 
underlying stream" mechanism
in our layered/chained streams, I have similar request for this kind 
of mechanism in other area,
such as in zip/gzip stream, app wraps a "outputstream" with zip/gzip, 
they want to release the
zip/gzip layer after use (to release the native resource, for 
example) but want to keep the
underlying stream unclosed. The only workaround now is to wrap the 
underlying stream with
a subclass to override  the "close()" method, which is really not 
desirable.


The OutputStreamWriter.flush() does not explicitly specify in its API 
doc if it should actually
flush the underlying charset encoder (so I would not argue strongly 
that this IS a SE bug) but
given it is flushing it's buffer (internal status) and then the 
underlying "out" stream, it's
reasonable to consider that the "internal status" of its encoder also 
needs to be flushed.

Especially this has been the behavior for releases earlier than 1.4.2.

As I said,  while I have been hesitated to "fix" this problem for a 
while (one it has been here
for 3  major releases, two, the API does not explicitly say so) but 
as long as we don't have a
reasonable "close-ME-only" mechanism for those layered streams, it 
appears to be a
reasonable approach to solve the problem, without having obvious 
negative impact.


-Sherman

PS: There is another implementation "detail" that the original 
iso-2022-jp c2b converter
actually restores the state back to ASCII mode at the end of its 
"convert" method, this makes
the analysis a little complicated, but should not change the issue we 
are discussing)



On 02/09/2012 12:26 AM, Masayoshi Okutsu wrote:
First of all, is this really a Java SE bug? The usage of 
OutputSteamWriter in JavaMail seems to be wrong to me. The writeTo 
method in the bug report doesn't seem to be able to deal with any 
stateful encodings.


Masayoshi

On 2/9/2012 3:26 PM, Xueming Shen wrote:

Hi

This is a long standing "regression" from 1.3.1 on how 
OutputStreamWriter.flush()/flushBuffer()
handles escape or shift sequence in some of the charset/encoding, 
for example the ISO-2022-JP.


ISO-2022-JP is encoding that starts with ASCII mode and then 
switches between ASCII andJapanese
characters through an escape sequence. For example, the escape 
sequence ESC $ B (0x1B, 0x24 0x42)
is used to  indicate the following bytes are Japanese (switch from 
ASCII mode to Japanese mode), and

 the ESC ( B (0x1b  0x28  0x42) is used to switch back to ASCII.

In Java's sun.io.CharToByteConvert (old generation charset 
converter) and the nio.io.charset.CharsetEncoder
usually switches back forth between ASCII and Japanese modes based 
on the input character sequence
(for example, if you are in ASCII mode, and your next input 
character is a Japanese, you add the
ESC $ B into the output first and then followed the converted input 
character, or if you are in Japanese
mode and your next input is ASCII, you output ESC ( B first to 
switch the mode and then the ASCII) and
switch back to ASCII mode (if the last mode is non-Japanese) if 
either the encoding is ending or the

flush() method gets invoked.

In JDK1.3.1,  OutputStreamWriter.flushBuffer() explicitly invokes 
sun.io.c2b's flushAny() to switch
back to ASCII mode every 

Re: Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

2012-02-10 Thread Jason Mehrens

Sherman,
 
As a workaround, what about allowing a write of empty string or empty char 
array to call flushBuffer?  If you call PrintStream.print("") then flushBuffer 
is called on the internal writers.  But if you try the same by doing 
OuputStreamWriter.write("") the flushbuffer call is trapped by a zero len if 
condition in StreamEncoder.write(char[], int, int).  That seems inconsistent.  
If that check was removed then the workaround would be:
 
===
ByteArrayOutputStream bos = new ByteArrayOutputStream();
String str = "\u6700"; 
OutputStreamWriter osw = new OutputStreamWriter(bos, "iso-2022-jp");
osw.write(str, 0, str.length());
osw.write(""); //Flush buffer.
osw.flush(); //Flush stream.
==
 
It's not pretty but, it would allow the caller to choose when flushBuffer is 
called.
 
Jason
 

> Date: Wed, 8 Feb 2012 22:26:54 -0800
> From: xueming.s...@oracle.com
> To: core-libs-...@openjdk.java.net; i18n-dev@openjdk.java.net
> Subject: Codereview request for 6995537: different behavior in iso-2022-jp 
> encoding between jdk131/140/141 and jdk142/5/6/7
> 
> The solution is to re-store the "flush the encoder" mechanism in 
> StreamEncoder's flushBuffer().
> 
> I have been hesitated to make this change for a while, mostly because 
> this regressed behavior
> has been their for 3 releases, and the change triggers yet another 
> "behavior change". But given
> there is no obvious workaround and it only changes the behavior of the 
> charsets with this
> shift in/out mechanism, mainly the iso-2022 family and those IBM 
> EBCDIC_DBCS charsets, I
> decided to give it a try.
> 
> Here is the webreview
> 
> http://cr.openjdk.java.net/~sherman/6995537/webrev
> 
> Sherman
> 
  

Re: Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

2012-02-10 Thread Jason Mehrens

Sherman,
 
My mistake, I missed the fact that flushBuffer does not flush the encoder.  I 
incorrectly thought that write/print caused the encoder to flush and only the 
direct call to OSW.flush did not.
 
Jason
 



Date: Thu, 9 Feb 2012 11:29:04 -0800
From: xueming.s...@oracle.com
To: jason_mehr...@hotmail.com
CC: core-libs-...@openjdk.java.net; i18n-dev@openjdk.java.net
Subject: Re: Codereview request for 6995537: different behavior in iso-2022-jp 
encoding between jdk131/140/141 and jdk142/5/6/7


Jason,

I might be misunderstanding your suggestion, but the current implementation of
OutputStreamWriter.flushBuffer()/StreamWriter.implFlushBuffer() does not flush
the encoder, so even the caller can choose when to invoke flushBuffer(), it does
not solve the problem (flush() invokes flushBuffer() before it flushes the "out"
stream). The proposed the change is to put back the encoder.flush() into
osw.flushBuffer (in StreamWriter.implFlushBuffer()).

-Sherman


On 02/09/2012 10:24 AM, Jason Mehrens wrote: 



Sherman,
 
As a workaround, what about allowing a write of empty string or empty char 
array to call flushBuffer?  If you call PrintStream.print("") then flushBuffer 
is called on the internal writers.  But if you try the same by doing 
OuputStreamWriter.write("") the flushbuffer call is trapped by a zero len if 
condition in StreamEncoder.write(char[], int, int).  That seems inconsistent.  
If that check was removed then the workaround would be:
 
===
ByteArrayOutputStream bos = new ByteArrayOutputStream();
String str = "\u6700"; 
OutputStreamWriter osw = new OutputStreamWriter(bos, "iso-2022-jp");
osw.write(str, 0, str.length());
osw.write(""); //Flush buffer.
osw.flush(); //Flush stream.
==
 
It's not pretty but, it would allow the caller to choose when flushBuffer is 
called.
 
Jason
 

> Date: Wed, 8 Feb 2012 22:26:54 -0800
> From: xueming.s...@oracle.com
> To: core-libs-...@openjdk.java.net; i18n-dev@openjdk.java.net
> Subject: Codereview request for 6995537: different behavior in iso-2022-jp 
> encoding between jdk131/140/141 and jdk142/5/6/7
> 
> The solution is to re-store the "flush the encoder" mechanism in 
> StreamEncoder's flushBuffer().
> 
> I have been hesitated to make this change for a while, mostly because 
> this regressed behavior
> has been their for 3 releases, and the change triggers yet another 
> "behavior change". But given
> there is no obvious workaround and it only changes the behavior of the 
> charsets with this
> shift in/out mechanism, mainly the iso-2022 family and those IBM 
> EBCDIC_DBCS charsets, I
> decided to give it a try.
> 
> Here is the webreview
> 
> http://cr.openjdk.java.net/~sherman/6995537/webrev
> 
> Sherman
> 

  

Re: Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

2012-02-10 Thread Bill Shannon

If you flush the stream while in the middle of writing a "character",
I would expect the results to be undefined, just as if you closed the
stream at that point.  But at the end of a consistent set of data, I
would expect flush to behave like close, but without the actual closing
of the stream.

Masayoshi Okutsu wrote on 02/10/12 00:31:

I tend to agree with Sherman that the real problem is the OutputStreamWriter API
which isn't good enough to handle various encodings. My understanding is that
the charset API was introduced in 1.4 to deal with the limitations of the
java.io and other encoding handling issues.

I still don't think it's correct to change the flush semantics. The flush method
should just flush out any outstanding data to the given output stream as defined
in java.io.Writer. What if Writer.write(int) is used write UTF-16 data with any
stateful encoding? Suppose the stateful encoding can support supplementary
characters which require other G0 designations, does the following calling
sequence work?

writer.write(highSurrogate);
writer.flush();
writer.write(lowSurrogate);
writer.flush();

Of course, this isn't a problem with iso-2022-jp, though.

I think it's a correct fix, not a workaround, to create a filter stream to deal
with stateful encodings with the java.io API. If it's OK to support only 1.4 and
later, the java.nio.charset API should be used.

Thanks,
Masayoshi

On 2/10/2012 4:12 AM, Xueming Shen wrote:

CCed Bill Shannon.

On 02/09/2012 11:10 AM, Xueming Shen wrote:


CharsetEncoder has the "flush()" method as the last step (of a series of
"encoding" steps) to
flush out any internal state to the output buffer. The issue here is the the
upper level wrapper
class, OutputStreamWriter in our case, doesn't provide a "explicit" mechanism
to let the
user to request a "flush" on the underlying encoder. The only "guaranteed'
mechanism is the
"close()" method, in which it appears it not appropriate to invoke in some
use scenario, such
as the JavaMail.writeTo() case.

It appears we are lacking of a "close this stream, but not the underlying
stream" mechanism
in our layered/chained streams, I have similar request for this kind of
mechanism in other area,
such as in zip/gzip stream, app wraps a "outputstream" with zip/gzip, they
want to release the
zip/gzip layer after use (to release the native resource, for example) but
want to keep the
underlying stream unclosed. The only workaround now is to wrap the underlying
stream with
a subclass to override the "close()" method, which is really not desirable.

The OutputStreamWriter.flush() does not explicitly specify in its API doc if
it should actually
flush the underlying charset encoder (so I would not argue strongly that this
IS a SE bug) but
given it is flushing it's buffer (internal status) and then the underlying
"out" stream, it's
reasonable to consider that the "internal status" of its encoder also needs
to be flushed.
Especially this has been the behavior for releases earlier than 1.4.2.

As I said, while I have been hesitated to "fix" this problem for a while (one
it has been here
for 3 major releases, two, the API does not explicitly say so) but as long as
we don't have a
reasonable "close-ME-only" mechanism for those layered streams, it appears to
be a
reasonable approach to solve the problem, without having obvious negative
impact.

-Sherman

PS: There is another implementation "detail" that the original iso-2022-jp
c2b converter
actually restores the state back to ASCII mode at the end of its "convert"
method, this makes
the analysis a little complicated, but should not change the issue we are
discussing)


On 02/09/2012 12:26 AM, Masayoshi Okutsu wrote:

First of all, is this really a Java SE bug? The usage of OutputSteamWriter
in JavaMail seems to be wrong to me. The writeTo method in the bug report
doesn't seem to be able to deal with any stateful encodings.

Masayoshi

On 2/9/2012 3:26 PM, Xueming Shen wrote:

Hi

This is a long standing "regression" from 1.3.1 on how
OutputStreamWriter.flush()/flushBuffer()
handles escape or shift sequence in some of the charset/encoding, for
example the ISO-2022-JP.

ISO-2022-JP is encoding that starts with ASCII mode and then switches
between ASCII andJapanese
characters through an escape sequence. For example, the escape sequence ESC
$ B (0x1B, 0x24 0x42)
is used to indicate the following bytes are Japanese (switch from ASCII
mode to Japanese mode), and
the ESC ( B (0x1b 0x28 0x42) is used to switch back to ASCII.

In Java's sun.io.CharToByteConvert (old generation charset converter) and
the nio.io.charset.CharsetEncoder
usually switches back forth between ASCII and Japanese modes based on the
input character sequence
(for example, if you are in ASCII mode, and your next input character is a
Japanese, you add the
ESC $ B into the output first and then followed the converted input
character, or if you are in Japanese
mode and your next input is ASCII, you output ESC ( B first to switch the
mode and t

Re: Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

2012-02-10 Thread Xueming Shen



On 2/10/2012 12:31 AM, Masayoshi Okutsu wrote:
I tend to agree with Sherman that the real problem is the 
OutputStreamWriter API which isn't good enough to handle various 
encodings. My understanding is that the charset API was introduced in 
1.4 to deal with the limitations of the java.io and other encoding 
handling issues.


I still don't think it's correct to change the flush semantics. The 
flush method should just flush out any outstanding data to the given 
output stream as defined in java.io.Writer. What if Writer.write(int) 
is used write UTF-16 data with any stateful encoding? Suppose the 
stateful encoding can support supplementary characters which require 
other G0 designations, does the following calling sequence work?


writer.write(highSurrogate);
writer.flush();
writer.write(lowSurrogate);
writer.flush();


No it does not work. But I would be less concerned with such a charset 
that we don't have anywhere around, yet.
The real concern is that if you invoke the above sequence, the 
implementation actually "buffered" the highSurr
in its internal field "leftoverChar", and you will get "incompatible" 
result for above invocation (for charset that
can handle surrogates, such as UTF8),   "leftoverChar" would be process 
as a single surrogate, if you "flush" the
osw before write the low surr.  But, to save the "leftover" as its 
internal status is kinda against OutputStreamWriter's
class spec "Note that the characters passed to the write() methods are 
not buffered", though I don't see any
better solution for this scenario (you really don't want to have 
OutputSteamWriterto have an explicit interface

to handle CoderResult...)

That said, the spec also specifies that

"A /malformed surrogate element/ is a high surrogate that is not 
followed by a low surrogate or a low surrogate

that is not preceded by a high surrogate."

So arguably,  based on the spec, you are not supposed to invoke 
"flush()" between two paired surrogates, if you
want them to be treated as a pair of surrogate for a supplementary 
character.


This is what I have been debating with myself for months. As I said in 
my previous email, one alternative is to have
a "close ME only" method for layered streams, but it's not going to 
solve the problems for any previous releases,
we are talking about 1.4.x, 1.5, 6, and 7. Another ugly one is to have a 
"system property" to switch the behavior.


I'm not sure I understand your suggestion of "create a filter...",  are 
you suggesting to have a new filter stream
class in java.io to handle the "stateful encodings", or you are 
suggesting the app like JavaMail should do the filter

stream subclass to deal with this issue?

-Sherman




Of course, this isn't a problem with iso-2022-jp, though.

I think it's a correct fix, not a workaround, to create a filter 
stream to deal with stateful encodings with the java.io API. If it's 
OK to support only 1.4 and later, the java.nio.charset API should be 
used.


Thanks,
Masayoshi

On 2/10/2012 4:12 AM, Xueming Shen wrote:

CCed Bill Shannon.

On 02/09/2012 11:10 AM, Xueming Shen wrote:


CharsetEncoder has the "flush()" method as the last step (of a 
series of "encoding" steps) to
flush out any internal state to the output buffer. The issue here is 
the the upper level wrapper
class, OutputStreamWriter in our case, doesn't provide a "explicit" 
mechanism to let the
user to request a "flush" on the underlying encoder. The only 
"guaranteed' mechanism is the
"close()" method, in which it appears it not appropriate to invoke 
in some use scenario, such

as the JavaMail.writeTo() case.

It appears we are lacking of a "close this stream, but not the 
underlying stream" mechanism
in our layered/chained streams, I have similar request for this kind 
of mechanism in other area,
such as in zip/gzip stream, app wraps a "outputstream" with 
zip/gzip, they want to release the
zip/gzip layer after use (to release the native resource, for 
example) but want to keep the
underlying stream unclosed. The only workaround now is to wrap the 
underlying stream with
a subclass to override  the "close()" method, which is really not 
desirable.


The OutputStreamWriter.flush() does not explicitly specify in its 
API doc if it should actually
flush the underlying charset encoder (so I would not argue strongly 
that this IS a SE bug) but
given it is flushing it's buffer (internal status) and then the 
underlying "out" stream, it's
reasonable to consider that the "internal status" of its encoder 
also needs to be flushed.

Especially this has been the behavior for releases earlier than 1.4.2.

As I said,  while I have been hesitated to "fix" this problem for a 
while (one it has been here
for 3  major releases, two, the API does not explicitly say so) but 
as long as we don't have a
reasonable "close-ME-only" mechanism for those layered streams, it 
appears to be a
reasonable approach to solve the problem, without having obvious 
negative impact.


-Sherman

PS: There is a