Hi.

The appropriate list is core-libs-dev, where this discussion should continue.

System.in is the standard input, which may or may not be the keyboard. For 
keyboard input, take a look at the java.io.Console class [1], in particular its 
charset and reader methods.

[1]: 
https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/io/Console.html
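
For illustration, a minimal sketch of what reading interactive input through Console might look like. The class and method names (ConsoleInput, interactiveReader) are made up for this example, and the fallback to System.in with the default charset is an assumption, not something the Console documentation prescribes. Console.charset() requires JDK 17 or later.

```java
import java.io.BufferedReader;
import java.io.Console;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

class ConsoleInput {
    // Returns a character reader for interactive input. When no console is
    // attached (for example, input redirected from a file or pipe), this
    // sketch falls back to System.in decoded with the default charset.
    static Reader interactiveReader() {
        Console console = System.console();
        if (console != null) {
            return console.reader();
        }
        return new InputStreamReader(System.in, Charset.defaultCharset());
    }

    public static void main(String[] args) throws IOException {
        Console console = System.console();
        if (console != null) {
            // Console.charset() reports the charset the console itself uses.
            System.out.println("Console charset: " + console.charset());
        }
        BufferedReader in = new BufferedReader(interactiveReader());
        System.out.print("Name: ");
        System.out.flush();
        String line = in.readLine();
        System.out.println("Hello, " + line);
    }
}
```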

— Ron

On 13 Oct 2022, at 16:20, Reinier Zwitserloot 
<rein...@zwitserloot.com> wrote:

PREAMBLE: I’m not entirely certain amber-dev is the appropriate venue. If not, 
where should this be discussed? It’s not quite a bug but nearly so, and not 
quite a simple feature request either.

JDK18 brought JEP400 which changes the default charset encoding to UTF-8. This, 
probably out of necessity, goes quite far, in that Charset.defaultCharset() is 
now more or less a constant - always returns UTF_8. It’s now quite difficult to 
retrieve the OS-configured encoding (the ’native’ encoding).

However, that does mean one of the most common lines in all of Java’s history 
is now necessarily buggy: new Scanner(System.in) is now broken. Always, unless 
your docs specifically state that you must feed the app UTF-8 data. Linting 
tools ought to flag it as incorrect. It’s incorrect in a nasty way too: 
Initially it seems to work fine, but if you’re on an OS whose native encoding 
isn’t UTF-8, this is subtly broken; enter non-ASCII characters on the command 
line and the app doesn’t handle them appropriately. A bug that is literally 
utterly undiscoverable on macs and most linux computers, even. How can you 
figure out your code is broken if all the machines you test it on use UTF-8 as 
an OS default?

This affects beginning java programmers particularly (who tend to be writing 
some command line-interactive apps at first). In light of Brian Goetz’s post 
“Paving the Onramp” (https://openjdk.org/projects/amber/design-notes/on-ramp) - 
the experience for new users is evidently of some importance to the OpenJDK 
team. In light of that, the current state of writing command line interactive 
java apps is inconsistent with that goal.

The right way to read system input in a way that works in both pre- and 
post-JEP400 JVM editions appears to be, as far as I can tell:


Charset nativeCharset = Charset.forName(System.getProperty("native.encoding", 
Charset.defaultCharset().name()));
Scanner sc = new Scanner(System.in, nativeCharset);

I’ll risk the hyperbole: That’s.. atrocious. Hopefully I’m missing something!

JEP 400 breaks _thousands_ of blogs, tutorials, Stack Overflow answers, and 
books in the process: everything that contains new Scanner(System.in). Even 
System.in interaction that doesn’t use Scanner is likely broken; the general 
strategy there is:


new InputStreamReader(System.in);

which suffers from the same problem.
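
As a rough sketch of the non-Scanner case, the reader can be given the native charset explicitly. NativeStdin, nativeCharset and reader are hypothetical names chosen for this example. The fallback to Charset.defaultCharset() assumes that on JDKs without the native.encoding property (before JDK 17) the default charset still reflects the platform encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class NativeStdin {
    // Resolves the OS-configured charset. The "native.encoding" system property
    // is set by the JDK from 17 on; when it is missing or names a charset this
    // JVM does not support, fall back to Charset.defaultCharset().
    static Charset nativeCharset() {
        String name = System.getProperty("native.encoding");
        if (name == null) {
            return Charset.defaultCharset();
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return Charset.defaultCharset();
        }
    }

    // Wraps a byte stream in a line-oriented reader that decodes with the
    // native charset instead of the JVM default.
    static BufferedReader reader(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, nativeCharset()));
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = reader(System.in);
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println("read: " + line);
        }
    }
}
```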

I see a few directions for trying to address this; I’m not quite sure which way 
would be most appropriate:


  *   Completely re-work keyboard input, in light of Paving the on-ramp. 
Scanner has always been a problematic API if used for keyboard input, in that 
the default delimiter isn’t convenient. I think the single most common beginner 
java stackoverflow question is the bizarre interaction between scanner’s 
nextLine() and scanner’s next(), and to make matters considerably worse, the 
proper fix (which is to call .useDelimiter("\\R") on the scanner first) is 
mentioned in fewer than 1% of answers; the vast majority of tutorials and 
answers tell you to call .nextLine() after every .nextX() call, a suboptimal 
suggestion (it now means using space to delimit your input is broken). Scanner 
is now also quite inconsistent: The constructor goes for ‘internet standard’, 
using UTF-8 as a default even if the OS does not, but the locale does go by 
platform default, which affects double parsing amongst other things: 
scanner.nextDouble() will require you to use a comma as the decimal separator 
if your OS is configured to use the Dutch locale, for example. It’s weird that 
scanner neither fully 
follows common platform-independent expectations (english locale, UTF-8), nor 
local-platform expectation (OS-configured locale and OS-configured charset). 
One way out is to make a new API for ‘command line apps’ and take into account 
Paving the on-ramp’s plans when designing it.
  *   Rewrite specifically the new Scanner(InputStream) constructor as 
defaulting to native encoding even when everything else in java defaults to 
UTF-8 now, because that constructor is 99% used for System.in. Scanner has its 
own File-based constructor, so new Scanner(Files.newInputStream(..)) is quite 
rare.
  *   Define that constructor to act as follows: the charset used is the 
platform default (i.e., from JDK18 and up, UTF-8), unless arg == System.in is 
true, in which case the scanner uses native encoding. This is a bit bizarre to 
write in the spec but does the right thing in the most circumstances and 
unbreaks thousands of tutorials, blogs, and answer sites, and is most 
convenient to code against. That’s usually the case with voodoo magic (because 
this surely risks being ’too magical’): It’s convenient and does the right 
thing almost always, at the risk of being hard to fathom and producing 
convoluted spec documentation.
  *   Attack the problem that what’s really broken isn’t so much scanner, it’s 
System.in itself: byte based, of course, but now that all java methods default 
to UTF-8, almost all interactions with it (given that most System.in 
interaction is char-based, not byte-based) are now also broken. Create a second 
field or method in System that gives you a Reader instead of an InputStream, 
with the OS-native encoding applied to make it. This still leaves those 
thousands of tutorials broken, but at least the proper code is now simply new 
Scanner(System.charIn()) or whatnot, instead of the atrocious snippet above.
  *   Even less impactful, make a new method in Charset to get the native 
encoding without having to delve into System.getProperty(). 
Charset.nativeEncoding() seems like a method that should exist. Unfortunately 
this would be of no help to create code that works pre- and post-JEP400, but in 
time, having code that only works post-JEP400 is fine, I assume.
  *   Create a new concept ‘represents a stream that would use platform native 
encoding if characters are read/written to it’, have System.in return true for 
this, and have filterstreams like BufferedInputStream just pass the call 
through, then redefine relevant APIs such as Scanner and PrintStream (e.g. 
anything that internalises conversion from bytes to characters) to pick charset 
encoding (native vs UTF8) based on that property. This is a more robust take on 
‘new Scanner(System.in) should do the right thing'. Possibly the in/out/err 
streams that Process gives you should also have this flag set.
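
The two Scanner behaviours described in the first bullet (the line delimiter and the locale-dependent number parsing) can be sketched on in-memory strings, which sidesteps the charset question entirely. ScannerPitfalls and its method names are made up for this illustration; the Dutch locale is taken from the example in the text.

```java
import java.util.Locale;
import java.util.Scanner;

class ScannerPitfalls {
    // Reads an int followed by a full line of text. With "\\R" as delimiter
    // every line is one token, so next() returns the whole line and no extra
    // nextLine() call is needed after nextInt().
    static String[] readAgeAndName(String input) {
        Scanner sc = new Scanner(input).useDelimiter("\\R").useLocale(Locale.ROOT);
        int age = sc.nextInt();
        String name = sc.next();
        return new String[] { Integer.toString(age), name };
    }

    // Parses one double using an explicit locale rather than the platform
    // default, so the accepted decimal separator is predictable.
    static double parseWith(Locale locale, String input) {
        return new Scanner(input).useLocale(locale).nextDouble();
    }

    public static void main(String[] args) {
        String[] parsed = readAgeAndName("42\nAda Lovelace\n");
        System.out.println(parsed[0] + " / " + parsed[1]);
        System.out.println(parseWith(Locale.US, "1.5"));
        System.out.println(parseWith(Locale.forLanguageTag("nl-NL"), "1,5"));
    }
}
```

Pinning the locale with useLocale is one way to make nextDouble() behave the same on every machine; whether that or the OS-configured locale is the better default is exactly the open question raised above.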


If it was up to me, I think a multitude of steps are warranted, each relatively 
simple.


  *   Create Charset.nativeEncoding(), which simply returns 
Charset.forName(System.getProperty("native.encoding")), but with the advantage 
that it’s shorter, doesn’t require knowing a magic string, and will fail at 
compile time if compiled against versions that predate the existence of the 
native.encoding property, instead of throwing an exception at runtime.
  *   Create System.charIn(), which just returns an InputStreamReader wrapped 
around System.in, but with native encoding applied.
  *   Put the job of how java apps do basic command line stuff on the agenda as 
a thing that should probably be addressed in the next 5 years or so, maybe 
after the steps laid out in Paving the on-ramp are more fleshed out.
  *   In order to avoid problems, before the next LTS goes out, re-spec new 
Scanner(System.in) to default to native encoding, specifically when the passed 
inputstream is identical to System.in. Don’t bother with trying to introduce an 
abstracted ‘prefers native encoding’ flag system.
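
To make the proposed steps concrete, here is a user-land sketch of what they might look like. Nothing in it is existing JDK API: nativeEncoding(), charIn(), charsetFor() and scannerFor() are hypothetical stand-ins for the proposed Charset.nativeEncoding(), System.charIn() and the re-specified Scanner constructor, and the sketch assumes JDK 17 or later, where the native.encoding property is defined.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.util.Scanner;

// Hypothetical stand-ins for the proposed additions; none of these methods
// exist in the JDK under these names.
class ProposedApiSketch {
    // Stand-in for the proposed Charset.nativeEncoding(). Assumes JDK 17+,
    // where the "native.encoding" system property is defined.
    static Charset nativeEncoding() {
        return Charset.forName(System.getProperty("native.encoding"));
    }

    // Stand-in for the proposed System.charIn().
    static Reader charIn() {
        return new InputStreamReader(System.in, nativeEncoding());
    }

    // The identity check from the last bullet: native encoding only when the
    // stream is System.in itself, the default charset otherwise.
    static Charset charsetFor(InputStream in) {
        return in == System.in ? nativeEncoding() : Charset.defaultCharset();
    }

    // Stand-in for the re-specified new Scanner(InputStream) behaviour.
    static Scanner scannerFor(InputStream in) {
        return new Scanner(in, charsetFor(in));
    }

    public static void main(String[] args) {
        Scanner sc = scannerFor(System.in);
        while (sc.hasNextLine()) {
            System.out.println(sc.nextLine());
        }
    }
}
```

Note that the identity check stops working as soon as System.in is wrapped, for example in a BufferedInputStream, which is the gap the "prefers native encoding" flag idea was meant to close.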

 --Reinier Zwitserloot
