Note that Java strings DO allow the presence of lone surrogates, as well as non-characters, because Java strings are unrestricted vectors of 16-bit code units (non-BMP characters are handled as pairs of surrogates).
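As a small illustration of that point, here is a sketch (the class and method names are made up for this example) showing that a Java string accepts a lone surrogate without complaint, and that the problem only surfaces when you ask a strict encoder to convert it. It assumes nothing beyond the standard java.nio.charset API.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class LoneSurrogateDemo {

    /** Returns true if s contains a surrogate code unit outside a well-formed pair. */
    static boolean hasLoneSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if a strict (error-reporting) UTF-8 encoder accepts s. */
    static boolean isEncodableAsUtf8(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String lone = "a\uD800b"; // lone high surrogate between two letters
        System.out.println(lone.length());           // 3 code units, stored as is
        System.out.println(hasLoneSurrogate(lone));  // true
        System.out.println(isEncodableAsUtf8(lone)); // false: strict encoder rejects it
        // String.getBytes never throws; it substitutes the charset's replacement byte.
        byte[] lenient = lone.getBytes(StandardCharsets.UTF_8);
        System.out.println(lenient[1] == (byte) '?'); // true
    }
}
```

Note that the non-character U+FFFF is not rejected by the strict encoder: it is a valid scalar value, so only the lone surrogate makes the string ill-formed as UTF-16.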
In those conditions, normalizing the Java string will either leave those lone surrogates (and non-characters) as is or throw an exception, depending on the API used. Java strings do not have any implied encoding: their "char" members are also unrestricted 16-bit code units. The char-based methods of the built-in Character class describe basic properties for the BMP only; for non-BMP characters you need the code point (int) overloads of those methods or, for fuller property coverage, a library such as ICU4J.

This is essentially the same kind of string as C/C++ "wide" strings using a 16-bit wchar_t, except that:

- C/C++ wide strings do not allow the inclusion of U+0000, which is a terminator, unless you use a string class that keeps the actual string length (and not just the allocated buffer length, which may be larger).
- Java strings, including literals, are immutable, and optionally atomized (interned) into a global dictionary. That dictionary includes all string literals, so that multiple instances with equal contents share their storage, including across distinct classes from distinct packages.
- String literals (which are all immutable and atomized) are initialized from the compiled bytecode of classes using a modified version of UTF-8 that preserves all 16-bit code units (including lone surrogates and non-characters like U+FFFF) but stores U+0000 as <0xC0,0x80>. This modified UTF-8 encoding is also what you get if you use the JNI interface version with 8-bit strings (this internally requires a conversion by JNI, using a temporary buffer). If you use the JNI interface version with 16-bit strings, you work directly with the internal 16-bit Java strings and there is no conversion: you will also get the lone surrogates and all non-characters, and you are not restricted to valid UTF-16.
- Java strings are commonly used for fast initialization of large immutable binary arrays, because the conversion from modified UTF-8 to 16-bit strings does not require running any compiled bytecode. This is not true for other static arrays, which require large code for array literals and are not guaranteed to be immutable; the alternative to this large compiled code is to initialize those large static arrays by I/O from an external stream, such as a file beside the class in the same package, possibly packed in the same JAR.

Java passwords are "strings", but they can still include arbitrary 16-bit code units, even if these violate UTF-16 restrictions. You will not get much difference if you use byte arrays; the only change is the size of the code units. Between those two representations you are free to convert with ANY pair of encodings, not just by assuming UTF-8<>UTF-16.

However, for security reasons, it's best to avoid string literals for passwords, because they can be enumerated from the global dictionary of atomized strings, or read directly from the bytecode of the compiled class, where they are stored in modified UTF-8 but loaded and used as arbitrary 16-bit strings. The same is true if you use a byte array literal: you can just parse the initialization bytecode to get the list of bytes. If passwords or authorization keys are stored somewhere (as strings or as byte arrays), they should be encrypted in safe storage and not kept in static string literals or byte array initializers (BOTH will be clear text in the bytecode of the compiled class).
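The modified UTF-8 form described above can be observed without dissecting a class file. The sketch below assumes that DataOutputStream.writeUTF uses the same modified UTF-8 as the class-file constant pool (which is how I read the documentation); the class name and helper methods are invented for the example. It writes a string containing U+0000, a lone surrogate and U+FFFF, dumps the bytes, and reads them back unchanged.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of any library.
class ModifiedUtf8Demo {

    /** Encodes s with writeUTF: a 2-byte big-endian length, then modified UTF-8. */
    static byte[] toModifiedUtf8(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(s);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decodes bytes produced by toModifiedUtf8 back into a string. */
    static String fromModifiedUtf8(byte[] data) {
        try {
            return new DataInputStream(new ByteArrayInputStream(data)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Formats bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u0000\uD800\uFFFF"; // NUL, lone high surrogate, non-character
        byte[] encoded = toModifiedUtf8(s);
        // Length prefix 00 08, then C0 80 (NUL), ED A0 80 (U+D800), EF BF BF (U+FFFF).
        System.out.println(hex(encoded)); // 00 08 C0 80 ED A0 80 EF BF BF
        System.out.println(fromModifiedUtf8(encoded).equals(s)); // true
    }
}
```

Every 16-bit code unit survives the round trip, which is the property that makes this encoding usable for arbitrary string data, and also the reason a literal stored this way is readable by anyone who has the class file.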
In both cases, there is NO normalization applied implicitly or checked/enforced by the API. The only check that occurs is at class loading time, for the modified UTF-8 encoding of string literals: if it is wrong, the class will not load at all and you'll get an invalid class exception. There is no such check at all for the encoding of byte array initializers; the only checks are the validity of the Java initializer bytecode and the bounds of the array indexes used by the initializer code.

2015-10-06 5:39 GMT+02:00 Martin J. Dürst <[email protected]>:

> On 2015/10/01 13:11, Jonathan Rosenne wrote:
>
>> For languages such as Java, passwords should be handled as byte arrays
>> rather than strings. This may make it difficult to apply normalization.
>
> Well, they should be received from the user interface as strings, then
> normalized, then converted to byte arrays using a well-defined single
> encoding. Somewhat tedious, but hopefully not difficult.
>
> Regards, Martin.
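For what it's worth, the sequence Martin describes in the quoted message (receive as a string, normalize, then encode with one well-defined encoding) could be sketched roughly as follows. The choice of NFC and UTF-8 is my assumption, not something fixed by the thread, and the class and method names are invented. Because nothing in the API enforces well-formed UTF-16, the sketch checks for lone surrogates itself before normalizing.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper class, not part of any library.
class PasswordBytes {

    /** Returns true if every surrogate in chars belongs to a well-formed pair. */
    static boolean isWellFormedUtf16(char[] chars) {
        for (int i = 0; i < chars.length; i++) {
            if (Character.isHighSurrogate(chars[i])) {
                if (i + 1 == chars.length || !Character.isLowSurrogate(chars[i + 1])) {
                    return false;
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(chars[i])) {
                return false;
            }
        }
        return true;
    }

    /**
     * Normalizes the password to NFC (assumed form) and encodes it as UTF-8
     * (assumed encoding). Rejects input that is not well-formed UTF-16.
     */
    static byte[] toNormalizedUtf8(char[] password) {
        if (!isWellFormedUtf16(password)) {
            throw new IllegalArgumentException("password is not well-formed UTF-16");
        }
        // Normalizer returns a String, so an immutable copy exists from here on.
        String nfc = Normalizer.normalize(CharBuffer.wrap(password), Normalizer.Form.NFC);
        ByteBuffer encoded = StandardCharsets.UTF_8.encode(nfc);
        byte[] result = new byte[encoded.remaining()];
        encoded.get(result);
        return result;
    }

    public static void main(String[] args) {
        byte[] decomposed = toNormalizedUtf8("e\u0301".toCharArray()); // e + combining acute
        byte[] precomposed = toNormalizedUtf8("\u00E9".toCharArray()); // precomposed e-acute
        System.out.println(java.util.Arrays.equals(decomposed, precomposed)); // true
    }
}
```

The weak spot is visible in the code: java.text.Normalizer hands back an immutable String, so the normalized password exists as a string at least briefly. That is presumably the difficulty Jonathan had in mind, and the sketch does not solve it.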

