Hello!

We already have a warning about this; see IgniteKernal.checkFileEncoding().
Regards,
-- 
Ilya Kasnacheev


Mon, Dec 13, 2021 at 16:26, Ivan Daschinsky <ivanda...@gmail.com>:

> >> But now multiple components
> >> independently serialize strings for their needs and use default encoding
> >> for this.
> >> For example DirectByteBufferStreamImplV2#writeString,
> >> MetaStorage#writeRaw and so on
> We should fix all of them.
>
> >> BinaryUtils#utf8BytesToStr
> Let's use this everywhere.
>
> As for me, I expect far more problems from enforcing a rule that fails
> the node join than from making all components use UTF-8.
> Some weird cases (surrogate pairs) we can simply not consider at all
> (I strongly believe that is OK).
>
> Mon, Dec 13, 2021 at 15:15, Nikolay Izhikov <nizhi...@apache.org>:
>
> > > Does Java String support all unicode characters and particularly does it
> > > support more characters than UTF-8
> >
> > It’s not about Java, it’s about the UTF-8 standard.
> >
> > Please, take a look at [1]
> >
> > > In November 2003, UTF-8 was restricted by RFC 3629 to match the
> > > constraints of the UTF-16 character encoding: explicitly prohibiting code
> > > points corresponding to the high and low surrogate characters removed more
> > > than 3% of the three-byte sequences, and ending at U+10FFFF removed more
> > > than 48% of the four-byte sequences and all five- and six-byte sequences.
> >
> > And [2]
> >
> > > The definition of UTF-8 prohibits encoding character numbers between
> > > U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form
> > > (as surrogate pairs) and do not directly represent characters.
> >
> > Actually, we already have some modes to support this restriction of UTF-8.
> > Please, take a look at BinaryUtils#utf8BytesToStr [3]
> >
> >
> > [1] https://en.wikipedia.org/wiki/UTF-8
> > [2] https://datatracker.ietf.org/doc/html/rfc3629
> > [3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387
> >
> > > On Dec 13, 2021, at 13:57, Ivan Pavlukhin <vololo...@gmail.com> wrote:
> > >
> > >> UTF-8 can’t encode all UNICODE characters.
> > >
> > > Nikolay, could you please elaborate? My understanding is that the encoding
> > > we speak about matters for conversion from byte arrays to strings.
> > > Does Java String support all Unicode characters, and in particular does
> > > it support more characters than UTF-8 (I am not saying here that Java
> > > String uses UTF-8)?
> > >
> > > 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky <ivanda...@gmail.com>:
> > >> UTF-8 is already a default encoding in our BinaryObject format. So.... I am
> > >> for unification.
> > >>
> > >> Mon, Dec 13, 2021 at 12:50, Nikolay Izhikov <nizhi...@apache.org>:
> > >>
> > >>> Hello, Ivan.
> > >>>
> > >>> UTF-8 can’t encode all UNICODE characters.
> > >>>
> > >>>> On Dec 13, 2021, at 12:49, Ivan Daschinsky <ivanda...@gmail.com> wrote:
> > >>>>
> > >>>> Khm, maybe a better variant is to enforce all strings to be encoded in
> > >>>> UTF-8?
> > >>>> AFAIK a multi-OS cluster is quite a common case.
> > >>>>
> > >>>>
> > >>>> Mon, Dec 13, 2021 at 11:36, Mikhail Petrov <pmgheap....@gmail.com>:
> > >>>>
> > >>>>> Igniters,
> > >>>>>
> > >>>>> Recently we faced the problem that if the cluster consists of nodes
> > >>>>> running in JVMs with different encodings, many issues arise.
> > >>>>> The root cause of the mentioned issues is components that use
> > >>>>> `String#getBytes()` and `new String(<byte array>)`, which rely on
> > >>>>> the system default encoding. Thus, if a string is deserialized on a node
> > >>>>> with a different encoding from the one that serialized it, the
> > >>>>> deserialized string can be different from the original one.
> > >>>>>
> > >>>>> For example:
> > >>>>>
> > >>>>> Serialization/deserialization of strings in communication messages may
> > >>>>> be broken for some strings on nodes running in a JVM with a different
> > >>>>> encoding, as DirectByteBufferStreamImplV2 uses String#getBytes() to
> > >>>>> serialize strings - [1]
> > >>>>>
> > >>>>> Or the IgniteAuthenticationProcessor can compute different security
> > >>>>> IDs for the user on different nodes in this case - [2]
> > >>>>>
> > >>>>> What do you think about solving this problem globally by refusing to
> > >>>>> let nodes join if they run on JVMs with different encodings?
> > >>>>>
> > >>>>> As a result, we will be sure that all cluster nodes have the same
> > >>>>> encoding and all related problems will be solved.
> > >>>>>
> > >>>>> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
> > >>>>> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
> > >>>>>
> > >>>>> --
> > >>>>> Mikhail
> > >>>>>
> > >>>>>
> > >>>>
> > >>>> --
> > >>>> Sincerely yours, Ivan Daschinskiy
> > >>>
> > >>>
> > >>
> > >> --
> > >> Sincerely yours, Ivan Daschinskiy
> > >>
> > >
> > >
> > > --
> > >
> > > Best regards,
> > > Ivan Pavlukhin
> >
> >
>
> --
> Sincerely yours, Ivan Daschinskiy
>
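
To make the failure mode from the original message concrete, here is a minimal standalone sketch. It is not Ignite code: the class and the helpers serialize, deserialize and roundTrip are hypothetical names, and the two Charset arguments stand in for the default encodings of the sending and receiving JVMs. It also shows the surrogate corner case mentioned in the thread, relying on the documented behavior of String#getBytes(Charset), which substitutes the charset's replacement bytes for input it cannot encode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Ignite.
final class EncodingMismatchDemo {
    /** Stands in for String#getBytes() on a node whose default charset is senderDefault. */
    static byte[] serialize(String s, Charset senderDefault) {
        return s.getBytes(senderDefault);
    }

    /** Stands in for new String(byte[]) on a node whose default charset is receiverDefault. */
    static String deserialize(byte[] bytes, Charset receiverDefault) {
        return new String(bytes, receiverDefault);
    }

    /** Serializes on the "sender" and deserializes on the "receiver". */
    static String roundTrip(String s, Charset sender, Charset receiver) {
        return deserialize(serialize(s, sender), receiver);
    }

    public static void main(String[] args) {
        // Cyrillic text written with Unicode escapes so the source file encoding does not matter.
        String original = "\u041F\u0440\u0438\u0432\u0435\u0442";

        Charset utf8 = StandardCharsets.UTF_8;
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // Different "default" encodings on the two nodes: the string is corrupted.
        System.out.println(original.equals(roundTrip(original, utf8, latin1)));
        System.out.println(original.equals(roundTrip(original, latin1, utf8)));

        // Same explicit encoding on both sides: the string survives.
        System.out.println(original.equals(roundTrip(original, utf8, utf8)));

        // A valid surrogate pair (U+1F600) is ordinary UTF-8 input and round-trips.
        String pair = "\uD83D\uDE00";
        System.out.println(pair.equals(roundTrip(pair, utf8, utf8)));

        // A lone surrogate is not a character and cannot be encoded in UTF-8,
        // so getBytes replaces it and the round trip is lossy.
        String lone = "\uD800";
        System.out.println(lone.equals(roundTrip(lone, utf8, utf8)));
    }
}
```

Passing an explicit charset such as StandardCharsets.UTF_8 on both the writing and the reading side, as suggested in the thread, removes the dependency on the JVM default; only strings containing unpaired surrogates would still not round-trip.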