Hello!

We already have a warning about this; see IgniteKernal.checkFileEncoding().
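
To illustrate the idea of such a startup warning, here is a rough sketch. It is not the actual body of IgniteKernal.checkFileEncoding(); the class name DefaultCharsetCheck and the method warningFor are hypothetical, and the assumption is simply that UTF-8 is the expected default charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Ignite code: warn when the JVM default charset is not UTF-8.
public class DefaultCharsetCheck {
    /** Returns a warning message if the given charset is not UTF-8, or null if it is. */
    public static String warningFor(Charset cs) {
        if (StandardCharsets.UTF_8.equals(cs))
            return null;

        return "Default charset is " + cs.name() + ", expected UTF-8. Strings converted "
            + "with String#getBytes() or new String(byte[]) may differ between nodes.";
    }

    public static void main(String[] args) {
        // Charset.defaultCharset() is what String#getBytes() and new String(byte[]) rely on.
        String warn = warningFor(Charset.defaultCharset());

        System.out.println(warn == null ? "Default charset is UTF-8." : warn);
    }
}
```

A check like this only reports the problem on one node; it does not compare encodings across the cluster.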

Regards,
-- 
Ilya Kasnacheev


Mon, 13 Dec 2021 at 16:26, Ivan Daschinsky <ivanda...@gmail.com>:

> >> But now multiple components
> >> independently serialize strings for their needs and use the default
> >> encoding for this.
> >> For example, DirectByteBufferStreamImplV2#writeString,
> >> MetaStorage#writeRaw and so on
> We should fix all of them.
>
> >> BinaryUtils#utf8BytesToStr
> Let's use this everywhere.
>
> As for me, I expect far more problems from enforcing a rule that makes
> nodes fail to join than from making all components use UTF-8.
> Some unusual cases (surrogate pairs) we can simply leave out of
> consideration (I strongly believe that is OK).
>
> Mon, 13 Dec 2021 at 15:15, Nikolay Izhikov <nizhi...@apache.org>:
>
> > > Does Java String support all Unicode characters and, in particular,
> > > does it support more characters than UTF-8
> >
> > It’s not about Java, it’s about the UTF-8 standard.
> >
> > Please, take a look at [1]
> >
> > > In November 2003, UTF-8 was restricted by RFC 3629 to match the
> > > constraints of the UTF-16 character encoding: explicitly prohibiting code
> > > points corresponding to the high and low surrogate characters removed
> > > more than 3% of the three-byte sequences, and ending at U+10FFFF removed
> > > more than 48% of the four-byte sequences and all five- and six-byte
> > > sequences.
> >
> > And [2]
> >
> > > The definition of UTF-8 prohibits encoding character numbers between
> > > U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding
> > > form (as surrogate pairs) and do not directly represent characters.
> >
> > Actually, we already have some modes that support this restriction of UTF-8.
> > Please, take a look at BinaryUtils#utf8BytesToStr [3]
> >
> >
> > [1] https://en.wikipedia.org/wiki/UTF-8
> > [2] https://datatracker.ietf.org/doc/html/rfc3629
> > [3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387
> >
> > > On 13 Dec 2021, at 13:57, Ivan Pavlukhin <vololo...@gmail.com> wrote:
> > >
> > >> UTF-8 can’t encode all UNICODE characters.
> > >
> > > Nikolay, could you please elaborate? My understanding is that encoding
> > > we speak about matters for conversion from byte arrays to strings.
> > > Does Java String support all unicode characters and particularly does
> > > it support more characters than UTF-8 (I am not saying here that java
> > > String uses UTF-8)?
> > >
> > > 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky <ivanda...@gmail.com>:
> > >> UTF-8 is already the default encoding in our BinaryObject format. So
> > >> I am for unification.
> > >>
> > >> Mon, 13 Dec 2021 at 12:50, Nikolay Izhikov <nizhi...@apache.org>:
> > >>
> > >>> Hello, Ivan.
> > >>>
> > >>> UTF-8 can’t encode all UNICODE characters.
> > >>>
> > >>>> On 13 Dec 2021, at 12:49, Ivan Daschinsky <ivanda...@gmail.com> wrote:
> > >>>>
> > >>>> Hmm, maybe a better option is to enforce that all strings are encoded
> > >>>> in UTF-8?
> > >>>> AFAIK a multi-OS cluster is quite a common case.
> > >>>>
> > >>>>
> > >>>> Mon, 13 Dec 2021 at 11:36, Mikhail Petrov <pmgheap....@gmail.com>:
> > >>>>
> > >>>>> Igniters,
> > >>>>>
> > >>>>> Recently we faced the problem that if a cluster consists of nodes
> > >>>>> running in JVMs with different default encodings, many issues arise.
> > >>>>> The root cause of these issues is components that use
> > >>>>> `String#getBytes()` and `new String(<byte array>)`, which rely on the
> > >>>>> system default encoding. Thus, if a string is deserialized on a node
> > >>>>> with a different encoding from the one that serialized it, the
> > >>>>> deserialized string can differ from the original one.
> > >>>>>
> > >>>>> For example:
> > >>>>>
> > >>>>> Serialization/deserialization of strings in communication messages may
> > >>>>> be broken for some strings on nodes running in a JVM with a different
> > >>>>> encoding, as DirectByteBufferStreamImplV2 uses String#getBytes() to
> > >>>>> serialize strings - [1]
> > >>>>>
> > >>>>> Or the IgniteAuthenticationProcessor can compute different security IDs
> > >>>>> for the user on different nodes in this case - [2]
> > >>>>>
> > >>>>> What do you think about solving this problem globally, by rejecting
> > >>>>> nodes that run on JVMs with different encodings when they try to join?
> > >>>>>
> > >>>>> As a result, we will be sure that all cluster nodes have the same
> > >>>>> encoding and all related problems will be solved.
> > >>>>>
> > >>>>> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
> > >>>>> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
> > >>>>>
> > >>>>> --
> > >>>>> Mikhail
> > >>>>>
> > >>>>>
> > >>>>
> > >>>> --
> > >>>> Sincerely yours, Ivan Daschinskiy
> > >>>
> > >>>
> > >>
> > >> --
> > >> Sincerely yours, Ivan Daschinskiy
> > >>
> > >
> > >
> > > --
> > >
> > > Best regards,
> > > Ivan Pavlukhin
> >
> >
>
> --
> Sincerely yours, Ivan Daschinskiy
>
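
For reference, the two effects discussed in the thread can be reproduced with plain JDK calls. This is only a sketch: the class StringEncodingDemo and its roundTrip helper are hypothetical names, and the explicit Charset arguments stand in for the default encodings of two different nodes. It shows (a) a string written with one charset and read with another coming back changed, and (b) a lone surrogate being replaced with '?' by String#getBytes(UTF_8), while a valid surrogate pair round-trips.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not Ignite code.
public class StringEncodingDemo {
    /** Encodes with the writer's charset and decodes with the reader's charset. */
    public static String roundTrip(String s, Charset writer, Charset reader) {
        return new String(s.getBytes(writer), reader);
    }

    public static void main(String[] args) {
        // Cyrillic text written with Unicode escapes so the source file encoding does not matter.
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // Same charset on both sides: the string survives.
        String same = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("UTF-8 -> UTF-8 equal: " + same.equals(original));

        // Different charsets on the two sides: the string is corrupted.
        String mixed = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("UTF-8 -> ISO-8859-1 equal: " + mixed.equals(original));

        // A valid surrogate pair (U+1F600) is a regular code point and encodes to 4 bytes.
        String pair = new String(Character.toChars(0x1F600));
        System.out.println("Pair bytes: " + pair.getBytes(StandardCharsets.UTF_8).length);

        // A lone high surrogate has no UTF-8 encoding; getBytes substitutes the replacement byte.
        String lone = String.valueOf((char) 0xD800);
        String loneBack = roundTrip(lone, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("Lone surrogate survives: " + loneBack.equals(lone));
    }
}
```

Passing an explicit Charset to String#getBytes and to the String constructor is what removes the dependency on the JVM default; the lone-surrogate case is the only one where UTF-8 itself loses information.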
