Slava, great ticket! I suppose we can add a feature flag to BPlusMetaIO: if the flag is absent or its value is false, we can rebuild the metastore during recovery, decoding the stored strings with the default system encoding and saving all of them back in UTF-8. After recovery, we should use UTF-8 by default.
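To make the idea a bit more concrete, here is a minimal sketch of the re-encoding step. It is only an illustration under assumptions: the class `MetastoreStringMigration`, both method names, and the boolean flag are hypothetical and are not existing Ignite code. The only real APIs used are `new String(byte[], Charset)` and `String#getBytes(Charset)`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Ignite: re-encodes string bytes written with a legacy charset. */
final class MetastoreStringMigration {
    /** Decodes bytes with the charset they were written in and re-encodes them as UTF-8. */
    static byte[] toUtf8(byte[] legacyBytes, Charset legacyCharset) {
        String decoded = new String(legacyBytes, legacyCharset);

        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Chooses the charset for reading stored strings.
     * The flag is an assumption: true means the store was already rewritten in UTF-8.
     */
    static Charset readCharset(boolean utf8Flag, Charset legacyDefault) {
        return utf8Flag ? StandardCharsets.UTF_8 : legacyDefault;
    }
}
```

During recovery this would be called with `Charset.defaultCharset()` as the legacy charset. That only gives the right result if the node is restarted with the same default charset the data was written with, which is exactly the case from IGNITE-16080, so the charset probably has to be known or configurable at migration time.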
Thu, 16 Dec 2021 at 13:35, Вячеслав Коптилин <slava.kopti...@gmail.com>:

> Hi folks,
>
> IMHO, we should do our best to fix all these places and should avoid using
> the default charset. In my understanding, this is only
>
> > The main question is - should we restrict the join of nodes with
> > different encodings or just fix all places where implicit default encoding
> > is used and specify the explicit one as Ivan Daschinsky suggested?
> Restricting the join of nodes is not a solution for all cases. You are in
> trouble even if you use a one-node cluster. Just change the default
> charset on your system and restart the node with existing PDS [1]
>
> > As for me, I'm expecting a way more problem with enforcing rule to fail,
> > rather than enforcing all components to use UTF-8
> Absolutely agree with Ivan.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-16080
>
> Thanks,
> S.
>
> Tue, 14 Dec 2021 at 10:52, Ivan Pavlukhin <vololo...@gmail.com>:
>
> > Do encodings in question somehow influence on actual stored data
> > (bytes)? If so, using an implicit platform encoding sounds quite
> > dangerous. Moving data between servers (or perhaps even rebalancing)
> > can lead to bad consequences. Anyways, IMHO an implicit encoding is
> > not good, but sensible default is quite robust.
> >
> > 2021-12-13 23:07 GMT+03:00, Ivan Daschinsky <ivanda...@gmail.com>:
> > > Unpaired surrogates are emoji symbols. One should be completely insane
> > > to use emojis in login.
> > >
> > > Mon, 13 Dec 2021, 21:30 Mikhail Petrov <pmgheap....@gmail.com>:
> > >
> > >> Ivan, strings with unpaired surrogate symbols are serialized and
> > >> deserialized by java UTF-8 decoder successfully but the result does
> > >> not match the initial string. It may result in that if the user's login
> > >> contains these symbols, it will be distorted after deserialization and
> > >> the user will not be able to log in. I understand that it is a quite
> > >> rare case.
> > >> Anyway, the way to solve this problem was introduced here -
> > >> https://issues.apache.org/jira/browse/IGNITE-3098
> > >>
> > >> Frankly, it is not the topic I would like to discuss now. The main
> > >> question is - should we restrict the join of nodes with different
> > >> encodings or just fix all places where implicit default encoding is
> > >> used and specify the explicit one as Ivan Daschinsky suggested?
> > >>
> > >> From my point of view, it is better to reject nodes with different
> > >> encodings (especially after Ilya Kasnacheev mentioned that we already
> > >> have a warning "Differing character encodings across cluster may lead
> > >> to erratic behavior"). It will help to avoid "erratic behavior", not
> > >> just warn about it. It is important since the problems related to
> > >> string encoding can occur in different components and the cause of
> > >> them is not always obvious.
> > >>
> > >> WDYT?
> > >>
> > >> On 13.12.2021 20:01, Ivan Pavlukhin wrote:
> > >> >> I guess Nikolay is talking about the problem with UTF-8 in case
> > >> >> string contains unpaired surrogate symbols
> > >> > Folks, give me a clue why it is a problem? Naively it seems to be a
> > >> > good restriction rather than problem. What problems can it cause in
> > >> > practice?
> > >> >
> > >> > 2021-12-13 16:32 GMT+03:00, Ilya Kasnacheev
> > >> > <ilya.kasnach...@gmail.com>:
> > >> >> Hello!
> > >> >>
> > >> >> We already have a warning about this, see
> > >> >> IgniteKernal.checkFileEncoding()
> > >> >>
> > >> >> Regards,
> > >> >> --
> > >> >> Ilya Kasnacheev
> > >> >>
> > >> >>
> > >> >> Mon, 13 Dec 2021 at 16:26, Ivan Daschinsky <ivanda...@gmail.com>:
> > >> >>
> > >> >>>>> But now multiple components
> > >> >>>>> independently serialize strings for their needs and use default
> > >> >>>>> encoding for this.
> > >> >>>>> For example DirectByteBufferStreamImplV2#writeString,
> > >> >>>>> MetaStorage#writeRaw and so on
> > >> >>> We should fix all of them.
> > >> >>>
> > >> >>>>> BinaryUtils#utf8BytesToStr
> > >> >>> Let's use this everywhere.
> > >> >>>
> > >> >>> As for me, I'm expecting a way more problem with enforcing rule to
> > >> >>> fail, rather than enforcing all components to use UTF-8
> > >> >>> Some weird cases (surrogate pairs) we can (I strongly believe it is
> > >> >>> OK) simply do not consider at all.
> > >> >>>
> > >> >>> Mon, 13 Dec 2021 at 15:15, Nikolay Izhikov <nizhi...@apache.org>:
> > >> >>>
> > >> >>>>> Does Java String support all unicode characters and particularly
> > >> >>>>> does it support more characters than UTF-8
> > >> >>>>
> > >> >>>> It’s not about Java, it’s about UTF-8 standard.
> > >> >>>>
> > >> >>>> Please, take a look at [1]
> > >> >>>>
> > >> >>>>> In November 2003, UTF-8 was restricted by RFC 3629 to match the
> > >> >>>>> constraints of the UTF-16 character encoding: explicitly
> > >> >>>>> prohibiting code points corresponding to the high and low
> > >> >>>>> surrogate characters removed more than 3% of the three-byte
> > >> >>>>> sequences, and ending at U+10FFFF removed more than 48% of the
> > >> >>>>> four-byte sequences and all five- and six-byte sequences.
> > >> >>>>
> > >> >>>> And [2]
> > >> >>>>
> > >> >>>>> The definition of UTF-8 prohibits encoding character numbers
> > >> >>>>> between U+D800 and U+DFFF, which are reserved for use with the
> > >> >>>>> UTF-16 encoding form (as surrogate pairs) and do not directly
> > >> >>>>> represent characters.
> > >> >>>>
> > >> >>>> Actually, we already have some modes to support this restriction
> > >> >>>> of UTF-8.
> > >> >>>> Please, take a look at BinaryUtils#utf8BytesToStr [3]
> > >> >>>>
> > >> >>>>
> > >> >>>> [1] https://en.wikipedia.org/wiki/UTF-8
> > >> >>>> [2] https://datatracker.ietf.org/doc/html/rfc3629
> > >> >>>> [3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387
> > >> >>>>
> > >> >>>>> On 13 Dec 2021, at 13:57, Ivan Pavlukhin <vololo...@gmail.com>
> > >> >>>>> wrote:
> > >> >>>>>> UTF-8 can’t encode all UNICODE characters.
> > >> >>>>> Nikolay, could you please elaborate? My understanding is that
> > >> >>>>> encoding we speak about matters for conversion from byte arrays
> > >> >>>>> to strings. Does Java String support all unicode characters and
> > >> >>>>> particularly does it support more characters than UTF-8 (I am not
> > >> >>>>> saying here that java String uses UTF-8)?
> > >> >>>>>
> > >> >>>>> 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky <ivanda...@gmail.com>:
> > >> >>>>>> UTF-8 is already a default encoding in our BinaryObject format.
> > >> >>>>>> So.... I am for unification.
> > >> >>>>>>
> > >> >>>>>> Mon, 13 Dec 2021 at 12:50, Nikolay Izhikov
> > >> >>>>>> <nizhi...@apache.org>:
> > >> >>>>>>
> > >> >>>>>>> Hello, Ivan.
> > >> >>>>>>>
> > >> >>>>>>> UTF-8 can’t encode all UNICODE characters.
> > >> >>>>>>>
> > >> >>>>>>>> On 13 Dec 2021, at 12:49, Ivan Daschinsky
> > >> >>>>>>>> <ivanda...@gmail.com> wrote:
> > >> >>>>>>>> Khm, maybe a better variant is to enforce all strings to be
> > >> >>>>>>>> encoded in UTF-8?
> > >> >>>>>>>> AFAIK multi OS cluster is a quite common case.
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>> Mon, 13 Dec 2021 at 11:36, Mikhail Petrov
> > >> >>>>>>>> <pmgheap....@gmail.com>:
> > >> >>>>>>>>> Igniters,
> > >> >>>>>>>>>
> > >> >>>>>>>>> Recently we faced the problem that if the cluster consists of
> > >> >>>>>>>>> nodes running in the JVM with different encodings, many
> > >> >>>>>>>>> issues arise.
> > >> >>>>>>>>> The root cause of the mentioned issues is components that use
> > >> >>>>>>>>> `String#getBytes()` and `new String(<byte array>)`, which
> > >> >>>>>>>>> rely on the system default encoding. Thus, if a string is
> > >> >>>>>>>>> deserialized on a node with a different encoding from the one
> > >> >>>>>>>>> that serialized it, the deserialized string can be different
> > >> >>>>>>>>> from the original one.
> > >> >>>>>>>>>
> > >> >>>>>>>>> For example:
> > >> >>>>>>>>>
> > >> >>>>>>>>> Serialization/deserialization of string in communication
> > >> >>>>>>>>> messages may be broken for some strings on nodes running in
> > >> >>>>>>>>> a JVM with a different encoding as
> > >> >>>>>>>>> DirectByteBufferStreamImplV2 uses String#getBytes() to
> > >> >>>>>>>>> serialize strings - [1]
> > >> >>>>>>>>>
> > >> >>>>>>>>> Or the IgniteAuthenticationProcessor can compute different
> > >> >>>>>>>>> security IDs for the user on different nodes in this
> > >> >>>>>>>>> case - [2]
> > >> >>>>>>>>>
> > >> >>>>>>>>> What do you think, if we solve this problem globally, by
> > >> >>>>>>>>> rejecting to join nodes that run on JVMs with different
> > >> >>>>>>>>> encodings?
> > >> >>>>>>>>>
> > >> >>>>>>>>> As a result, we will be sure that all cluster nodes have the
> > >> >>>>>>>>> same encoding and all related problems will be solved.
> > >> >>>>>>>>>
> > >> >>>>>>>>> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
> > >> >>>>>>>>> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
> > >> >>>>>>>>>
> > >> >>>>>>>>> --
> > >> >>>>>>>>> Mikhail
> > >> >>>>>>>>>
> > >> >>>>>>>> --
> > >> >>>>>>>> Sincerely yours, Ivan Daschinskiy
> > >> >>>>>>>
> > >> >>>>>> --
> > >> >>>>>> Sincerely yours, Ivan Daschinskiy
> > >> >>>>>>
> > >> >>>>> --
> > >> >>>>>
> > >> >>>>> Best regards,
> > >> >>>>> Ivan Pavlukhin
> > >> >>>>
> > >> >>> --
> > >> >>> Sincerely yours, Ivan Daschinskiy
> > >> >>>
> > >> >
> > >
> > --
> >
> > Best regards,
> > Ivan Pavlukhin
>

--
Sincerely yours, Ivan Daschinskiy
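For reference, the two effects discussed in this thread (a reader using a different charset than the writer, and unpaired surrogates not surviving UTF-8) can be reproduced with a small sketch. The class `EncodingMismatchDemo` and its method are hypothetical and exist only for illustration. The two charsets are passed explicitly to simulate two nodes with different defaults, instead of relying on `String#getBytes()` and `new String(<byte array>)`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class, not Ignite code. */
final class EncodingMismatchDemo {
    /** Simulates a write on a node with one default charset and a read on a node with another. */
    static String roundTrip(String s, Charset writer, Charset reader) {
        return new String(s.getBytes(writer), reader);
    }

    public static void main(String[] args) {
        String login = "caf\u00e9";

        // Same charset on both sides: the string survives.
        System.out.println(login.equals(
            roundTrip(login, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));

        // Writer uses ISO-8859-1, reader uses UTF-8: the last character is distorted.
        System.out.println(login.equals(
            roundTrip(login, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)));

        // Unpaired high surrogate: String#getBytes(Charset) substitutes a replacement
        // byte, so the round trip differs even when both sides use UTF-8.
        String unpaired = "user\uD800";
        System.out.println(unpaired.equals(
            roundTrip(unpaired, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
    }
}
```

The first check succeeds and the other two fail, which matches the two separate concerns above: mixed default encodings break ordinary non-ASCII strings, while unpaired surrogates are lost regardless of which node reads them.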