Hi Sven. Thank you for the time spend for your reply. I have tried the same code in Pharo 6.1 (21.0) instead of 7.1.0 and I had NO problem. It seems the class implementation for ZnUTF8Encoder is different. Thanks again. --- Dominique
> Le 27 janvier 2019 à 17:03, Sven Van Caekenberghe <s...@stfx.eu> a écrit : > > > Hi Dominique, > > > On 27 Jan 2019, at 11:40, Dominique Dartois <d...@dartois.org> wrote: > > > > Hello all. > > If a use french diacritic character in a class name, the code runs but I > > can’t fileout the package nor save it with Monticello. > > For example, the C cedilla in the class name drive me to an > > ‘ZnInvalidUTF8:Illegal byte for utf-8 encoding' when filing out. > > > > Is it a bug or a feature? > > Thank you > > > > <image.png>--- > > Dominique Dartois > > Thanks for reporting this. This is most definitely a bug, I can confirm its > occurrence. > > I'm CC pharo-dev as this is quite important. This will be a long mail. > > > This is one manifestation of a problem that has been present for quite a > while. > > I'll start by describing what I did, what went well and where/how this fails, > some generic points, and two conceptual solutions (that need further > verification). > > Like you, I created a new subclass: > > Object subclass: #ClasseFrançaise > instanceVariableNames: '' > classVariableNames: '' > package: '_UnpackagedPackage' > > With comment: > > I am ClasseFrançaise. > > Try: > > ClasseFrançaise new élève. > ClasseFrançaise new euro. > > And two methods (in the 'test' protocol): > > élève > ^ 'élève' > > euro > ^ '€' > > I added the euro sign (because that is encoded in UTF-8 with 3 bytes, not 2 > like ç). > Like you said, the system can cope with such class and method names and seems > to function fine. > > Looking at the .changes file, the correct source code was appended: > > ----SNAPSHOT----2019-01-26T23:36:18.548555+01:00 work.image priorSource: > 339848! > > Object subclass: #ClasseFrançaise > instanceVariableNames: '' > classVariableNames: '' > package: '_UnpackagedPackage'! > !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:25' prior: 0! > I am ClasseFrançaise.! > !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 > 12:26'! > élève > ^ 'élève'! ! > !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:27' prior: > 33898360! > I am ClasseFrançaise. > > Try: > > ClasseFrançaise new élève. > ClasseFrançaise new euro. > ! > !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 > 12:27'! > euro > ^ '€'! ! > > > Doing a file out (or otherwise saving the source code) fails. The reason is > an incorrect manipulation of this source file while looking for what is > called the method preamble, in SourcFileArray>>#getPreambleFrom:at: position > > An programmatic way to invoke the same error is by doing > > (ClasseFrançaise>>#élève) timeStamp. > (ClasseFrançaise>>#élève) author. > > Both fail with the same error. > > > The source code of methods is (currently) stored in a .sources or .changes > file. CompiledMethods know their source pointer, an offset in one of these > files. Right before the place where the source starts is a preamble that > contains some meta information (including the author and timestamp). To > access that preamble, the source code pointer is moved backwards to the > beginning of the preamble (which begins and ends with a !). > > > The current approach fails in the presence of non-ASCII characters. More > specifically because of a mixup between the concept of byte position and > character position when using UTF-8, a variable length encoding (both the > .changes and the .sources are UTF-8 encoded). > > For example, consider > > 'à partir de 10 €' size. "16" > 'à partir de 10 €' utf8Encoded size. "19" > > So although the string contains 16 characters, it is encoded as 19 bytes, à > using 2 bytes and € using 3 bytes. In general, moving backwards or forwards > in UTF-8 encoded bytes cannot be done without understanding UTF-8 itself. > > ZnUTF8Encoder can do both (moving forward is #nextFromStream: while moving > backwards is #backOnStream:). However, ZnUTF8Encoder is also strict: it will > signal an error when forced to operate in between encoded characters, which > is what happens here. > > It is thus not possible to move to arbitrary bytes positions and assume/hope > to always arrive on the correct character boundaries and it is also wrong to > take the difference between two byte positions as the count of characters > present (since their encoding is of variable length). > > SourcFileArray>>#getPreambleFrom:at: is doing both of these wrong (but gets > away with it in 99.99% of all cases since very few people name their classes > like that). > > There are two solutions: operate mostly on the byte level or operate > correctly on the character level. Here are two conceptual solutions (you must > execute either solution 1 or 2, not both), with two different inputs. > > > src := '!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe > 1/27/2019 12:27''! > euro > ^ ''€''! !'. > > "startPosition := 83" > > str := ZnCharacterReadStream on: (src utf8Encoded readStream). > str position: 83. "at start of euro, the methods source string" > str upToEnd. > > str position: (83 - 3). "before ! before euro" > > "find the previous ! before position" > position := str position. > binary := str wrappedStream. > encoder := str encoder. > > "solution 1" > [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] > whileTrue: [ position := position - 1 ]. > position. > encoder decodeBytes: (binary next: 80 - position). > > "solution 2" > count := 0. > [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [ > encoder backOnStream: binary; backOnStream: binary. count := count + 1 > ]. > str position. > str next: count. > > > Same code, different input (and starting position): > > > src := '!ABC!ClasseFrançaise methodsFor: ''test'' stamp: > ''SvenVanCaekenberghe 1/27/2019 12:27''! > euro > ^ ''€''! !'. > > "startPosition := 87" > > str := ZnCharacterReadStream on: (src utf8Encoded readStream). > str position: 87. "at start of euro, the methods source string" > str upToEnd. > > str position: (87 - 3). "before ! before euro" > > "find the previous ! before position" > position := str position. > binary := str wrappedStream. > encoder := str encoder. > > "solution 1" > [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] > whileTrue: [ position := position - 1 ]. > position. > encoder decodeBytes: (binary next: 80 - position). > > "solution 2" > count := 0. > [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [ > encoder backOnStream: binary; backOnStream: binary. count := count + 1 > ]. > str position. > str next: count. > > > Solution 1 (at the byte level) works because $! (ASCII code 33) is also > encoded as such in UTF-8 and this byte pattern cannot be part of a 2, 3 or 4 > byte UTF-8 encoded character. The final byte segment must then be given to > the proper encoder to turn it into characters. > > Solution 2 (at the character level) works because it essentially counts > characters while moving backwards correctly (it moves back 2 steps each time > because #next moves 1 step forward, while #peek would not help). > > Note that in Solution 1 the moving position is held in an external variable, > while in Solution 2 it is held inside the stream. > > > Implementing either solution requires more internal access of the streams > (#wrappedStream), so SourcFileArray>>#getPreambleFrom:at: should at least be > moved to SourceFile, IMHO. > > Obviously this is a sensitive/critical area to touch. > > We will also need a regression unit test. > > On a higher design level, I think that CompiledMethod>>#timeStamp and > CompiledMethod>>#author should be cached at the instance level (a unix > timestamp and a symbol would cost very little). > > Sven > > > > > > >