Re: [Pharo-users] Class name with diacritic character and Pharo

Dominique Dartois Sun, 27 Jan 2019 14:24:04 -0800

Hi Sven.
Thank you for the time spend for your reply.
I have tried the same code in Pharo 6.1 (21.0) instead of 7.1.0 and I had NO 
problem.
It seems the class implementation for ZnUTF8Encoder is different.
Thanks again.
---
Dominique


> Le 27 janvier 2019 à 17:03, Sven Van Caekenberghe <s...@stfx.eu> a écrit :
> 
> 
> Hi Dominique,
> 
> > On 27 Jan 2019, at 11:40, Dominique Dartois <d...@dartois.org> wrote:
> > 
> > Hello all. 
> > If a use french diacritic character in a class name, the code runs but I 
> > can’t fileout the package nor save it with Monticello. 
> > For example, the C cedilla in the class name drive me to an 
> > ‘ZnInvalidUTF8:Illegal byte for utf-8 encoding' when filing out.
> > 
> > Is it a bug or a feature?
> > Thank you
> > 
> > <image.png>--- 
> > Dominique Dartois 
> 
> Thanks for reporting this. This is most definitely a bug, I can confirm its 
> occurrence.
> 
> I'm CC pharo-dev as this is quite important. This will be a long mail.
> 
> 
> This is one manifestation of a problem that has been present for quite a 
> while.
> 
> I'll start by describing what I did, what went well and where/how this fails, 
> some generic points, and two conceptual solutions (that need further 
> verification).
> 
> Like you, I created a new subclass:
> 
> Object subclass: #ClasseFrançaise
>       instanceVariableNames: ''
>       classVariableNames: ''
>       package: '_UnpackagedPackage'
> 
> With comment:
> 
> I am ClasseFrançaise.
> 
> Try:
> 
>       ClasseFrançaise new élève.
>       ClasseFrançaise new euro.
> 
> And two methods (in the 'test' protocol):
> 
> élève
>       ^ 'élève'
> 
> euro
>       ^ '€'
> 
> I added the euro sign (because that is encoded in UTF-8 with 3 bytes, not 2 
> like ç).
> Like you said, the system can cope with such class and method names and seems 
> to function fine.
> 
> Looking at the .changes file, the correct source code was appended:
> 
> ----SNAPSHOT----2019-01-26T23:36:18.548555+01:00 work.image priorSource: 
> 339848!
> 
> Object subclass: #ClasseFrançaise
>         instanceVariableNames: ''
>         classVariableNames: ''
>         package: '_UnpackagedPackage'!
> !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:25' prior: 0!
> I am ClasseFrançaise.!
> !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 
> 12:26'!
> élève
>         ^ 'élève'! !
> !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:27' prior: 
> 33898360!
> I am ClasseFrançaise.
> 
> Try:
> 
>         ClasseFrançaise new élève.
>         ClasseFrançaise new euro.
> !
> !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 
> 12:27'!
> euro
>       ^ '€'! !
> 
> 
> Doing a file out (or otherwise saving the source code) fails. The reason is 
> an incorrect manipulation of this source file while looking for what is 
> called the method preamble, in SourcFileArray>>#getPreambleFrom:at: position
> 
> An programmatic way to invoke the same error is by doing
> 
> (ClasseFrançaise>>#élève) timeStamp.
> (ClasseFrançaise>>#élève) author.
> 
> Both fail with the same error.
> 
> 
> The source code of methods is (currently) stored in a .sources or .changes 
> file. CompiledMethods know their source pointer, an offset in one of these 
> files. Right before the place where the source starts is a preamble that 
> contains some meta information (including the author and timestamp). To 
> access that preamble, the source code pointer is moved backwards to the 
> beginning of the preamble (which begins and ends with a !).
> 
> 
> The current approach fails in the presence of non-ASCII characters. More 
> specifically because of a mixup between the concept of byte position and 
> character position when using UTF-8, a variable length encoding (both the 
> .changes and the .sources are UTF-8 encoded).
> 
> For example, consider
> 
> 'à partir de 10 €' size. "16"
> 'à partir de 10 €' utf8Encoded size. "19"
> 
> So although the string contains 16 characters, it is encoded as 19 bytes, à 
> using 2 bytes and € using 3 bytes. In general, moving backwards or forwards 
> in UTF-8 encoded bytes cannot be done without understanding UTF-8 itself.
> 
> ZnUTF8Encoder can do both (moving forward is #nextFromStream: while moving 
> backwards is #backOnStream:). However, ZnUTF8Encoder is also strict: it will 
> signal an error when forced to operate in between encoded characters, which 
> is what happens here.
> 
> It is thus not possible to move to arbitrary bytes positions and assume/hope 
> to always arrive on the correct character boundaries and it is also wrong to 
> take the difference between two byte positions as the count of characters 
> present (since their encoding is of variable length).
> 
> SourcFileArray>>#getPreambleFrom:at: is doing both of these wrong (but gets 
> away with it in 99.99% of all cases since very few people name their classes 
> like that).
> 
> There are two solutions: operate mostly on the byte level or operate 
> correctly on the character level. Here are two conceptual solutions (you must 
> execute either solution 1 or 2, not both), with two different inputs.
> 
> 
> src := '!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe 
> 1/27/2019 12:27''!
> euro
>       ^ ''€''! !'.
> 
> "startPosition := 83"
> 
> str := ZnCharacterReadStream on: (src utf8Encoded readStream).
> str position: 83. "at start of euro, the methods source string"
> str upToEnd.
> 
> str position: (83 - 3). "before ! before euro"
> 
> "find the previous ! before position"
> position := str position.
> binary := str wrappedStream.
> encoder := str encoder.
> 
> "solution 1"
> [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] 
> whileTrue: [ position := position - 1 ].
> position.
> encoder decodeBytes: (binary next: 80 - position).
> 
> "solution 2"
> count := 0.
> [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [ 
>       encoder backOnStream: binary; backOnStream: binary. count := count + 1 
> ].
> str position.
> str next: count.
> 
> 
> Same code, different input (and starting position):
> 
> 
> src := '!ABC!ClasseFrançaise methodsFor: ''test'' stamp: 
> ''SvenVanCaekenberghe 1/27/2019 12:27''!
> euro
>       ^ ''€''! !'.
> 
> "startPosition := 87"
> 
> str := ZnCharacterReadStream on: (src utf8Encoded readStream).
> str position: 87. "at start of euro, the methods source string"
> str upToEnd.
> 
> str position: (87 - 3). "before ! before euro"
> 
> "find the previous ! before position"
> position := str position.
> binary := str wrappedStream.
> encoder := str encoder.
> 
> "solution 1"
> [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] 
> whileTrue: [ position := position - 1 ].
> position.
> encoder decodeBytes: (binary next: 80 - position).
> 
> "solution 2"
> count := 0.
> [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [ 
>       encoder backOnStream: binary; backOnStream: binary. count := count + 1 
> ].
> str position.
> str next: count.
> 
> 
> Solution 1 (at the byte level) works because $! (ASCII code 33) is also 
> encoded as such in UTF-8 and this byte pattern cannot be part of a 2, 3 or 4 
> byte UTF-8 encoded character. The final byte segment must then be given to 
> the proper encoder to turn it into characters.
> 
> Solution 2 (at the character level) works because it essentially counts 
> characters while moving backwards correctly (it moves back 2 steps each time 
> because #next moves 1 step forward, while #peek would not help).
> 
> Note that in Solution 1 the moving position is held in an external variable, 
> while in Solution 2 it is held inside the stream.
> 
> 
> Implementing either solution requires more internal access of the streams 
> (#wrappedStream), so SourcFileArray>>#getPreambleFrom:at: should at least be 
> moved to SourceFile, IMHO.
> 
> Obviously this is a sensitive/critical area to touch.
> 
> We will also need a regression unit test.
> 
> On a higher design level, I think that CompiledMethod>>#timeStamp and 
> CompiledMethod>>#author should be cached at the instance level (a unix 
> timestamp and a symbol would cost very little).
> 
> Sven
> 
> 
> 
> 
> 
> 
>

Re: [Pharo-users] Class name with diacritic character and Pharo

Reply via email to