[Pharo-users] Re: How to handle (recover) from a ZnInvalidUTF8: Illegal continuation byte for utf-8 encoding error?

Sven Van Caekenberghe Tue, 20 Jul 2021 07:00:05 -0700

There is ZnCharacterEncoder knownEncodingIdentifiers.

You either provide an identifier from this list (as string or symbol) or an 
instance (the argument gets sent #asZnCharacterEncoder if you want to know).
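
As a rough sketch of the above, to be evaluated in a Playground (this assumes a Pharo image with Zinc loaded; the identifiers are the ones already mentioned in this thread):

```smalltalk
"Inspect the collection of identifiers that Zinc recognises."
ZnCharacterEncoder knownEncodingIdentifiers.

"A String or a Symbol is turned into an encoder instance
 by sending it #asZnCharacterEncoder."
'iso-8859-1' asZnCharacterEncoder.
#latin1 asZnCharacterEncoder.

"Convenience class-side accessors also return encoder instances."
ZnCharacterEncoder utf8.
```

Any of these values (identifier or instance) can be passed wherever an encoding argument is expected.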


Most text editors will tell you the encoding they are using to read your file 
and you can use that to inspect the contents.

If you want, you can send me such a file privately.

Yes, you can access the encoder from the character read stream to configure it 
further. Or you do it upfront as an instance instead of an identifier.
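
A minimal sketch of both options, reusing the firmEfs file reference and the 'latin1' identifier from the quoted message below (the variable names are only illustrative, and closing the stream is left to the caller):

```smalltalk
| encoder stream |

"Option 1: create the stream from an identifier, then configure
 the encoder that the character read stream holds."
stream := firmEfs readStreamEncoded: 'latin1'.
stream encoder beLenient.

"Option 2: build and configure the encoder upfront,
 then pass the instance instead of an identifier."
encoder := 'latin1' asZnCharacterEncoder.
encoder beLenient.
stream := firmEfs readStreamEncoded: encoder.
```

Either way the stream ends up with a lenient encoder, so the rest of the reading code (for example a #nextLine loop) does not need to change.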

> On 20 Jul 2021, at 15:47, Tim Mackinnon <tim@testit.works> wrote:
> 
> Hey thanks guys - so looking at readStreamEncoded: - how do I know what the 
> valid encodings are? Skimming those docs Sven referenced, I can start to 
> pick out some - but is there a list? I see that method parameter says 
> “anEncoding” but the type hint on that is misleading as it seems like it’s a 
> String or is it a Symbol? If I search for Encoder classes - I do find 
> ZnCharacterEncoder - and it has class methods for latin1, utf8, ascii - so is 
> this the definitive list? And should the encoding strings used in those 
> methods be constants or something I can reference in my code?
> 
> Gosh - this raises a whole host of things I just naively assumed happened for 
> me.
> 
> So it looks like the file giving me issues - seems to have characters like £ 
> or ¬ in it. So I’m wondering how I know what the proper encoding format would 
> be (I think these files were written out with some PHP app) - is it just a 
> trial and error thing?
> 
> I tried changing my code to:
> 
> details parseStream: (firmEfs readStreamEncoded: 'iso-8859-1'). - and other 
> variants like 'ASCII' and 'latin1' - and this then gives me another error:
> "ZnCharacterEncodingError: Character Unicode code point outside encoder range"
> 
> So it does sound like I have a file that isn’t conforming to known standards 
> - and I guess I have to use #beLenient option.
> 
> Sven - In the examples for using #beLenient - you seem to show something that 
> assumes you will iterate with Do - as my existing code takes a stream, that 
> it wants to do a #nextLine on - would it be bad to do something like this:
> 
> efsStream := (firmEfs readStreamEncoded: 'latin1').
> efsStream encoder beLenient.
> 
> details parseStream: efsStream.
> 
> That is - get the encoder from my Stream and make it lenient? 
>        
> Appreciate the pointers on this guys - I’m definitely learning something new 
> here.
> 
> Tim
> 
>> On 20 Jul 2021, at 12:11, Guillermo Polito <guillermopol...@gmail.com> wrote:
>> 
>> 
>> 
>>> On 20 Jul 2021, at 11:45, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>>> 
>>> 
>>> 
>>>> On 20 Jul 2021, at 11:03, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>>>> 
>>>> Hi Tim,
>>>> 
>>>> An introduction to this part of the system is in 
>>>> https://ci.inria.fr/pharo-contribution/job/EnterprisePharoBook/lastSuccessfulBuild/artifact/book-result/Zinc-Encoding-Meta/Zinc-Encoding-Meta.html
>>>>  [Character Encoding and Resource Meta Description] from the "Enterprise 
>>>> Pharo" book.
>>>> 
>>>> The error means that a file that you try to read as UTF-8 does contain 
>>>> things that are invalid with respect to the UTF-8 standard.
>>>> 
>>>> Are you sure the file is in UTF-8, maybe it is in ASCII, Latin-1 or 
>>>> something else ?
>>>> 
>>>> It is possible to customise the encoding to something different than the 
>>>> default UTF-8. For non-UTF encoders, there is a strict/lenient option to 
>>>> disallow/allow illegal stuff (but then you will get these in your strings).
>>>> 
>>>> I can show you how to do that if you want.
>>> 
>>> '/var/log/system.log' asFileReference readStreamDo: [ :in | in upToEnd ].
>>> 
>>> '/var/log/system.log' asFileReference binaryReadStreamDo: [ :in |
>>>     (ZnCharacterReadStream on: in encoding: #ascii) upToEnd ].
>>> 
>>> '/var/log/system.log' asFileReference binaryReadStreamDo: [ :in |
>>>     (ZnCharacterReadStream on: in encoding: ZnCharacterEncoder ascii 
>>> beLenient) upToEnd ].
>> 
>> There is also readStreamEncoded:[do:], which is a bit more concise but does 
>> the same :)
>> 
>>> 
>>> HTH
>>> 
>>>> Sven
>>>> 
>>>>> On 20 Jul 2021, at 10:31, Tim Mackinnon <tim@testit.works> wrote:
>>>>> 
>>>>> Hi - I’m doing a bit of log file processing with Pharo - and I’ve hit an 
>>>>> unexpected error and am wondering what the best way to approach it is.
>>>>> 
>>>>> It seems that I have a log file that has unexpected characters, and so my 
>>>>> readStream loop that reads lines gets an error: "ZnInvalidUTF8: Illegal 
>>>>> continuation byte for utf-8 encoding”.
>>>>> 
>>>>> For some reason this file (unlike my others) seems to contain characters 
>>>>> that it shouldn’t - but what is the best way for me to continue 
>>>>> processing? Should I be opening my files in a different way - or can I 
>>>>> resume the error somehow- I’m not familiar with this area of Pharo and am 
>>>>> after a bit of advice.
>>>>> 
>>>>> My code is like this (and I get the error when doing nextLine)
>>>>> 
>>>>> 
>>>>> parseStream: aFileStream with: aBlock
>>>>>   | line items |
>>>>>   [ (line := aFileStream nextLine) isNil ]
>>>>>           whileFalse: [ 
>>>>>                   items := $/ split: line.
>>>>>                   items size = 3 ifTrue: [aBlock value: items]]
>>>>> 
>>>>> My stream is created like this:
>>>>> 
>>>>> firmEfs := (pathName , '/' , firmName , '_files') asFileReference.
>>>>> details parseStream: firmEfs readStream.
>>>>> 
>>>>> 
>>>>> Should I be opening the stream a bit differently - or can I catch that 
>>>>> encoding error and resume it with some safe character?
>>>>> 
>>>>> Thanks for any help.
>>>>> 
>>>>> Tim
>
