> On 09 May 2015, at 02:18, PBKResearch <pe...@pbkresearch.co.uk> wrote:
> 
> Sven
> 
> Many thanks for the quick response. I always like to try to solve problems 
> myself before appealing for help, so I had worked out what was wrong, but did 
> not know how to tell Zinc to use a specific coding. I had tried by reading 
> through your very full note on Zinc, but did not find the trick you describe 
> - which works perfectly, of course.

Good. Yes, this is a more recent addition.

> It seems unfortunate that Zinc does not use the coding specified in the html 
> head. Evidently browsers like Firefox must do it, since the page displays 
> correctly. If it cannot be done, I think it would be helpful to reconsider 
> the error message produced when the user is dumped out, because in this 
> context it is misleading. I spent some time tracing debugger output, trying 
> to work out what was wrong with the UTF-8, before I spotted that one of the 
> bytes was displayed in character form as $ö, and began to suspect it might be 
> a different coding; I finally confirmed this by reading the page source in 
> Firefox.

Zn deals with HTTP, not with HTML. These are totally different things; a 
browser obviously does both. But even then there is no easy way to determine 
the encoding, apart from trying. Consider these two byte arrays:

#[85 84 70 56 58 32 68 101 114 32 87 101 103 32 122 117 114 32 72 195 182 108 
108 101 32 105 115 116 32 109 105 116 32 103 117 116 101 110 32 86 111 114 115 
195 164 116 122 101 110 32 103 101 112 102 108 97 115 116 101 114 116 46]

#[73 83 79 56 56 53 57 49 58 32 68 101 114 32 87 101 103 32 122 117 114 32 72 
246 108 108 101 32 105 115 116 32 109 105 116 32 103 117 116 101 110 32 86 111 
114 115 228 116 122 101 110 32 103 101 112 102 108 97 115 116 101 114 116 46]

Each one states, in its first few bytes, how it should be decoded! But you can 
only read that label once you have already decoded the bytes.

The GT tools make this challenge easy because there is a tab that tries both 
encodings, but in general this is hard to solve (efficiently).
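
A minimal sketch of the "just try" approach, applied to the two byte arrays 
above. It assumes that ZnUTF8Encoder and ZnByteEncoder both understand 
#decodeBytes: and that a failed decode signals some kind of Error; the block 
name decodeByTrying is made up for this illustration and is not part of Zn.

```smalltalk
| utf8Bytes latin1Bytes candidates decodeByTrying |

"The two byte arrays from above, unchanged."
utf8Bytes := #[85 84 70 56 58 32 68 101 114 32 87 101 103 32 122 117 114 32 72 195 182 108
  108 101 32 105 115 116 32 109 105 116 32 103 117 116 101 110 32 86 111 114 115
  195 164 116 122 101 110 32 103 101 112 102 108 97 115 116 101 114 116 46].
latin1Bytes := #[73 83 79 56 56 53 57 49 58 32 68 101 114 32 87 101 103 32 122 117 114 32 72
  246 108 108 101 32 105 115 116 32 109 105 116 32 103 117 116 101 110 32 86 111
  114 115 228 116 122 101 110 32 103 101 112 102 108 97 115 116 101 114 116 46].

"Candidate encoders, strictest first. UTF-8 rejects malformed input, while a
 single-byte encoding will happily decode UTF-8 bytes into the wrong characters."
candidates := { ZnUTF8Encoder new. ZnByteEncoder iso88591 }.

"Hypothetical helper: answer the first decoding that does not signal an error,
 or nil when every candidate fails."
decodeByTrying := [ :bytes |
  | result |
  result := nil.
  candidates do: [ :encoder |
    result ifNil: [
      result := [ encoder decodeBytes: bytes ]
        on: Error
        do: [ :error | nil ] ] ].
  result ].

"Should answer 'UTF8: Der Weg zur Hölle ist mit guten Vorsätzen gepflastert.'"
decodeByTrying value: utf8Bytes.

"UTF-8 should fail on byte 246, so the ISO-8859-1 decoding is answered instead."
decodeByTrying value: latin1Bytes.
```

The order of the candidates matters: trying ISO-8859-1 first would accept the 
UTF-8 bytes as well and answer 'HÃ¶lle' instead of 'Hölle'. That is why this 
is only a heuristic, and why every extra candidate costs another pass over the 
data.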

But since Zn does not do HTML, reading the charset from the HTML head will 
never be added at that level.

I will think about the error message; it might indeed be useful to tell the 
user that a default encoding was chosen.

> Thanks again for your help.

You're welcome.

> Peter Kenny
> 
> -----Original Message-----
> From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of 
> Sven Van Caekenberghe
> Sent: 08 May 2015 20:04
> To: Any question about pharo is welcome
> Subject: Re: [Pharo-users] Problem using Zinc in Pharo 4 (Moose 5.1)
> 
> Peter,
> 
> Thanks for the URL, it makes it much easier to help you.
> 
> The answer is easy: the server is incorrect, it serves a specific encoding 
> without saying so.
> 
> Consider:
> 
> (ZnClient new
>   head: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk';
>   response) contentType.
> 
> => 'text/html'
> 
> If no charset/encoding is specified, the modern default is UTF-8, so Zn tries 
> that but fails.
> 
> You can change the default for unspecified encoding as follows:
> 
> ZnDefaultCharacterEncoder
>   value: ZnByteEncoder iso88591
>   during: [
>     ZnClient new
>       get: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk' ].
> 
> The server should have used the following mime type to avoid the confusion:
> 
> ZnMimeType textHtml charSet: #iso88591
> 
>  => 'text/html;charset=iso88591'
> 
> HTH,
> 
> Sven
> 
> PS: the encoding declared inside the document cannot be used because (1) Zn 
> does no interpretation of document contents and (2) at that point it is too 
> late: the contents have already been converted from bytes to characters.
> 
>> On 08 May 2015, at 18:51, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>> 
>> Hello
>> 
>> I have been trying to use Soup class>> fromUrl: to access the contents of a 
>> web page. It halts with a message from Zinc about malformed UTF-8. The page 
>> displays perfectly in Firefox, so I copied the page source from there to a 
>> local file and tried to read it from there. Again a message from Zinc: 
>> 'Invalid utf8 input detected'. It’s strange, because the page is not in 
>> UTF-8. The head contains: <meta content="text/html; charset=ISO-8859-1" 
>> http-equiv="Content-Type">. I have tried to find how to specify the 
>> character set in reading files with Zinc, but without success.*
>> 
>> If it’s relevant, I am using Pharo4.0 Latest update: #40613, downloaded two 
>> days ago. The address of the web page is: 
>> http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk.
>>  Other pages from the same source are loaded and analysed with no problem. 
>> Processing this page seems to go off course as soon as it encounters the 
>> character code 246, which is a correct o-umlaut in ISO-8859-1.
>> 
>> Any advice gratefully received.
>> 
>> Peter Kenny
>> 
>> *I would be happy with advice to RTFM, if someone would point out the 
>> relevant bit of the FM.
> 
> 
> 

