Sven

Many thanks for the quick response. I always like to try to solve problems 
myself before appealing for help, so I had worked out what was wrong, but did 
not know how to tell Zinc to use a specific encoding. I had read through your 
very full note on Zinc, but did not find the trick you describe, which works 
perfectly, of course.

It seems unfortunate that Zinc does not use the encoding specified in the HTML 
head. Evidently browsers like Firefox do, since the page displays correctly. 
If that cannot be done, I think it would be helpful to reconsider the error 
message produced when the user is dumped out, because in this context it is 
misleading. I spent some time tracing debugger output, trying to work out what 
was wrong with the UTF-8, before I spotted that one of the bytes was displayed 
in character form as $ö and began to suspect a different encoding; I finally 
confirmed this by reading the page source in Firefox.

Thanks again for your help.

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of 
Sven Van Caekenberghe
Sent: 08 May 2015 20:04
To: Any question about pharo is welcome
Subject: Re: [Pharo-users] Problem using Zinc in Pharo 4 (Moose 5.1)

Peter,

Thanks for the URL, it makes it much easier to help you.

The answer is easy: the server is incorrect, it serves a specific encoding 
without saying so.

Consider:

(ZnClient new
   head: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk';
   response) contentType.

 => 'text/html'

If no charset/encoding is specified, the modern default is UTF-8, so Zn tries 
that but fails.
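
To illustrate why the UTF-8 attempt fails, here is a minimal sketch. The byte array is a made-up example, not data taken from the page, and the sketch assumes that ZnUTF8Encoder and ZnByteEncoder both respond to decodeBytes: as they do in current Zinc. Byte 246 is ö in ISO-8859-1, but in UTF-8 it cannot be followed by plain ASCII bytes.

```smalltalk
| bytes |

"Hypothetical sample: the word 'Hören' encoded as ISO-8859-1.
 Byte 246 is the o-umlaut."
bytes := #[72 246 114 101 110].

"Decoding as UTF-8 signals an error, because 246 is not followed
 by valid continuation bytes. The handler just returns the message text."
[ ZnUTF8Encoder new decodeBytes: bytes ]
  on: Error
  do: [ :exception | exception messageText ].

"Decoding the same bytes as ISO-8859-1 succeeds and answers a String."
ZnByteEncoder iso88591 decodeBytes: bytes.
```

Evaluating the two expressions separately in a Playground should show the error in the first case and a readable string in the second.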

You can change the default for unspecified encoding as follows:

ZnDefaultCharacterEncoder
  value: ZnByteEncoder iso88591
  during: [
    ZnClient new
      get: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk' ].

The server should have used the following mime type to avoid the confusion:

ZnMimeType textHtml charSet: #iso88591
 
  => 'text/html;charset=iso88591'

HTH,

Sven

PS: the encoding declared inside the document cannot be used because (1) no 
interpretation inside documents is done and (2) at that point it is too late: 
the contents have already been converted from bytes to characters.

> On 08 May 2015, at 18:51, PBKResearch <pe...@pbkresearch.co.uk> wrote:
> 
> Hello
>  
> I have been trying to use Soup class>> fromUrl: to access the contents of a 
> web page. It halts with a message from Zinc about malformed UTF-8. The page 
> displays perfectly in Firefox, so I copied the page source from there to a 
> local file and tried to read it from there. Again a message from Zinc: 
> 'Invalid utf8 input detected'. It’s strange, because the page is not in 
> UTF-8. The head contains: <meta content="text/html; charset=ISO-8859-1" 
> http-equiv="Content-Type">. I have tried to find how to specify the character 
> set in reading files with Zinc, but without success.*
>  
> If it’s relevant, I am using Pharo4.0 Latest update: #40613, downloaded two 
> days ago. The address of the web page is: 
> http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk.
>  Other pages from the same source are loaded and analysed with no problem. 
> Processing this page seems to go off course as soon as it encounters the 
> character code 246, which is a correct o-umlaut in ISO-8859-1.
>  
> Any advice gratefully received.
>  
> Peter Kenny
>  
> *I would be happy with advice to RTFM, if someone would point out the 
> relevant bit of the FM.


