Sven,

Many thanks for the quick response. I always like to try to solve problems myself before appealing for help, so I had worked out what was wrong, but did not know how to tell Zinc to use a specific encoding. I had read through your very full note on Zinc, but did not find the trick you describe, which works perfectly, of course.

It seems unfortunate that Zinc does not use the encoding specified in the HTML head. Evidently browsers like Firefox must do it, since the page displays correctly. If it cannot be done, I think it would be helpful to reconsider the error message produced when the user is dumped out, because in this context it is misleading. I spent some time tracing debugger output, trying to work out what was wrong with the UTF-8, before I spotted that one of the bytes was displayed in character form as $ö, and began to suspect it might be a different encoding; I finally confirmed this by reading the page source in Firefox.

Thanks again for your help.

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of Sven Van Caekenberghe
Sent: 08 May 2015 20:04
To: Any question about pharo is welcome
Subject: Re: [Pharo-users] Problem using Zinc in Pharo 4 (Moose 5.1)

Peter,

Thanks for the URL, it makes it much easier to help you.

The answer is easy: the server is incorrect, it serves a specific encoding without saying so. Consider:

  (ZnClient new
     head: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk';
     response) contentType.

  => 'text/html'

If no charset/encoding is specified, the modern default is UTF-8, so Zn tries that but fails. You can change the default for unspecified encoding as follows:

  ZnDefaultCharacterEncoder
    value: ZnByteEncoder iso88591
    during: [
      ZnClient new
        get: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk' ].

The server should have used the following mime type to avoid the confusion:

  ZnMimeType textHtml charSet: #iso88591

  => 'text/html;charset=iso88591'

HTH,

Sven

PS: the encoding inside the document cannot be used because (1) no interpretation inside documents is done and (2) at that point it is too late, the contents are already converted from bytes to characters.

> On 08 May 2015, at 18:51, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>
> Hello
>
> I have been trying to use Soup class>> fromUrl: to access the contents of a
> web page. It halts with a message from Zinc about malformed UTF-8. The page
> displays perfectly in Firefox, so I copied the page source from there to a
> local file and tried to read it from there. Again a message from Zinc:
> 'Invalid utf8 input detected'. It’s strange, because the page is not in
> UTF-8. The head contains: <meta content="text/html; charset=ISO-8859-1"
> http-equiv="Content-Type">. I have tried to find how to specify the character
> set in reading files with Zinc, but without success.*
>
> If it’s relevant, I am using Pharo4.0 Latest update: #40613, downloaded two
> days ago. The address of the web page is:
> http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk.
> Other pages from the same source are loaded and analysed with no problem.
> Processing this page seems to go off course as soon as it encounters the
> character code 246, which is a correct o-umlaut in ISO-8859-1.
>
> Any advice gratefully received.
>
> Peter Kenny
>
> *I would be happy with advice to RTFM, if someone would point out the
> relevant bit of the FM.
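
The failure discussed in this thread can probably be reproduced without any network access. The snippet below is only a sketch for a Pharo playground, assuming an image with Zinc loaded and assuming the encoders respond to decodeBytes: as they do in recent Zinc versions. The single-byte input is made up for illustration, and the exact error class and message text may differ between Zinc versions.

```smalltalk
| bytes utf8Outcome latin1Outcome |

"Byte 246 is o-umlaut in ISO-8859-1, but on its own it is not a valid UTF-8 sequence.
 This one-byte input is an illustrative assumption, not data from the page in question."
bytes := #[246].

"Decoding as UTF-8 is expected to signal an error; catch it and keep its message text."
utf8Outcome := [ ZnUTF8Encoder new decodeBytes: bytes ]
	on: Error
	do: [ :error | error messageText ].

"Decoding the same byte as ISO-8859-1 should yield a one-character string."
latin1Outcome := ZnByteEncoder iso88591 decodeBytes: bytes.

"Inspect both outcomes side by side."
{ utf8Outcome. latin1Outcome } inspect.
```

If this behaves as expected, it shows why the UTF-8 error appears for a page that is actually well-formed ISO-8859-1, and why wrapping the request in ZnDefaultCharacterEncoder value:during: as described above avoids it.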