Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

PBKResearch Fri, 12 May 2017 02:32:18 -0700

With reference to Norbert's comment, there /may/ be an ambiguity about the
word 'header' in Udo's reply. It could refer to the HTTP response headers, in
which case Norbert is of course right. It could also refer to the <head>
section of the HTML file, which is part of the content of the HTTP response.
If it is the latter, this is similar to a question that Paul deBruicker
posted last November ("[Pharo-users] ZnClient GET, but just the content of
the <head> tag?"). I tried the method I devised for Paul's case on Udo's
problem website, and read the HTML <head> section with no problem.
Incidentally, the <head> section includes 'charset=iso-8859-1', which does
not agree with Sven's findings.


In case it is of interest, I used XMLHTMLParser to read and parse the
header. Try the following in a Playground:

par := XMLHTMLParser onURL: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.
par parseDocumentUntil: [ | top |
    (top := par topNode) notNil
        and: [ top isElement and: [ top isNamed: 'body' ] ] ].
par parsingResult findElementNamed: 'head'.

If you 'Do it and go', the full header appears. The way I get it to stop
after the header may not be quite correct, because it uses
XMLHTMLParser>>topNode, which is a private method. On the other hand, I
can't see how to make the stop condition for
XMLHTMLParser>>parseDocumentUntil: depend on the parsed results without
using a private method.

Hope this is helpful

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:[email protected]] On Behalf Of
Norbert Hartl
Sent: 12 May 2017 08:04
To: Any question about pharo is welcome <[email protected]>
Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for
utf-8 encoding

Just to mention. If you are not interested in the content body you could do
a HEAD request instead of GET. 
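
A minimal sketch of such a HEAD request, assuming the stock ZnClient API
(head: and response) and the URL from Udo's report. Since a HEAD response
carries no body, no character decoding should take place at all:

```smalltalk
| response |
"Issue a HEAD request: only the status line and HTTP headers come back."
response := ZnClient new
    head: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
    response.
"Inspect what the server declares; the charset may well be absent."
response contentType.
response headers.
```

This only helps if the HTTP headers are what is wanted; the <head> element
of the page is part of the body and still requires a GET.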

Norbert

> On 11.05.2017 at 22:44, Udo Schneider <[email protected]> wrote:
> 
> Hi Sven,
> 
> that's perfect. To be honest I don't care about the content - I'm just
> parsing the header. And even if there is a wrong decoding in there... I can
> live with that.
> 
> Thank you very very much! For your help but also your stuff in general.
> 
> CU,
> 
> Udo
> 
> 
>> On 11/05/17 at 22:35, Sven Van Caekenberghe wrote:
>> Hi Udo,
>>> On 11 May 2017, at 21:37, Udo Schneider <[email protected]> wrote:
>>> 
>>> All,
>>> 
>>> I'm hitting an error where fetching web content fails. The website does
>>> indeed use invalid characters.
>>> 
>>> The easiest way to reproduce:
>>> 
>>> ZnEasy get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'
>>> 
>>> Is there any way to tell Zinc to simply ignore that error and to
>>> continue?
>>> 
>>> CU,
>>> 
>>> Udo
>> That server/page has a mime-type text/plain with no explicit encoding
>> (charset) setting, so we have to guess. Like utf-8, pure latin1/iso88591
>> does not work. The following does work, but you can't be sure everything
>> went well (beLenient takes some bytes as they are).
>> ZnDefaultCharacterEncoder
>>   value: ZnCharacterEncoder latin1 beLenient
>>   during: [
>>     ZnClient new
>>       get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
>>       yourself ].
>> I added some API earlier today, so that the following should also work
>> (you need to load Zn #bleedingEdge first).
>>  ZnClient new
>>   defaultEncoder: ZnCharacterEncoder latin1 beLenient;
>>   get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
>>   yourself.
>> HTH,
>> Regards,
>> Sven
> 
> 
>
