With reference to Norbert's comment, there /may/ be an ambiguity about the
word 'header' in Udo's reply. It could refer to the HTTP response headers, in
which case Norbert is of course right. It could also refer to the <head>
section of the HTML file, which is part of the content of the HTTP response.
If it is the latter, this is similar to a question that Paul deBruicker
posted last November ("[Pharo-users] ZnClient GET, but just the content of
the <head> tag?"). I tried the method I devised for Paul's case on Udo's
problem website, and read the HTML <head> section with no problem.
Incidentally, the <head> declares 'charset=iso-8859-1', which does not agree
with Sven's finding that pure latin1 decoding fails.
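For the first reading (HTTP headers only), Norbert's suggestion could be tried roughly as follows. This is a minimal sketch, not something from the thread: it assumes that ZnClient>>head, ZnClient>>response, ZnResponse>>headers and ZnResponse>>contentType are available in the loaded Zinc version, and it has not been run against this particular server.

```smalltalk
"Sketch only: send a HEAD request and look at the HTTP response headers.
 No entity body is transferred, so no character decoding takes place."
| client response |
client := ZnClient new.
client url: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.
client head.
response := client response.
response headers.      "all HTTP header fields"
response contentType.  "mime type, including any charset parameter the server sent"
```

Inspecting the last two expressions in a Playground should show whether the server declares a charset at the HTTP level, which is separate from whatever the HTML <head> says.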
In case it is of interest, I used XMLHTMLParser to read and parse the
header. Try the following in a Playground:
    par := XMLHTMLParser onURL: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.
    "Stop parsing as soon as the <body> element is reached."
    par parseDocumentUntil: [ | top |
        (top := par topNode) notNil
            and: [ top isElement and: [ top isNamed: 'body' ] ] ].
    par parsingResult findElementNamed: 'head'.
If you 'Do it and go', the full header appears. The way I get it to stop
after the header may not be quite correct, because it uses
XMLHTMLParser>>topNode, which is a private method. On the other hand, I
can't see how to make the stop condition for
XMLHTMLParser>>parseDocumentUntil: depend on the parsed results without
using a private method.
Hope this is helpful.
Peter Kenny
-----Original Message-----
From: Pharo-users [mailto:[email protected]] On Behalf Of Norbert Hartl
Sent: 12 May 2017 08:04
To: Any question about pharo is welcome <[email protected]>
Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding
Just to mention. If you are not interested in the content body you could do
a HEAD request instead of GET.
Norbert
> On 11.05.2017 at 22:44, Udo Schneider <[email protected]> wrote:
>
> Hi Sven,
>
> that's perfect. To be honest I don't care about the content - I'm just
> parsing the header. And even if there is a wrong decoding in there... I can
> live with that.
>
> Thank you very very much! For your help but also your stuff in general.
>
> CU,
>
> Udo
>
>
>> On 11/05/17 at 22:35, Sven Van Caekenberghe wrote:
>> Hi Udo,
>>> On 11 May 2017, at 21:37, Udo Schneider <[email protected]> wrote:
>>>
>>> All,
>>>
>>> I'm hitting an error where fetching web content fails. The website does
>>> indeed use invalid characters.
>>>
>>> The easiest way to reproduce:
>>>
>>> ZnEasy get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'
>>>
>>> Is there any way to tell Zinc to simply ignore that error and to
>>> continue?
>>>
>>> CU,
>>>
>>> Udo
>> That server/page has a mime-type text/plain with no explicit encoding
>> (charset) setting, so we have to guess. Like utf-8, pure latin1/iso88591
>> does not work. The following does work, but you can't be sure everything
>> went well (beLenient takes some bytes as they are).
>> ZnDefaultCharacterEncoder
>>     value: ZnCharacterEncoder latin1 beLenient
>>     during: [
>>         ZnClient new
>>             get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
>>             yourself ].
>> I added some API earlier today, so that the following should also work
>> (you need to load Zn #bleedingEdge first).
>> ZnClient new
>>     defaultEncoder: ZnCharacterEncoder latin1 beLenient;
>>     get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
>>     yourself.
>> HTH,
>> Regards,
>> Sven
>
>
>