Monty,

Many thanks for this. My original purpose was just to answer Paul deBruicker's query, namely to parse an html file and stop reading at the end of the <head> section. I solved this by trial and error using the code shown below (which actually stops at the opening tag of the body). This was not my problem at all, but Paul's; I just tackled it for fun.

However, your note has prompted me to update my version of the whole XML system - I was using the version I downloaded with Moose 6.0, which was dated August 2016. I am looking at the StAX parsers as a possible way of simplifying what I currently do, which involves downloading an entire web page as a DOM and then manipulating it with XPath to extract the bits I am interested in. I may be able to use StAX to do some of the selection and manipulation as I am reading. It's all a new topic to me, so I foresee a lot of experimentation. It all helps to keep the grey matter active.

Thanks again

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of monty
Sent: 15 May 2017 12:15
To: pharo-users@lists.pharo.org
Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

For that kind of incremental parsing, you could also use XMLParserStAX, a pull-parser that parses a document as a stream of event objects you control with #next, #peek, and #atEnd. It also supports pull-DOM parsing with messages like #nextNode, #nextElement, and #nextElementNamed:, which return the next event object(s) as DOM subtrees (searchable with XPath). See the StAXParser class comment for an example. (The StAXHTMLParser class requires XMLParserHTML be installed to work.)

> Sent: Friday, May 12, 2017 at 5:30 AM
> From: PBKResearch <pe...@pbkresearch.co.uk>
> To: "'Any question about pharo is welcome'" <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding
>
> With reference to Norbert's comment, there /may/ be an ambiguity about the word 'header' in Udo's reply. It could refer to the http HEAD section, in which case Norbert is of course right. It could also refer to the <head> section of the html file, which is part of the content of the http response.
> If it is the latter, this is similar to a question that Paul deBruicker posted last November ("[Pharo-users] ZnClient GET, but just the content of the <head> tag?"). I tried the method I devised for Paul's case on Udo's problem website, and read the html header with no problem. Incidentally, the header includes 'charset=iso-8859-1', which does not agree with Sven's findings.
>
> In case it is of interest, I used XMLHTMLParser to read and parse the header. Try the following in a Playground:
>
> par := XMLHTMLParser onURL: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.
> "Stop parsing as soon as the node on top of the parser's stack is the <body> element."
> par parseDocumentUntil: [ | top |
>     (top := par topNode) notNil
>         and: [ top isElement and: [ top isNamed: 'body' ] ] ].
> par parsingResult findElementNamed: 'head'.
>
> If you 'Do it and go', the full header appears. The way I get it to stop after the header may not be quite correct, because it uses XMLHTMLParser>>topNode, which is a private method. On the other hand, I can't see how to make the stop condition for XMLHTMLParser>>parseDocumentUntil: depend on the parsed results without using a private method.
>
> Hope this is helpful
>
> Peter Kenny
>
> -----Original Message-----
> From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of Norbert Hartl
> Sent: 12 May 2017 08:04
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding
>
> Just to mention. If you are not interested in the content body you could do a HEAD request instead of GET.
>
> Norbert
>
> > On 11.05.2017 at 22:44, Udo Schneider <udo.schnei...@homeaddress.de> wrote:
> >
> > Hi Sven,
> >
> > that's perfect. To be honest I don't care about the content - I'm just parsing the header. And even if there is a wrong decoding in there... I can live with that.
> >
> > Thank you very very much! For your help but also your stuff in general.
> >
> > CU,
> >
> > Udo
> >
> >> On 11/05/17 at 22:35, Sven Van Caekenberghe wrote:
> >> Hi Udo,
> >>> On 11 May 2017, at 21:37, Udo Schneider <udo.schnei...@homeaddress.de> wrote:
> >>>
> >>> All,
> >>>
> >>> I'm hitting an error where fetching web content fails. The website does indeed use invalid characters.
> >>>
> >>> The easiest way to reproduce:
> >>>
> >>> ZnEasy get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'
> >>>
> >>> Is there any way to tell Zinc to simply ignore that error and to continue?
> >>>
> >>> CU,
> >>>
> >>> Udo
> >>
> >> That server/page has a mime-type text/plain with no explicit encoding (charset) setting, so we have to guess. Like utf-8, pure latin1/iso88591 does not work. The following does work, but you can't be sure everything went well (beLenient takes some bytes as they are).
> >>
> >> ZnDefaultCharacterEncoder
> >>     value: ZnCharacterEncoder latin1 beLenient
> >>     during: [
> >>         ZnClient new
> >>             get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
> >>             yourself ].
> >>
> >> I added some API earlier today, so that the following should also work (you need to load Zn #bleedingEdge first).
> >>
> >> ZnClient new
> >>     defaultEncoder: ZnCharacterEncoder latin1 beLenient;
> >>     get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
> >>     yourself.
> >>
> >> HTH,
> >> Regards,
> >> Sven