Monty

As an update, I have rebuilt from the Moose 6.0 download. The version of XML-Parser in that download was dated 18 July 2016 (configuration monty.233), so I installed versions of XML-Parser-HTML and XML-Parser-StAX contemporary with that. (The respective configurations are monty.48 and monty.39.) With these versions all my previous XMLHTMLParser operations work as before, and I have been able to use the StAX parser in a simple way. So I can start exploring as I intended.

I have made repeated attempts to update this rebuilt image to more recent versions of the HTML and StAX parsers, and every time I run into the same error reported below. I started from the latest version and worked backwards, but gave up quickly; it takes about 6 minutes on my machine to load and compile a version, and it soon gets tedious. If I feel more enthusiastic tomorrow, I might start working forwards from my current versions. Anyway, I now have a working system with the StAX and HTML parsers, so I can continue to explore.

Best wishes

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of PBKResearch
Sent: 15 May 2017 20:44
To: 'Any question about pharo is welcome' <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Monty

I have just started trying to use the StAX parsers, and I have found that the update has introduced a problem, which means that XMLHTMLParser no longer works on examples I have used before. I updated to ConfigurationOfXMLParser(monty.302), which is the latest version on the smalltalkhub repository, and then used the load version in the class comment, which loads the stable default. Similarly, I loaded ConfigurationOfXMLParserHTML(monty.62) and ConfigurationOfXMLParserStAX(monty.51), again using stable and default.

When I try to run the XMLHTMLParser example I quoted below, I get an error message 'MessageNotUnderstood: receiver of "critical:" is nil'. The same message comes up with anything else I try with XMLHTMLParser or with StAXHTMLParser. I am not really up to using the debugger on someone else's code, but the one thing I can see is that the problem lies in XMLKeyValueCache>>critical:, which has the code:

    ^ self mutex critical: aBlock

The problem is that mutex is nil.

In my enthusiasm, I saved the updated image with the same name as the old image, which is now therefore overwritten.
If I cannot solve this problem, my only way out is to rebuild my image from the Moose 6.0 download. Any suggestions gratefully received.

Thanks in advance

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of PBKResearch
Sent: 15 May 2017 19:16
To: 'Any question about pharo is welcome' <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Monty

Many thanks for this. My original purpose was just to answer Paul deBruicker's query, namely to parse an HTML file and stop reading at the end of the <head> section. I solved this by trial and error using the code shown below (which actually stops at the opening tag of the body). This was not my problem at all, but Paul's; I just tackled it for fun.

However, your note has prompted me to update my version of the whole XML system. I was using the version I downloaded with Moose 6.0, which was dated August 2016. I am looking at the StAX parsers as a possible way of simplifying what I currently do, which involves downloading an entire web page as a DOM and then manipulating it with XPath to extract the bits I am interested in. I may be able to use StAX to do some of the selection and manipulation as I am reading. It's all a new topic to me, so I foresee a lot of experimentation. It all helps to keep the grey matter active.

Thanks again

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of monty
Sent: 15 May 2017 12:15
To: pharo-users@lists.pharo.org
Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

For that kind of incremental parsing, you could also use XMLParserStAX, a pull-parser that parses a document as a stream of event objects you control with #next, #peek, and #atEnd.
It also supports pull-DOM parsing with messages like #nextNode, #nextElement, and #nextElementNamed:, which return the next event object(s) as DOM subtrees (searchable with XPath). See the StAXParser class comment for an example. (The StAXHTMLParser class requires XMLParserHTML be installed to work.)
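
As a rough illustration of the two styles just described, a minimal playground sketch might look like the following. It assumes the parser is created with StAXParser on:, by analogy with the other XMLParser classes (that constructor is an assumption, not something stated in this thread), and the sample document is made up. The StAXParser class comment remains the authoritative example.

```smalltalk
"Sketch only. 'StAXParser on:' is an assumed constructor; the other parser
 selectors (#atEnd, #next, #nextElementNamed:) are the ones named above.
 The XML string is a made-up sample."
| xml parser head |
xml := '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>'.

"Event-by-event pull parsing: the caller asks for each event in turn."
parser := StAXParser on: xml.
[ parser atEnd ] whileFalse: [
	Transcript show: parser next printString; cr ].

"Pull-DOM parsing: ask for the next <head> element as a DOM subtree."
parser := StAXParser on: xml.
head := parser nextElementNamed: 'head'.
Transcript show: head printString; cr.
```

The second half is the shape that fits the original query: once the <head> subtree has been returned, the caller can simply stop asking the parser for more.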