Re: [Pharo-users] ZnClient GET, but just the content of the <head> tag?

stepharo Sun, 27 Nov 2016 13:29:41 -0800

nice :)

On Sun, 27 Nov 2016 19:46:41 +0100, Jan Kurš <kurs....@gmail.com> wrote:

Hi,
PetitParser2 [1] supports parsing of streams. I have been experimenting with ZnClient and came up with the following solution:
1) Create a PP2 stream from ZnClient stream:
byteStream := ZnClient new
 url: 'http://pharo.org';
 streaming: true;
 get.
stream := PP2CharacterStream on: byteStream encoder: ZnUTF8Encoder new.

2) Create a parser for header:
head := '<head>' asPParser, #any asPParser starLazy, '</head>' asPParser.
3) Create a parser that reads everything up to the header or body (in case the header is not present) and parses the header:
headStart := '<head' asPParser.
bodyStart := '<body' asPParser.
parser := (#any asPParser starLazy: (headStart / bodyStart)), head ==> #second.
result := parser optimize parse: stream.
4) Finally, the contents of the header is a collection of characters. I don't know what is the best way to convert it into a string, perhaps this:
text := (result second inject: (WriteStream on: '') into: [ :stream :char | stream nextPut: char. stream ]) contents
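
For step 4, a possibly simpler conversion is String class>>streamContents:, which builds the string through a write stream in one expression. This is only a sketch: the variable chars below is a made-up stand-in for the collection of characters returned by the parser (result second in the steps above), and its sample content is invented for illustration.

```smalltalk
| chars text |
"Hypothetical stand-in for the parser output: a collection of characters."
chars := OrderedCollection withAll: '<title>Pharo</title>'.
"Write every character to a fresh stream and answer the resulting String."
text := String streamContents: [ :out |
	chars do: [ :each | out nextPut: each ] ].
text
```

This avoids threading the stream through inject:into: by hand; the result should be the same string either way.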
Cheers,
Jan

[1]: https://github.com/kursjan/petitparser2
On Sun, Nov 27, 2016 at 1:38 PM PBKResearch <pe...@pbkresearch.co.uk> wrote:
Paul
Not sure if this is helpful - I have not tried it out, but it may give you a
pointer.
As Sven says, you need to parse a stream and be able to stop when you reach
the desired point. If instead of Soup you use XMLHTMLParser, this has
streaming siblings called SAXHTMLHandler and SAX2HTMLParser. I think it
should be possible to use one or the other to stop when you reach the
</head> tag.
Personally I find the output of XMLHTMLParser easier to follow than that of
Soup, but this may be a matter of taste.

Hope this helps

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of
Sven Van Caekenberghe
Sent: 26 November 2016 18:19
To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] ZnClient GET, but just the content of the <head>
tag?

Paul,
On 26 Nov 2016, at 18:31, PAUL DEBRUICKER <pdebr...@gmail.com> wrote:

This is a micro optimization if there ever was one but I wondered if it
was possible to stop downloading and get the entity once the </head> tag has
been received.
Right now I download the whole page, parse it with Soup, then extract the
tags I want from the head.  Which works fine.  e.g.
head:=((Soup fromString: (ZnEasy get: 'http://pharo.org') entity)
findChildTag: 'html') findChildTag:'head'.
This would only be useful for large pages. Dealing with the content of
resources (like parsing HTML) is outside the scope of Zinc. However, I can
help you get started.
What you want to do is use streaming. That gives you access to the content
of a resource using a direct stream, so you could decide to stop reading
(but then you have to close the connection, else you need to read everything
anyway).
Start by having a look at ZnClient>>#downloadTo: and ZnStreamingEntity. What
you want to do is more or less the following.

ZnClient new
 url: 'http://pharo.org';
 streaming: true;
 get.
At this point, the request is done, the response is in, but the entity of
the response is not yet read. When you ask for the entity, you get a
ZnStreamingEntity which holds the stream that you then have to read from.
You can check the response (and its header) for meta info.
Your next challenge then is to process this stream so that you can parse it
in a real streaming fashion. I don't know if Soup can do this.
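
As a rough sketch of the "read, stop, then close" idea described above (not a tested solution): the block readUpToHeadEnd is a hypothetical helper that collects characters until '</head>' has been seen or the stream ends. The sketch assumes that the streaming entity answers its raw byte stream via #stream, that the page is UTF-8 encoded, and that the closing tag appears literally as '</head>' in lowercase.

```smalltalk
| readUpToHeadEnd client chars head |
"Hypothetical helper: collect characters until '</head>' was read or the stream ends."
readUpToHeadEnd := [ :charStream | | out |
	out := WriteStream on: String new.
	[ charStream atEnd or: [ out contents endsWith: '</head>' ] ]
		whileFalse: [ out nextPut: charStream next ].
	out contents ].

client := ZnClient new
	url: 'http://pharo.org';
	streaming: true;
	yourself.
client get.
"Assumption: the ZnStreamingEntity exposes the underlying byte stream via #stream."
chars := ZnCharacterReadStream
	on: client response entity stream
	encoding: 'utf-8'.
head := readUpToHeadEnd value: chars.
"Stop early: close the connection instead of reading the rest of the page."
client close.
head
```

Checking out contents after every character copies the buffer each time, which is fine for a typical <head> but wasteful for large inputs; a real streaming parser, as suggested elsewhere in this thread, would avoid that.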

Sven




--
Using Opera's mail client: http://www.opera.com/mail/
