It's easier this way (still in the form of a script):
parseHtmlPageForDescription: htmlString
    | parser endParser metaParser descriptionParser contentParser
      res1Parser res2Parser quoteParser nonQuoteParser |
    metaParser := '<meta name="' asParser.
    "the next line extends the parser to understand http-equiv"
    metaParser := metaParser | '<meta http-equiv="' asParser.
    quoteParser := $" asParser.
    nonQuoteParser := PPPredicateObjectParser anyExceptAnyOf: '"'.
    descriptionParser := nonQuoteParser star token.
    res1Parser := descriptionParser.
    res2Parser := descriptionParser.
    contentParser := '" content="' asParser trim.
    endParser := '">' asParser.
    parser := (metaParser , res1Parser , contentParser , res2Parser , endParser) end
        ==> [ :nodes |
            Array
                with: (nodes at: 2) inputValue
                with: (nodes at: 4) inputValue ].
    ^ parser parse: htmlString.
"self parseHtmlPageForDescription: self htmlString1
self parseHtmlPageForDescription: self htmlString2
self parseHtmlPageForDescription: self htmlString3 "
with
htmlString1
^'<meta name=" Description" content="my description">'
etc..
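
The parser above ends with "end", so it expects the input to be exactly one meta tag. For a whole page, a rough workspace sketch along the following lines might help. It assumes your PetitParser version provides matchesSkipIn: (I have not checked the GemStone port), and the sample page and all variable names are made up for illustration:

```smalltalk
"Workspace sketch; the sample page and all variable names are illustrative."
| attributeStart quoted metaTag page matches descriptions |
attributeStart := '<meta name="' asParser / '<meta http-equiv="' asParser.
"Everything up to, but not including, the next double quote, as a String."
quoted := $" asParser negate star flatten.
metaTag := (attributeStart , quoted , '" content="' asParser , quoted , '">' asParser)
	==> [ :nodes | Array with: (nodes at: 2) with: (nodes at: 4) ].
page := '<html><head><title>Test</title>' ,
	'<meta http-equiv="Description" content="my description">' ,
	'</head><body></body></html>'.
"Assumption: matchesSkipIn: collects the matches and skips all other input."
matches := metaTag matchesSkipIn: page.
"Keep only the tags named description, ignoring case, and answer their content."
descriptions := (matches select: [ :pair | pair first asLowercase = 'description' ])
	collect: [ :pair | pair last ].
descriptions
```

Comparing the lowercased name covers both "description" and "Description" without a separate parser for each spelling.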
you may want to read http://www.lukas-renggli.ch/blog/petitparser-1
good luck
Hartmut
This is kind of an "I'm tired of thinking about this and not making much progress for
the amount of time I'm putting in" question, but here it is:
I'm trying to parse descriptions from HTML meta elements. I can't use Soup
because there isn't a working GemStone port.
I've got it to work with the structure:
<meta name="description" content="my description">
and
<meta name="Description" content="my description">
but I'm running into instances of:
<meta http-equiv="description" content="my description">
and
<meta http-equiv="Description" content="my description">
and am having trouble adapting my parsing code (such as it is).
The parsing code that addresses the first two cases is:
parseHtmlPageForDescription: htmlString
| startParser endParser ppStream descParser result text lower str
doubleQuoteIndex |
lower := 'escription' asParser.
startParser := '<meta name=' asParser , #'any' asParser , #'any' asParser.
endParser := '>' asParser.
ppStream := htmlString readStream asPetitStream.
descParser := ((#'any' asParser starLazy: startParser , lower)
, (#'any' asParser starLazy: endParser)) ==> #'second'.
result := descParser parse: ppStream.
text := (result
inject: (WriteStream on: String new)
into: [ :stream :char |
stream nextPut: char.
stream ])
contents trimBoth.
str := text copyFrom: (text findString: 'content=') + 9 to: text size.
doubleQuoteIndex := 8 - ((str last: 7) indexOf: $").
^ str copyFrom: 1 to: str size - doubleQuoteIndex
I can't figure out how to change the startParser parser to accept the second
idiom. And maybe there's a better approach altogether. Anyway. If anyone has
any ideas on different approaches I'd appreciate learning them.
Thanks for giving it some thought
Paul
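
For reference, the smallest change to the startParser in the question is probably an ordered choice between the two attribute prefixes. This is only a sketch, assuming the rest of the method stays as posted. The three test strings are invented for illustration:

```smalltalk
"Workspace sketch: the original startParser with an ordered choice (/) added."
| startParser |
startParser := ('<meta name=' asParser / '<meta http-equiv=' asParser)
	, #any asParser , #any asParser.
"Answer whether each sample prefix is rejected by the parser."
Array
	with: (startParser parse: '<meta name="D') isPetitFailure
	with: (startParser parse: '<meta http-equiv="d') isPetitFailure
	with: (startParser parse: '<meta charset="u') isPetitFailure
```

The first two prefixes should be accepted and the third rejected. The two #any parsers still consume the opening quote and the first letter, as in the original.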