[Pharo-users] PetitParser question parsing HTML meta tags

Hartmut Krasemann Fri, 31 Mar 2017 04:07:39 -0700

It's easier this way (still in the form of a script):

parseHtmlPageForDescription: htmlString

| parser endParser metaParser descriptionParser contentParser res1Parser res2Parser quoteParser nonQuoteParser |

  metaParser := '<meta name="' asParser.
  "the next line extends the parser to understand http-equiv"
  metaParser := metaParser | '<meta http-equiv="' asParser.
  quoteParser := $" asParser.
  nonQuoteParser := PPPredicateObjectParser anyExceptAnyOf: '"'.
  descriptionParser := nonQuoteParser star token.
  res1Parser := descriptionParser .
  res2Parser := descriptionParser .
  contentParser := '" content="' asParser trim.
  endParser := '">' asParser.

  parser := (metaParser , res1Parser , contentParser , res2Parser , endParser) end
    ==> [ :nodes | Array with: (nodes at: 2) inputValue with: (nodes at: 4) inputValue ].

  ^parser parse: htmlString.



"self parseHtmlPageForDescription:  self htmlString1
self parseHtmlPageForDescription:  self htmlString2
self parseHtmlPageForDescription:  self htmlString3   "

with
htmlString1
  ^'<meta name="  Description" content="my description">'
etc..
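
Because the parser above finishes with "end", it only accepts a string that consists of exactly one meta tag. If the tag sits somewhere inside a full page, a variant along the following lines might help. This is only a sketch: findMetaPairsIn: is a made-up selector, and it assumes that the PetitParser version in use provides matchesIn: and flatten. Matching stays case-sensitive and expects exactly the attribute order and spacing shown in the examples.

```smalltalk
findMetaPairsIn: htmlPage
    "Hypothetical helper, not part of the script above."
    | open name content tag |
    "Accept either attribute spelling at the start of the tag."
    open := '<meta name="' asParser / '<meta http-equiv="' asParser.
    "An attribute value is any run of characters up to the next double quote."
    name := $" asParser negate star flatten.
    content := $" asParser negate star flatten.
    "No 'end' here, so the tag may appear anywhere in a larger input."
    tag := (open , name , '" content="' asParser , content , '">' asParser)
        ==> [ :nodes | Array with: (nodes at: 2) with: (nodes at: 4) ].
    "matchesIn: answers a collection with one result per match found."
    ^ tag matchesIn: htmlPage
```

Each element of the answered collection should be a two-element Array holding the attribute value and the content value, which can then be filtered for the description entry.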

you may want to read http://www.lukas-renggli.ch/blog/petitparser-1

good luck
Hartmut

This is kind of an "I'm tired of thinking about this and not making much progress for 
the amount of time I'm putting in" question, but here it is:



I'm trying to parse descriptions from HTML meta elements.  I can't use Soup 
because there isn't a working GemStone port.

I've got it to work with the structure:

<meta name="description" content="my description">

and

<meta name="Description" content="my description">


but I'm running into instances of:

<meta http-equiv="description" content="my description">

and

<meta http-equiv="Description" content="my description">


and am having trouble adapting my parsing code (such as it is).


The parsing code that addresses the first two cases is:



parseHtmlPageForDescription: htmlString
   | startParser endParser ppStream descParser result text lower str doubleQuoteIndex |
   lower := 'escription' asParser.
   startParser := '<meta name=' asParser , #'any' asParser , #'any' asParser.
   endParser := '>' asParser.
   ppStream := htmlString readStream asPetitStream.
   descParser := ((#'any' asParser starLazy: startParser , lower)
     , (#'any' asParser starLazy: endParser)) ==> #'second'.
   result := descParser parse: ppStream.
   text := (result
     inject: (WriteStream on: String new)
     into: [ :stream :char |
       stream nextPut: char.
       stream ])
     contents trimBoth.
   str := text copyFrom: (text findString: 'content=') + 9 to: text size.
   doubleQuoteIndex := 8 - ((str last: 7) indexOf: $").
   ^ str copyFrom: 1 to: str size - doubleQuoteIndex


I can't figure out how to change the startParser parser to accept the second 
idiom.  And maybe there's a better approach altogether.  Anyway.  If anyone has 
any ideas on different approaches I'd appreciate learning them.


Thanks for giving it some thought

Paul



--
signatur

Hartmut Krasemann
Königsberger Str. 41 c
D 22869 Schenefeld
Tel. 040.8307097
Mobil 0171.6451283
krasem...@acm.org
