Re: [Pharo-users] PetitParser question parsing HTML meta tags

monty Sun, 02 Apr 2017 16:36:29 -0700

XMLParserHTML is the fastest HTML parser on Pharo, Squeak, and GemStone (GS). 
It provides both DOM and SAX parsers and works with other libraries such as 
PharoExtras/XPath and PharoExtras/XMLParserStAX.


Element and attribute names are normalized to lowercase. Printing XML DOM 
trees back as HTML is complicated by browsers not recognizing XML-style 
self-closing tags ending with "/>" for some elements (like "script"), so use 
one of these instead:

  #printedWithoutSelfClosingTags
  #printWithoutSelfClosingTagsOn:
  #printWithoutSelfClosingTagsToFileNamed:
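
For the meta description case in the question, a minimal, untested sketch 
could look like the one below. It assumes XMLParserHTML is loaded, that the 
DOM parser class is XMLHTMLParser, and that #allElementsNamed: and 
#attributeAt:ifAbsent: are available on the parsed nodes. The sample HTML 
string and the variable names are made up for illustration.

```smalltalk
"Sketch only. Assumes XMLParserHTML is loaded; the sample input is hypothetical."
| html doc descriptions |
html := '<html><head>
<meta name="Description" content="first description">
<meta http-equiv="description" content="second description">
<meta name="keywords" content="not a description">
</head><body></body></html>'.

"Parse into a DOM tree. Element and attribute names come back lowercased."
doc := XMLHTMLParser parse: html.

descriptions := OrderedCollection new.
(doc allElementsNamed: 'meta') do: [ :each |
	| key |
	"Accept either idiom: name=... or http-equiv=..."
	key := each
		attributeAt: 'name'
		ifAbsent: [ each attributeAt: 'http-equiv' ifAbsent: [ '' ] ].
	"Attribute values keep their original case, so compare in lowercase."
	key asLowercase = 'description'
		ifTrue: [ descriptions add: (each attributeAt: 'content' ifAbsent: [ '' ]) ] ].

"The collected content attribute values."
descriptions
```

Walking the DOM this way avoids hand-written index arithmetic on the 
"content=" substring, and the same loop handles both the name and the 
http-equiv forms. If the tree needs to be written back out as HTML, 
#printedWithoutSelfClosingTags is the message to send to doc.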

> Sent: Thursday, March 30, 2017 at 1:58 PM
> From: "PAUL DEBRUICKER" <pdebr...@gmail.com>
> To: "Any question about pharo is welcome" <pharo-users@lists.pharo.org>
> Subject: [Pharo-users] PetitParser question parsing HTML meta tags
>
> This is kind of an "I'm tired of thinking about this and not making much 
> progress for the amount of time I'm putting in" question, but here it is: 
> 
> 
> 
> I'm trying to parse descriptions from HTML meta elements.  I can't use Soup 
> because there isn't a working GemStone port.  
> 
> I've got it to work with the structure:
> 
> <meta name="description" content="my description">
> 
> and 
> 
> <meta name="Description" content="my description">
> 
> 
> but I'm running into instances of: 
> 
> <meta http-equiv="description" content="my description">
> 
> and
> 
> <meta http-equiv="Description" content="my description">
> 
> 
> and am having trouble adapting my parsing code (such as it is). 
> 
> 
> The parsing code that addresses the first two cases is:
> 
> 
> 
> parseHtmlPageForDescription: htmlString
>   | startParser endParser ppStream descParser result text lower str doubleQuoteIndex |
>   lower := 'escription' asParser.
>   startParser := '<meta name=' asParser , #'any' asParser , #'any' asParser.
>   endParser := '>' asParser.
>   ppStream := htmlString readStream asPetitStream.
>   descParser := ((#'any' asParser starLazy: startParser , lower)
>     , (#'any' asParser starLazy: endParser)) ==> #'second'.
>   result := descParser parse: ppStream.
>   text := (result
>     inject: (WriteStream on: String new)
>     into: [ :stream :char | 
>       stream nextPut: char.
>       stream ])
>     contents trimBoth.
>   str := text copyFrom: (text findString: 'content=') + 9 to: text size.
>   doubleQuoteIndex := 8 - ((str last: 7) indexOf: $").
>   ^ str copyFrom: 1 to: str size - doubleQuoteIndex
> 
> 
> I can't figure out how to change the startParser parser to accept the second 
> idiom.  And maybe there's a better approach altogether.  Anyway.  If anyone 
> has any ideas on different approaches I'd appreciate learning them.  
> 
> 
> Thanks for giving it some thought
> 
> Paul
>
