You could use XMLHTMLParser from STHub PharoExtras/XMLParserHTML (supported on Pharo, Squeak, and GemStone):

    descriptions := OrderedCollection new.
    (XMLHTMLParser parseURL: aURL)
        allElementsNamed: 'meta'
        do: [:each |
            ((each attributeAt: 'name') asLowercase = 'description'
                or: [(each attributeAt: 'http-equiv') asLowercase = 'description'])
                    ifTrue: [descriptions addLast: (each attributeAt: 'content')]].

It accepts messy HTML and produces an XML DOM tree from it.

> Sent: Thursday, March 30, 2017 at 1:58 PM
> From: "PAUL DEBRUICKER" <pdebr...@gmail.com>
> To: "Any question about pharo is welcome" <pharo-users@lists.pharo.org>
> Subject: [Pharo-users] PetitParser question parsing HTML meta tags
>
> This is kind of a "I'm tired of thinking about this and not making much
> progress for the amount of time I'm putting in question" but here it is:
>
> I'm trying to parse descriptions from HTML meta elements. I can't use Soup
> because there isn't a working GemStone port.
>
> I've got it to work with the structure:
>
> <meta name="description" content="my description">
>
> and
>
> <meta name="Description" content="my description">
>
> but I'm running into instances of:
>
> <meta http-equiv="description" content="my description">
>
> and
>
> <meta http-equiv="Description" content="my description">
>
> and am having trouble adapting my parsing code (such as it is).
>
> The parsing code that addresses the first two cases is:
>
> parseHtmlPageForDescription: htmlString
>     | startParser endParser ppStream descParser result text lower str doubleQuoteIndex |
>     lower := 'escription' asParser.
>     startParser := '<meta name=' asParser , #'any' asParser , #'any' asParser.
>     endParser := '>' asParser.
>     ppStream := htmlString readStream asPetitStream.
>     descParser := ((#'any' asParser starLazy: startParser , lower)
>         , (#'any' asParser starLazy: endParser)) ==> #'second'.
>     result := descParser parse: ppStream.
>     text := (result
>         inject: (WriteStream on: String new)
>         into: [ :stream :char |
>             stream nextPut: char.
>             stream ])
>         contents trimBoth.
>     str := text copyFrom: (text findString: 'content=') + 9 to: text size.
>     doubleQuoteIndex := 8 - ((str last: 7) indexOf: $").
>     ^ str copyFrom: 1 to: str size - doubleQuoteIndex
>
> I can't figure out how to change the startParser parser to accept the second
> idiom. And maybe there's a better approach altogether. Anyway. If anyone
> has any ideas on different approaches I'd appreciate learning them.
>
> Thanks for giving it some thought
>
> Paul