Re: [Pharo-users] PetitParser question parsing HTML meta tags

Paul DeBruicker Wed, 05 Apr 2017 08:40:29 -0700

Thanks.  I really appreciate everyone's help on this.  I was at a high level
of frustration the other day.

monty-3 wrote
> You could use XMLHTMLParser from STHub PharoExtras/XMLParserHTML
> (supported on Pharo, Squeak, and GS):
> 
> descriptions := OrderedCollection new.
> (XMLHTMLParser parseURL: aURL)
>       allElementsNamed: 'meta'
>       do: [:each |
>               ((each attributeAt: 'name') asLowercase = 'description'
>                       or: [(each attributeAt: 'http-equiv') asLowercase = 'description'])
>                       ifTrue: [descriptions addLast: (each attributeAt: 'content')]].
> 
> It accepts messy HTML and produces an XML DOM tree from it.
> 
>> Sent: Thursday, March 30, 2017 at 1:58 PM
>> From: "PAUL DEBRUICKER" <pdebruic@>
>> To: "Any question about pharo is welcome" <pharo-users@.pharo>
>> Subject: [Pharo-users] PetitParser question parsing HTML meta tags
>>
>> This is kind of an "I'm tired of thinking about this and not making much
>> progress for the amount of time I'm putting in" question, but here it is:
>> 
>> 
>> 
>> I'm trying to parse descriptions from HTML meta elements.  I can't use
>> Soup because there isn't a working GemStone port.  
>> 
>> I've got it to work with the structure:
>> 
>> 
>> <meta name="description" content="my description">
>> 
>> and 
>> 
>> 
>> <meta name="Description" content="my description">
>> 
>> 
>> but I'm running into instances of: 
>> 
>> 
>> <meta http-equiv="description" content="my description">
>> 
>> and
>> 
>> 
>> <meta http-equiv="Description" content="my description">
>> 
>> 
>> and am having trouble adapting my parsing code (such as it is). 
>> 
>> 
>> The parsing code that addresses the first two cases is:
>> 
>> 
>> 
>> parseHtmlPageForDescription: htmlString
>>   | startParser endParser ppStream descParser result text lower str doubleQuoteIndex |
>>   lower := 'escription' asParser.
>>   startParser := '<meta name=' asParser , #'any' asParser , #'any' asParser.
>>   endParser := '>' asParser.
>>   ppStream := htmlString readStream asPetitStream.
>>   descParser := ((#'any' asParser starLazy: startParser , lower)
>>     , (#'any' asParser starLazy: endParser)) ==> #'second'.
>>   result := descParser parse: ppStream.
>>   text := (result
>>     inject: (WriteStream on: String new)
>>     into: [ :stream :char | 
>>       stream nextPut: char.
>>       stream ])
>>     contents trimBoth.
>>   str := text copyFrom: (text findString: 'content=') + 9 to: text size.
>>   doubleQuoteIndex := 8 - ((str last: 7) indexOf: $").
>>   ^ str copyFrom: 1 to: str size - doubleQuoteIndex
>> 
>> 
>> I can't figure out how to change the startParser parser to accept the
>> second idiom.  And maybe there's a better approach altogether.  Anyway. 
>> If anyone has any ideas on different approaches I'd appreciate learning
>> them.  
>> 
>> 
>> Thanks for giving it some thought
>> 
>> Paul
>>
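
For the quoted question about getting startParser to accept both the name= and
the http-equiv= forms, here is a minimal PetitParser sketch rather than a
definitive fix. It uses ordered choice (/) for the two attribute names and
caseInsensitive for "description" / "Description". The variable names and the
sample string are made up for illustration. It assumes the name or http-equiv
attribute comes first and is followed by a double-quoted content attribute;
other attribute orders or single quotes are not handled.

```smalltalk
"Sketch only: assumes PetitParser is loaded, and that name/http-equiv
 precedes a double-quoted content attribute."
| html quote attrName metaStart content descParser descriptions |
html := '<head><meta http-equiv="Description" content="my description"></head>'.

quote := $" asParser.

"Ordered choice (/) accepts either attribute name."
attrName := 'name' asParser / 'http-equiv' asParser.

"caseInsensitive lets 'description' also match 'Description'."
metaStart := '<meta ' asParser , attrName , '=' asParser , quote
	, 'description' asParser caseInsensitive , quote.

"Capture the text between the quotes of content=; it is the 4th element."
content := #space asParser plus , 'content=' asParser , quote
	, quote negate star flatten , quote
	==> [ :nodes | nodes at: 4 ].

"Appending to a sequence extends it, so the captured value is the last element."
descParser := metaStart , content ==> #last.

"Collect every match in the page, skipping over text that does not match."
descriptions := descParser matchesSkipIn: html.
```

Evaluated in a workspace, descriptions should answer a collection holding the
content value of each matching meta element. Because the value is captured by
the parser itself, the copyFrom:to: and indexOf: arithmetic on the raw text is
no longer needed.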





--
View this message in context: 
http://forum.world.st/PetitParser-question-parsing-HTML-meta-tags-tp4940587p4941367.html
Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.
