Re: Parsing (scraping) OpenGraph Tags from html HEAD

J. Landman Gay via use-livecode Sat, 29 Jul 2017 20:38:30 -0700

Here's where it's handy that delimiters can now be more than a single character. This should extract the lines you need regardless of whether they contain carriage returns or not:


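on parseHeader pData
  -- each "line" is now the text that follows one <meta property= marker
  set the lineDel to "<meta property="
  repeat for each line l in pData
    -- keep everything up to (not including) the tag's closing ">"
    if l contains "og:" then put char 1 to offset(">",l)-1 of l & cr after tList
  end repeat
  -- do something with tList
end parseHeader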
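
If you would rather end up with an array keyed by property name, a variation along these lines might work. Treat it as an untested sketch: parseOpenGraph and the variable names are made up for illustration, and it assumes every tag is written with double quotes and with property= before content=. Storing each tag in an array element also sidesteps the case where a tag wraps across lines, which would otherwise be split again once it lands in a return-delimited list.

```livecode
function parseOpenGraph pData
  local tTags, tChunk
  -- split the page on the start of each tag, as in parseHeader
  set the lineDelimiter to "<meta property="
  -- split each chunk on double quotes: item 2 = property, item 4 = content
  set the itemDelimiter to quote
  repeat for each line tChunk in pData
    -- a chunk we want looks like: "og:title" content="Some title">...
    if tChunk begins with (quote & "og:") then
      put item 4 of tChunk into tTags[item 2 of tChunk]
    end if
  end repeat
  return tTags
end parseOpenGraph

-- hypothetical usage, e.g. in a button script
on mouseUp
  local tHTML, tOG
  put url "https://www.youtube.com/user/kauaiaadheenam" into tHTML
  put parseOpenGraph(tHTML) into tOG
  put tOG["og:title"]
end mouseUp
```

Checking that the chunk begins with a quote followed by "og:" skips whatever precedes the first meta tag, as well as any meta properties that are not OpenGraph ones.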


On 7/29/17 3:16 PM, Sannyasin Brahmanathaswami via use-livecode wrote:

Given that

a) trying to instantiate an XML tree from any given web page is likely to fail 
85% of the time because they simply are never built to that strict a standard


and


b) you want to extract the OpenGraph tags from the <head> of the document:

<meta property="og:site_name" content="YouTube">
<meta property="og:url" content="https://www.youtube.com/user/kauaiaadheenam">
<meta property="og:title" content="Kauai's Hindu Monastery">
<meta property="og:image" 
content="https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg">
<meta property="og:description" content="{where hinduism meets the future}">

c) you also cannot depend on the output being line-delimited, because some CMS delivery 
"agents" will minify it to

<meta property="og:site_name" content="YouTube"><meta property="og:url" content="https://www.youtube.com/user/kauaiaadheenam"><meta 
property="og:title" content="Kauai's Hindu Monastery"><meta property="og:image" 
content="https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg"><meta property="og:description" content="{where hinduism 
meets the future}">

Has anyone rolled up a parser/scraper for this? It looks like "idiot simple text extraction", but I'm 
trying to wrap my head around how to extract the name=value pairs, and not finding anything easy… these are 
space-delimited, but then we also have spaces inside quoted strings. Maybe it's easier to target "<meta (.*?)>" 
using regex with matchText, get ALL the meta tags in the HEAD, push them to an array, then just check whether 
the key contains "og:"; if it does, we have an OpenGraph value.
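
A rough, untested sketch of that regex idea follows. One catch is that matchText only reports the first match, so this version loops with matchChunk and trims the string after each hit. The handler name ogMetaTags is made up for illustration, and the "(?s)" prefix is an assumption on my part, added so that "." also matches the line break inside a wrapped tag.

```livecode
function ogMetaTags pHTML
  local tTags, tStart, tEnd, tAttrs, tCount
  put 0 into tCount
  -- tStart and tEnd receive the character positions of the (.*?) capture
  repeat while matchChunk(pHTML, "(?s)<meta (.*?)>", tStart, tEnd)
    put char tStart to tEnd of pHTML into tAttrs
    if tAttrs contains "og:" then
      add 1 to tCount
      put tAttrs into tTags[tCount]
    end if
    -- drop everything up to the end of this match before searching again
    delete char 1 to tEnd of pHTML
  end repeat
  return tTags
end ogMetaTags
```

Each element of the returned array holds the attribute text of one tag, for example property="og:title" content="Kauai's Hindu Monastery", which still has to be split into name and value.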

I'll sleep on this, but before I wake up and write 50 lines to get this 
done…  I see the other thread on scraping pages generated by JS and suspect 
perhaps some wizard among us already has this done…would save a bit of time 
here.

BR




_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode



--
Jacqueline Landman Gay         |     jac...@hyperactivesw.com
HyperActive Software           |     http://www.hyperactivesw.com

