I am almost a complete novice with respect to HTML and web stuff, but suddenly find myself needing to change that. I have read around web scraping, but most of what I found is much more ambitious than what I want to do, focusses on Python, and assumes slightly (?) shady business goals, like scraping data from your competitors' websites. My goal is pure research... and of course I would love to do it in LiveCode.
I want to work through a list of (pseudo) random URLs, harvest the page title and any keywords, and aggregate the results in a field/table. So I don't want the URLs to be found on the basis of content or top-level domain at all. Metaphorically: dipping into the WWW bran tub allowing all domain suffixes, pulling out a random page, checking it has English content, checking it has keywords, extracting the page title, extracting any keywords, saving them to a table indexed by URL, and then moving on to the next lucky dip.

The specifics I would appreciate advice on are:

1/ How to get as close to a random sample of URLs as possible. There are websites that purport to take you to a random WWW page, but I couldn't work out how they pull that off, or indeed how random the destination really is. They also only do it one page at a time, whereas I want to do it a few thousand times on the bounce, ideally without visiting any page in the browsing sense.

2/ How might I check that the page at a URL a) is in English and b) contains keywords?

3/ Is it possible to extract the title and keywords from a URL using LiveCode 'remotely', or do I need to use a browser to visit?

Thanks in advance for any advice or thoughts.

Cheers

David G

--
David Glasgow
Consultant Forensic & Clinical Psychologist
Honorary Professor, Nottingham Trent University
Sexual Offences, Crime and Misconduct Research Unit
Carlton Glasgow Partnership
Director, Child & Family Training, York
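
PS: To make the per-page step in 2/ and 3/ concrete, here is a rough sketch of what I imagine happening once a page's HTML has been fetched as plain text. It is in Python only because that is what most of what I read uses; I would still prefer a LiveCode equivalent. The names TitleKeywordParser and extract are my own inventions, and two things are assumptions on my part: that "keywords" means the content of a meta tag named "keywords", and that the lang attribute on the html tag is a usable hint for English. Many pages will have neither.

```python
from html.parser import HTMLParser


class TitleKeywordParser(HTMLParser):
    """Collects the <title> text, <meta name="keywords"> and <html lang>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.lang = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lowercased
        if tag == "title":
            self._in_title = True
        elif tag == "html":
            # Assumption: the declared language is only a hint, not proof.
            self.lang = (attrs.get("lang") or "").lower()
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            # Assumption: keywords are comma separated; drop empty entries.
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text can arrive in several pieces, so accumulate it.
        if self._in_title:
            self.title += data


def extract(html_text):
    """Return the title, keyword list and declared language of one page."""
    parser = TitleKeywordParser()
    parser.feed(html_text)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.keywords,
        "lang": parser.lang,
    }
```

The idea would be to call extract on each page's HTML, skip the page unless lang starts with "en" and the keyword list is non-empty, and otherwise store the title and keywords in the table against the URL.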