Date: 1 Aug 2022
Module: scrape

Installation: pip install scrape

About:
Scrape is a rule-based web crawler and information extraction tool capable of manipulating and merging new and existing documents. XML Path Language (XPath) and regular expressions are used to define rules for filtering content and web traversal. Output may be converted into text, csv, pdf, and/or HTML formats.

Sample Source Code:

from scrape import scrape, utils


def call_scrape(cmd, filetype, num_files=None):
    if not isinstance(cmd, list):
        cmd = [cmd]
    parser = scrape.get_parser()
    args = vars(parser.parse_args(cmd))

    args["overwrite"] = True  # Avoid overwrite prompt
    if args["crawl"] or args["crawl_all"]:
        args["no_images"] = True  # Avoid save image prompt when crawling
    args[filetype] = True
    if num_files is not None:
        args[num_files] = True
    return scrape.scrape(args)


call_scrape(["demo.html"], "text")

Input: demo.html

<html><body>
ADMISSION TO ONLINE COLLEGE
<P>
Aplicants are considered for admission to Online College on the basis of their ISP, quality of their home pages and quantity of emails exchanged per day.
<P>
It is recommended that students prepare for enrollment in Online College by signing up for DSL service and buying a new computer.
<P>
<A HREF="home.html">Back to Online College home page</A>
</body>
</HTML>

Execution:
$ python scrape_sample.py

Output: demo.txt

ADMISSION TO ONLINE COLLEGEAplicants are considered for admission to Online College on the basis of their ISP, quality of their home pages and quantity of emails exchanged per day.It is recommended that students prepare for enrollment in Online College by signing up for DSL service and buying a new computer. Back to Online College home page

Reference:
https://pypi.org/project/scrape/
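
For readers curious what an HTML-to-text conversion and link discovery step involves, here is a minimal sketch using only the standard library's html.parser. It is not how scrape itself is implemented; the class TextAndLinks, the function extract, and the abbreviated DEMO string are names made up for this illustration.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Hypothetical helper: collect visible text chunks and link targets."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # visible text, one entry per run of character data
        self.links = []   # href values found on <a> tags

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <A HREF=...> matches.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.chunks.append(text)


def extract(html):
    """Return (text_chunks, links) for an HTML string."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()  # flush any buffered character data
    return parser.chunks, parser.links


# Abbreviated version of demo.html from the post (assumption: shortened for space).
DEMO = """<html><body>
ADMISSION TO ONLINE COLLEGE
<P>
It is recommended that students prepare for enrollment in Online College
by signing up for DSL service and buying a new computer.
<P>
<A HREF="home.html">Back to Online College home page</A>
</body>
</HTML>"""

if __name__ == "__main__":
    chunks, links = extract(DEMO)
    print("\n".join(chunks))
    print("links:", links)
```

The collected links are what a crawler would follow next, and the text chunks are what would be written out to a text file. A rule-based tool such as scrape layers XPath and regular-expression filters on top of steps like these.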
_______________________________________________
Chennaipy mailing list
Chennaipy@python.org
https://mail.python.org/mailman/listinfo/chennaipy