Hey, this is something I have been working on for a very long time. It’s one of the reasons I got into programming at all. I’d really appreciate it if people could offer some advice on this.
This is a really simple program which extracts the text from a webpage and displays it one sentence at a time. It’s meant to help you study dense material, especially documentation, with much more focus and comprehension. I actually hope it can be of help to people who have difficulty reading. I know it’s been of use to me at least.

Here is a minimal working version:

deepreader.py:

    import sys
    import requests
    import html2text
    import nltk

    url = sys.argv[1]

    # Get the HTML, pull out the text, and sentence-segment it in one line of code
    sentences = nltk.sent_tokenize(html2text.html2text(requests.get(url).text))

    # Activate an elementary reader interface for the text
    for index, sentence in enumerate(sentences):

        # Print the sentence
        print("\n" + str(index) + "/" + str(len(sentences)) + ": " + sentence + "\n")

        # Wait for the user to press Enter
        x = input("\n> ")

EOF

That’s it. A lot of refining is possible, and I’d really like to see how some more experienced people might handle it.

1. The HTML extraction is not perfect. The text isn’t as clean as I would like. Sometimes stray links or tags get left in, and sentences are sometimes broken by newlines in the middle.

2. The segmentation isn’t perfect either. I am currently looking into building a better segmenter with tools from spaCy.

Brevity is greatly valued. Anyone who can make the program better, that’s hugely appreciated. But if someone can do it in very few lines of code, that’s also appreciated.

Thanks very much,
Julius

--
https://mail.python.org/mailman/listinfo/python-list
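
P.S. For point 1, one possible stopgap is to post-process the html2text output with a couple of regular expressions before segmenting. The sketch below is only an illustration, not a definitive fix: clean_text is a hypothetical helper name, and the regexes assume that html2text emits Markdown-style [label](url) links and ![alt](url) images and hard-wraps paragraphs, with blank lines between paragraphs.

```python
import re


def clean_text(markdown_text):
    """Tidy html2text-style output before segmentation (hypothetical helper)."""
    # Drop image markup entirely: ![alt](url)
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", markdown_text)
    # Replace inline links [label](url) with just the label
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Treat blank lines as paragraph breaks (assumption about the input)
    paragraphs = re.split(r"\n\s*\n", text)
    # Join hard-wrapped lines inside each paragraph with single spaces
    unwrapped = [" ".join(p.split()) for p in paragraphs]
    # Discard paragraphs that ended up empty, e.g. ones that held only an image
    return "\n\n".join(p for p in unwrapped if p)
```

It would slot into deepreader.py as nltk.sent_tokenize(clean_text(html2text.html2text(requests.get(url).text))). One limitation of this approach: consecutive list items with no blank line between them get joined into a single paragraph, so it trades some structure for cleaner sentences.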