On Sat, 1 Jun 2024 13:34:11 -0600 Mats Wichmann <m...@wichmann.us> wrote:
> On 5/31/24 11:59, Dieter Maurer via Python-list wrote:
> > hmmm, I "sent" this but there was some problem and it remained
> > unsent. Just in case it hasn't All Been Said Already, here's the
> > retry:
> >
> > HenHanna wrote at 2024-5-30 13:03 -0700:
> >>
> >> Given a text file of a novel (JoyceUlysses.txt) ...
> >>
> >> could someone give me a pretty fast (and simple) Python program
> >> that'd give me a list of all words occurring exactly once?
> >
> > Your task can be split into several subtasks:
> > * parse the text into words
> >
> > This depends on your notion of "word".
> > In the simplest case, a word is any maximal sequence of
> > non-whitespace characters. In this case, you can use `split` for
> > this task.
>
> This piece is by far "the hard part", because of the ambiguity. For
> example, if I just treat any run of non-whitespace as a word, then a
> word followed by punctuation is counted as distinct from the same
> word without it. What about hyphenation - of which there are both
> the compound-word forms and the ones at the ends of lines if the
> source text has been formatted that way? Are all-lowercase words
> different from the same word starting with a capital? What about
> non-initial capitals, as happens a fair bit in modern usage with
> acronyms, trademarks (perhaps not in Ulysses? :-) ), etc.? What
> about accented letters?
>
> If you want what's at least a quick starting point to play with, you
> could use a very simple regex - a fair amount of thought has gone
> into what a "word character" is (\w), so it deals with excluding
> both punctuation and whitespace.
>
> import re
> from collections import Counter
>
> with open("JoyceUlysses.txt", "r") as f:
>     wordcount = Counter(re.findall(r'\w+', f.read().lower()))
>
> Now you have a Counter object counting all the "words" (by this
> definition) in the document, with their occurrence counts. You can
> fish through that to answer the questions asked (find entries with
> a count of 1, 2, 3, etc.).
>
> Some people Go Big and use something that actually tries to
> recognize the language, as opposed to making assumptions from ranges
> of characters. nltk is a choice there. But at this point it's not
> really "simple" any longer (though nltk experts might end up
> disagreeing with that).

The Gutenberg Project publishes "plain text". That's another problem,
because "plain text" means UTF-8... and that means Unicode... and
that means running some sort of Unicode-to-ASCII conversion in order
to get something like "words". A couple of hours... a couple of
hundred lines of C... problem solved!
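(Less facetiously: in Python the Unicode-to-ASCII step is a few lines
with the stdlib's unicodedata module, not a few hundred of C. A rough
sketch, assuming that stripping accents and silently dropping anything
else non-ASCII is acceptable for the purpose of finding "words":)

import unicodedata

def to_ascii(text):
    # Decompose accented characters ('e-acute' -> 'e' + combining
    # accent), then drop whatever doesn't survive an ASCII encoding.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("café Télémaque"))   # -> 'cafe Telemaque'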
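To actually fish the once-only words out of Mats's Counter, the last
step is a one-liner. A sketch that carries over the filename and the
r'\w+' notion of "word" from Mats's snippet, with encoding="utf-8"
added on account of the Gutenberg point above:

import re
from collections import Counter

with open("JoyceUlysses.txt", "r", encoding="utf-8") as f:
    wordcount = Counter(re.findall(r'\w+', f.read().lower()))

# Entries with a count of exactly 1 (the "hapax legomena").
once = [word for word, count in wordcount.items() if count == 1]
print(len(once))
print(sorted(once)[:10])   # a small alphabetical sample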
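And if anyone wants to try the nltk route Mats mentioned, something
like the following should do it - untested here, and it assumes
you've installed nltk and can fetch the "punkt" tokenizer data:

import nltk
from collections import Counter

nltk.download("punkt")   # one-time fetch of the tokenizer model

with open("JoyceUlysses.txt", "r", encoding="utf-8") as f:
    tokens = nltk.word_tokenize(f.read().lower())

# word_tokenize emits punctuation as separate tokens; keep only the
# alphabetic ones.
words = [t for t in tokens if t.isalpha()]
counts = Counter(words)
print([w for w, c in counts.items() if c == 1][:10])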