>> Well.....when using the file linux.words as a useful master list of
>> "words".....linux.words is strict ASCII........
The meaning of "words" depends on the context. The contents of the file mentioned are a minor attempt to capture a common subset of words in English but probably are not what you mean by words in other contexts including words also in ASCII format like names and especially uncommon names or words like UNESCO. There are other selected lists of words such as valid Scrabble words or WORLDLE words for specialized purposes that exclude words of lengths that can not be used. The person looking to count words in a work must determine what words make sense for their purpose. ASCII is a small subset of UNICODE. So when using a concept of word that includes many characters from many character sets, and in many languages, things may not be easy to parse uniquely such as words containing something like an apostrophe earlier on as in d'eau. Words can flow in different directions. There can be fairly complex rules and sometimes things like compound words may need to be considered to either be one or multiple words and may even occur both ways in the same work so is every body the same as everybody? So what is being discussed here may have several components. One is to tokenize all the text to make a set of categories. Another is to count them. Perhaps another might even analyze and combine multiple categories or even look at words in context to determine if two uses of the same word are different enough to try to keep both apart in two categories Is polish the same as Polish? Once that is decided, you have a fairly simple exercise in storing the data in a searchable data structure and doing your searches to get subsets and counts and so on. As mentioned, the default native format in Python is UNICODE and ASCII files being read in may well be UNICODE internally unless you carefully ask otherwise. The conversion from ASCII to UNICODE is trivial. As for how well the regular expressions like \w work in general, I have no idea. I can be very sure they are way more costly than the simpler ones you can write that just know enough about what English words in ASCII look like and perhaps get it wrong on some edge cases. -----Original Message----- From: Python-list <python-list-bounces+avi.e.gross=gmail....@python.org> On Behalf Of Edward Teach via Python-list Sent: Tuesday, June 4, 2024 7:22 AM To: python-list@python.org Subject: Re: From JoyceUlysses.txt -- words occurring exactly once On Mon, 03 Jun 2024 14:58:26 -0400 (EDT) Grant Edwards <grant.b.edwa...@gmail.com> wrote: > On 2024-06-03, Edward Teach via Python-list <python-list@python.org> > wrote: > > > The Gutenburg Project publishes "plain text". That's another > > problem, because "plain text" means UTF-8....and that means > > unicode...and that means running some sort of unicode-to-ascii > > conversion in order to get something like "words". A couple of > > hours....a couple of hundred lines of C....problem solved! > > I'm curious. Why does it need to be converted frum Unicode to ASCII? > > When you read it into Python, it gets converted right back to > Unicode... > > > Well.....when using the file linux.words as a useful master list of "words".....linux.words is strict ASCII........ -- https://mail.python.org/mailman/listinfo/python-list -- https://mail.python.org/mailman/listinfo/python-list