Re: From JoyceUlysses.txt -- words occurring exactly once
Edward Teach wrote at 2024-6-3 10:47 +0100:
> ...
> The Gutenberg Project publishes "plain text". That's another problem,
> because "plain text" means UTF-8...and that means unicode...and that
> means running some sort of unicode-to-ascii conversion in order to get
> something like "words". A couple of hours...a couple of hundred lines
> of C...problem solved!

Unicode supports the notion of a "word" even better than ASCII does. For example, the `\w` (word character) regular expression wildcard works for Unicode just as it does for ASCII (of course with an enlarged set of letters, digits, punctuation, etc.).

-- https://mail.python.org/mailman/listinfo/python-list
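A minimal sketch of the point about `\w`, assuming Python 3, where `str` patterns are Unicode-aware unless `re.ASCII` is passed. The helper name `find_words` and the sample sentence are illustrative only and do not come from the thread.

```python
import re

def find_words(text, ascii_only=False):
    """Return runs of word characters (letters, digits, underscore)."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(r"\w+", text, flags)

sample = "Un caf\u00e9 pr\u00e8s de l'eau"
print(find_words(sample))                   # Unicode-aware matching (default)
print(find_words(sample, ascii_only=True))  # accented letters split the words
```

Note that `\w` also matches digits and the underscore, and that the apostrophe in "l'eau" still splits the token, so this is only one possible definition of a "word".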
Re: From JoyceUlysses.txt -- words occurring exactly once
On Mon, 03 Jun 2024 14:58:26 -0400 (EDT) Grant Edwards wrote:
> On 2024-06-03, Edward Teach via Python-list wrote:
>
> > The Gutenberg Project publishes "plain text". That's another
> > problem, because "plain text" means UTF-8...and that means
> > unicode...and that means running some sort of unicode-to-ascii
> > conversion in order to get something like "words". A couple of
> > hours...a couple of hundred lines of C...problem solved!
>
> I'm curious. Why does it need to be converted from Unicode to ASCII?
>
> When you read it into Python, it gets converted right back to
> Unicode...

Well...when using the file linux.words as a useful master list of "words"...linux.words is strict ASCII
IDLE: clearing the screen
Hello everyone,

I am new to Python, and I have been using IDLE (v3.10.11) to run small Python programs. However, I have seen that the output scrolls to the bottom in the output window.

Is there a way to clear the output window (something like cls in command prompt or clear in terminal), so that output stays at the top?

Thanks in anticipation!
Re: From JoyceUlysses.txt -- words occurring exactly once
On 2024-06-04, Edward Teach via Python-list wrote:
> On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
> Grant Edwards wrote:
>
>> On 2024-06-03, Edward Teach via Python-list wrote:
>>
>> > The Gutenberg Project publishes "plain text". That's another
>> > problem, because "plain text" means UTF-8...and that means
>> > unicode...and that means running some sort of unicode-to-ascii
>> > conversion in order to get something like "words". A couple of
>> > hours...a couple of hundred lines of C...problem solved!
>>
>> I'm curious. Why does it need to be converted from Unicode to ASCII?
>>
>> When you read it into Python, it gets converted right back to
>> Unicode...
>
> Well...when using the file linux.words as a useful master list of
> "words"...linux.words is strict ASCII

I guess I missed the part of the problem description where it said to use linux.words to decide what a word is. :)

-- Grant
RE: From JoyceUlysses.txt -- words occurring exactly once
>> Well...when using the file linux.words as a useful master list of
>> "words"...linux.words is strict ASCII

The meaning of "words" depends on the context. The file mentioned is a modest attempt to capture a common subset of English words, but it is probably not what you mean by "words" in other contexts, which may include words that are also plain ASCII, such as names (especially uncommon ones) or acronyms like UNESCO.

There are other curated word lists, such as valid Scrabble words or Wordle words, built for specialized purposes and excluding words of lengths that cannot be used. Whoever is counting words in a work must decide which words make sense for their purpose.

ASCII is a small subset of Unicode. So with a concept of "word" that spans many characters from many character sets and many languages, text may not be easy to parse unambiguously, for example words with an apostrophe near the start, as in d'eau. Words can flow in different directions. The rules can be fairly complex, and compound words may need to be treated as either one word or several, and may even occur both ways in the same work: is "every body" the same as "everybody"?

So what is being discussed here may have several components. One is to tokenize all the text into a set of categories. Another is to count them. Yet another might analyze and combine multiple categories, or look at words in context to determine whether two uses of the same word differ enough to be kept apart in two categories. Is "polish" the same as "Polish"?

Once that is decided, you have a fairly simple exercise in storing the data in a searchable data structure and running your searches to get subsets, counts and so on.

As mentioned, the native string format in Python is Unicode, and ASCII files being read in may well be Unicode internally unless you carefully ask otherwise. The conversion from ASCII to Unicode is trivial.

As for how well regular expressions like \w work in general, I have no idea. I can be very sure they are far more costly than the simpler ones you can write that know just enough about what English words in ASCII look like, and that perhaps get some edge cases wrong.

-----Original Message-----
From: Python-list On Behalf Of Edward Teach via Python-list
Sent: Tuesday, June 4, 2024 7:22 AM
To: python-list@python.org
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once

On Mon, 03 Jun 2024 14:58:26 -0400 (EDT) Grant Edwards wrote:
> On 2024-06-03, Edward Teach via Python-list wrote:
>
> > The Gutenberg Project publishes "plain text". That's another
> > problem, because "plain text" means UTF-8...and that means
> > unicode...and that means running some sort of unicode-to-ascii
> > conversion in order to get something like "words". A couple of
> > hours...a couple of hundred lines of C...problem solved!
>
> I'm curious. Why does it need to be converted from Unicode to ASCII?
>
> When you read it into Python, it gets converted right back to
> Unicode...

Well...when using the file linux.words as a useful master list of "words"...linux.words is strict ASCII
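The tokenizing decisions raised in this message (apostrophes as in d'eau, compound words, "polish" versus "Polish") can be made explicit in code. The following is only a sketch under one assumed definition of a word: a run of letters, optionally joined by apostrophes or hyphens. The names `WORD_RE` and `count_words` are hypothetical, and other definitions are equally defensible.

```python
import re
from collections import Counter

# Assumption: letters only, with internal apostrophes (' or U+2019) or
# hyphens allowed, so d'eau, can't and far-fetched each count as one word.
# [^\W\d_] means "a word character that is neither a digit nor underscore".
WORD_RE = re.compile(r"[^\W\d_]+(?:['\u2019-][^\W\d_]+)*")

def count_words(text, fold_case=True):
    """Count words; fold_case decides whether Polish and polish are merged."""
    words = WORD_RE.findall(text)
    if fold_case:
        words = [w.casefold() for w in words]
    return Counter(words)

print(count_words("Polish the polish"))
print(count_words("Polish the polish", fold_case=False))
```

Whether case should be folded, and whether "every body" should be merged with "everybody", remain decisions for whoever defines the task; the code only records the choice.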
Fwd: IDLE: clearing the screen
Welcome to Python! A great language for program development.

Answers might be platform-dependent (are you using Windows, Linux, etc.?). However, the following works for me on Windows. You can put it in the startup.py file so you don't have to type it every time you start up IDLE.

    import os
    def cls():
        x=os.system("cls")

Now whenever you type

    cls()

it will clear the screen and show the prompt at the top of the screen.

(The reason for the "x=" is: os.system returns a result, in this case 0. When you evaluate an expression in the IDE, the IDE prints the result. So without the "x=" you get an extra line at the top of the screen containing "0".)

I am sure that some jiggery-pokery could be used so you don't have to type the "()". But that's more advanced ...

Best wishes
Rob Cliffe

On 04/06/2024 14:34, Cave Man via Python-list wrote:
> Hello everyone,
>
> I am new to Python, and I have been using IDLE (v3.10.11) to run
> small Python code. However, I have seen that the output scrolls to
> the bottom in the output window.
>
> Is there a way to clear the output window (something like cls in
> command prompt or clear in terminal), so that output stays at the top?
>
> Thanks in anticipation!
Re: From JoyceUlysses.txt -- words occurring exactly once
On Wed, 5 Jun 2024 at 02:49, Edward Teach via Python-list wrote:
>
> On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
> Grant Edwards wrote:
>
> > On 2024-06-03, Edward Teach via Python-list wrote:
> >
> > > The Gutenberg Project publishes "plain text". That's another
> > > problem, because "plain text" means UTF-8...and that means
> > > unicode...and that means running some sort of unicode-to-ascii
> > > conversion in order to get something like "words". A couple of
> > > hours...a couple of hundred lines of C...problem solved!
> >
> > I'm curious. Why does it need to be converted from Unicode to ASCII?
> >
> > When you read it into Python, it gets converted right back to
> > Unicode...
>
> Well...when using the file linux.words as a useful master list of
> "words"...linux.words is strict ASCII
>

Whatever gave you that idea? I have a large number of dictionaries in /usr/share/dict, all of them encoded UTF-8 except one (and I don't know why that is). Even the English ones aren't entirely ASCII.

There is no need to "convert from Unicode to ASCII", which makes no sense.

ChrisA
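Whether a given word list is pure ASCII can be checked directly rather than assumed. This is a small sketch: the function name `non_ascii_entries` is hypothetical, and the path in the comment is an assumption, since word-list names and locations differ between systems.

```python
def non_ascii_entries(lines):
    """Return the entries containing at least one non-ASCII character."""
    stripped = (line.strip() for line in lines)
    return [word for word in stripped if word and not word.isascii()]

# Assumed location; adjust for your system before uncommenting:
# with open("/usr/share/dict/words", encoding="utf-8") as f:
#     print(non_ascii_entries(f)[:10])

# In-memory demonstration with made-up entries:
print(non_ascii_entries(["apple\n", "caf\u00e9\n", "na\u00efve\n", "zebra\n"]))
```

An empty result means every entry in that particular file is ASCII; it says nothing about other dictionaries on the same machine.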
Re: Fwd: IDLE: clearing the screen
On 04Jun2024 22:43, Rob Cliffe wrote:
> import os
> def cls():
>     x=os.system("cls")
>
> Now whenever you type cls() it will clear the screen and show the
> prompt at the top of the screen.
> (The reason for the "x=" is: os.system returns a result, in this case
> 0. When you evaluate an expression in the IDE, the IDE prints the
> result. So without the "x=" you get an extra line at the top of the
> screen containing "0".)

Not if it's in a function, because IDLE prints the result if it isn't None, and your function returns None. So:

    def cls():
        os.system("cls")

should be just fine.
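Combining the two replies, a platform-aware variant might look like the sketch below. The helper `clear_command` is a hypothetical name, and the choice of "clear" for non-Windows systems is an assumption. It clears a terminal or console; it has not been verified here against the IDLE shell window itself.

```python
import os

def clear_command():
    """Pick the shell command that clears the screen on this platform."""
    return "cls" if os.name == "nt" else "clear"

def cls():
    # No return statement, so cls() returns None and the interactive
    # prompt echoes nothing after the call; no "x=" is needed.
    os.system(clear_command())
```

Typing `cls()` at an interactive prompt then runs the platform's clear command without printing the exit status.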
Re: From JoyceUlysses.txt -- words occurring exactly once
On 31/05/24 14:26, HenHanna via Python-list wrote:
> On 5/30/2024 2:18 PM, dn wrote:
>> On 31/05/24 08:03, HenHanna via Python-list wrote:
>>> Given a text file of a novel (JoyceUlysses.txt) ...
>>>
>>> could someone give me a pretty fast (and simple) Python program
>>> that'd give me a list of all words occurring exactly once?
>>>
>>> -- Also, a list of words occurring once, twice or 3 times
>>>
>>> re: hyphenated words (you can treat it anyway you like)
>>> but ideally, i'd treat [editor-in-chief] [go-ahead] [pen-knife]
>>> [know-how] [far-fetched] ... as one unit.
>>
>> Split into words - defined as you will.
>> Use Counter.
>>
>> Show some (of your) code and we'll be happy to critique...
>
> hard to decide what to do with hyphens and apostrophes
> (I'd, he's, can't, haven't, A's and B's)
>
> 2-step-Process
> 1. make a file listing all words (one word per line)
> 2. then, doing the counting. using
>    from collections import Counter

Apologies for lateness - only just able to come back to this.

This issue is not Python, and is not solved by code!

If you/your teacher can't define a "word", the code, any code, will almost-certainly be wrong!

One of the interesting aspects of our work is that we can write all manner of tests to try to ensure that the code is correct: unit tests, integration tests, system tests, acceptance tests, eye-tests, ...

However, there is no such thing as a test (or proof) that statements of requirements are complete or correct! (nor for any other previous stages of the full project life-cycle)

As coders we need to learn to require clear specifications and not attempt to read-between-the-lines, use our initiative, or otherwise 'not bother the ...'. When there is ambiguity, we should go back to the user/client/boss and seek clarification. They are the domain/subject-matter experts...

I'm reminded of a cartoon, possibly from some IBM source, first seen in black-and-white but here in living-color: https://www.monolithic.org/blogs/presidents-sphere/what-the-customer-really-wants

That has been the sad history of programming and dev.projects - wherein we are blamed for every short-coming, because no-one else understands the nuances of development projects.

If we don't insist on clarity, are we our own worst enemy?

--
Regards,
=dn
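Once a definition of "word" has been agreed, the counting step quoted above ("Use Counter") is short. The sketch below deliberately uses a naive tokenizer (lower-case, split on whitespace) as a stand-in for whatever definition is chosen; `simple_split` and `words_with_counts` are hypothetical names, and the sample text is invented.

```python
from collections import Counter

def simple_split(text):
    """Placeholder tokenizer: lower-case, then split on whitespace only."""
    return text.lower().split()

def words_with_counts(words, wanted=(1,)):
    """Return the sorted words whose frequency is one of `wanted`."""
    counts = Counter(words)
    return sorted(w for w, n in counts.items() if n in wanted)

text = "the cat saw the dog and the dog saw a bird"
words = simple_split(text)
print(words_with_counts(words))             # words occurring exactly once
print(words_with_counts(words, (1, 2, 3)))  # once, twice or three times
```

Swapping `simple_split` for a tokenizer that handles hyphens and apostrophes changes the results without touching the counting code, which is where the specification question in this message actually bites.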