Re: From JoyceUlysses.txt -- words occurring exactly once

2024-06-04 Thread Dieter Maurer via Python-list
Edward Teach wrote at 2024-6-3 10:47 +0100:
> ...
>The Gutenberg Project publishes "plain text".  That's another problem,
>because "plain text" means UTF-8 ... and that means unicode ... and that
>means running some sort of unicode-to-ascii conversion in order to get
>something like "words".  A couple of hours ... a couple of hundred lines
>of C ... problem solved!

Unicode supports the notion of a "word" even better than ASCII does.
For example, the `\w` (word character) regular expression wild card
works for Unicode just as it does for ASCII (of course with enlarged
sets of letters, digits, punctuation, etc.)
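
A minimal sketch of that point, using Python's `re` module. The sample
sentence is made up for illustration; `re.ASCII` is the standard flag that
restricts `\w` to `[a-zA-Z0-9_]`.

```python
import re

# Made-up sample text; \u00e9 is "e with acute", \u00ef is "i with diaeresis".
text = "Stephen sipped his caf\u00e9 au lait na\u00efvely."

# For str patterns, \w matches Unicode word characters by default.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so accented words get split.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

With the default behaviour the accented words stay whole; with `re.ASCII`
they are broken at the accented letter.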
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: From JoyceUlysses.txt -- words occurring exactly once

2024-06-04 Thread Edward Teach via Python-list
On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
Grant Edwards wrote:

> On 2024-06-03, Edward Teach via Python-list wrote:
> 
> > The Gutenberg Project publishes "plain text".  That's another
> > problem, because "plain text" means UTF-8 ... and that means
> > unicode ... and that means running some sort of unicode-to-ascii
> > conversion in order to get something like "words".  A couple of
> > hours ... a couple of hundred lines of C ... problem solved!
> 
> I'm curious.  Why does it need to be converted from Unicode to ASCII?
> 
> When you read it into Python, it gets converted right back to
> Unicode...
> 
> 
> 

Well ... when using the file linux.words as a useful master list of
"words" ... linux.words is strict ASCII

-- 
https://mail.python.org/mailman/listinfo/python-list


IDLE: clearing the screen

2024-06-04 Thread Cave Man via Python-list

Hello everyone,

I am new to Python, and I have been using IDLE (v3.10.11) to run small 
Python code. However, I have seen that the output scrolls to the bottom 
in the output window.


Is there a way to clear the output window (something like cls in command 
prompt or clear in terminal), so that output stays at the top?



Thanks in anticipation!
--
https://mail.python.org/mailman/listinfo/python-list


Re: From JoyceUlysses.txt -- words occurring exactly once

2024-06-04 Thread Grant Edwards via Python-list
On 2024-06-04, Edward Teach via Python-list wrote:
> On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
> Grant Edwards wrote:
>
>> On 2024-06-03, Edward Teach via Python-list wrote:
>> 
>> > The Gutenberg Project publishes "plain text".  That's another
>> > problem, because "plain text" means UTF-8 ... and that means
>> > unicode ... and that means running some sort of unicode-to-ascii
>> > conversion in order to get something like "words".  A couple of
>> > hours ... a couple of hundred lines of C ... problem solved!
>> 
>> I'm curious.  Why does it need to be converted from Unicode to ASCII?
>> 
>> When you read it into Python, it gets converted right back to
>> Unicode...

> Well ... when using the file linux.words as a useful master list of
> "words" ... linux.words is strict ASCII

I guess I missed the part of the problem description where it said to
use linux.words to decide what a word is. :)

--
Grant


-- 
https://mail.python.org/mailman/listinfo/python-list


RE: From JoyceUlysses.txt -- words occurring exactly once

2024-06-04 Thread AVI GROSS via Python-list
>> Well ... when using the file linux.words as a useful master list of
>> "words" ... linux.words is strict ASCII

The meaning of "words" depends on the context. The contents of the file
mentioned are a modest attempt to capture a common subset of English words,
but they are probably not what you mean by words in other contexts. They
may omit words that are also plain ASCII, such as names (especially
uncommon ones) or words like UNESCO. There are other curated word lists,
such as valid Scrabble words or Wordle words, built for specialized purposes
and excluding words of lengths that cannot be used. The person looking to
count words in a work must determine what words make sense for their purpose.

ASCII is a small subset of Unicode. So when using a concept of word that
includes many characters from many character sets, and in many languages,
things may not be easy to parse unambiguously, such as words containing
an apostrophe near the start, as in d'eau. Words can flow in different
directions. There can be fairly complex rules, and compound words may need
to be treated as either one word or several, and may even occur both ways
in the same work. Is "every body" the same as "everybody"?

So what is being discussed here may have several components. One is to
tokenize all the text to make a set of categories. Another is to count them.
Perhaps another might even analyze and combine multiple categories, or
look at words in context to determine whether two uses of the same word are
different enough to keep apart in two categories. Is "polish" the
same as "Polish"?

Once that is decided, you have a fairly simple exercise in storing the data
in a searchable data structure and doing your searches to get subsets and
counts and so on.
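
As a rough sketch of that last step, assuming the simplest possible
definition of a word (a run of `\w` characters, compared case-insensitively).
The function name `words_by_count` and the sample sentence are made up for
illustration:

```python
from collections import Counter
import re

def words_by_count(text, lowest=1, highest=1):
    """Return sorted words whose frequency lies in [lowest, highest]."""
    # Assumption: a "word" is a run of \w characters, lowercased.
    counts = Counter(re.findall(r"\w+", text.lower()))
    return sorted(w for w, n in counts.items() if lowest <= n <= highest)

sample = "the cat saw the dog and the dog saw a bird"
print(words_by_count(sample))        # words occurring exactly once
print(words_by_count(sample, 1, 3))  # words occurring once, twice or 3 times
```

For the novel itself, the text would presumably come from something like
open("JoyceUlysses.txt", encoding="utf-8").read() in place of the sample.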

As mentioned, the default native format in Python is Unicode, and ASCII files
being read in may well be Unicode internally unless you carefully ask
otherwise. The conversion from ASCII to Unicode is trivial.
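
A small illustration of why it is trivial, using a made-up byte string:
every ASCII byte sequence is also valid UTF-8 and decodes to the same `str`.

```python
data = b"Stately, plump Buck Mulligan"  # pure ASCII bytes (made-up sample)

# Decoding as ASCII or as UTF-8 gives the identical Python str.
assert data.decode("ascii") == data.decode("utf-8")

as_text = data.decode("ascii")
print(type(as_text).__name__, as_text.isascii())
```

Passing encoding="ascii" to open() is one way to "carefully ask otherwise":
it makes the read fail on any non-ASCII byte instead of silently accepting it.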

As for how well regular expressions like \w work in general, I have no
idea. I would expect them to be more costly than the simpler ones you
can write that just know enough about what English words in ASCII look like,
at the price of perhaps getting some edge cases wrong.


-----Original Message-----
From: Python-list On Behalf Of Edward Teach via Python-list
Sent: Tuesday, June 4, 2024 7:22 AM
To: python-list@python.org
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once

On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
Grant Edwards wrote:

> On 2024-06-03, Edward Teach via Python-list wrote:
> 
> > The Gutenberg Project publishes "plain text".  That's another
> > problem, because "plain text" means UTF-8 ... and that means
> > unicode ... and that means running some sort of unicode-to-ascii
> > conversion in order to get something like "words".  A couple of
> > hours ... a couple of hundred lines of C ... problem solved!
> 
> I'm curious.  Why does it need to be converted from Unicode to ASCII?
> 
> When you read it into Python, it gets converted right back to
> Unicode...
> 
> 
> 

Well ... when using the file linux.words as a useful master list of
"words" ... linux.words is strict ASCII

-- 
https://mail.python.org/mailman/listinfo/python-list



Fwd: IDLE: clearing the screen

2024-06-04 Thread Rob Cliffe via Python-list

Welcome to Python!  A great language for program development.

Answers might be platform-dependent (are you using Windows, Linux, etc.?).
However, the following works for me on Windows.  You can put it in the 
startup.py file so you don't have to type it every time you start up 
IDLE.


import os
def cls(): x=os.system("cls")

Now whenever you type
cls()
it will clear the screen and show the prompt at the top of the screen.

(The reason for the "x=" is: os.system returns a result, in this case 
0.  When you evaluate an expression in the IDE, the IDE prints the 
result.  So without the "x=" you get an extra line at the top of the 
screen containing "0".)


I am sure that some jiggery-pokery could be used so you don't have to 
type the "()".  But that's more advanced ...
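
One possible form of that jiggery-pokery, as a sketch only: the interactive
prompt echoes repr() of whatever expression you type, so an object whose
__repr__ runs the clearing action is triggered by its bare name. The class
name ClearOnRepr is made up for illustration.

```python
import os

class ClearOnRepr:
    """Object whose repr() runs an action; typing its name at the prompt triggers it."""

    def __init__(self, action):
        self._action = action

    def __repr__(self):
        self._action()
        return ""  # the prompt still echoes this, so one blank line remains

# Hypothetical name "cls"; the command is "cls" on Windows, "clear" elsewhere.
cls = ClearOnRepr(lambda: os.system("cls" if os.name == "nt" else "clear"))
```

After this, typing
cls
on its own at the prompt should be enough.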


Best wishes
Rob Cliffe


On 04/06/2024 14:34, Cave Man via Python-list wrote:

Hello everyone,

I am new to Python, and I have been using IDLE (v3.10.11) to run 
small Python code. However, I have seen that the output scrolls to the 
bottom in the output window.


Is there a way to clear the output window (something like cls in 
command prompt or clear in terminal), so that output stays at the top?



Thanks in anticipation!


--
https://mail.python.org/mailman/listinfo/python-list


Re: From JoyceUlysses.txt -- words occurring exactly once

2024-06-04 Thread Chris Angelico via Python-list
On Wed, 5 Jun 2024 at 02:49, Edward Teach via Python-list wrote:
>
> On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
> Grant Edwards wrote:
>
> > On 2024-06-03, Edward Teach via Python-list wrote:
> >
> > > The Gutenberg Project publishes "plain text".  That's another
> > > problem, because "plain text" means UTF-8 ... and that means
> > > unicode ... and that means running some sort of unicode-to-ascii
> > > conversion in order to get something like "words".  A couple of
> > > hours ... a couple of hundred lines of C ... problem solved!
> >
> > I'm curious.  Why does it need to be converted from Unicode to ASCII?
> >
> > When you read it into Python, it gets converted right back to
> > Unicode...
> >
>
> Well ... when using the file linux.words as a useful master list of
> "words" ... linux.words is strict ASCII
>

Whatever gave you that idea? I have a large number of dictionaries in
/usr/share/dict, all of them encoded UTF-8 except one (and I don't
know why that is). Even the English ones aren't entirely ASCII.

There is no need to "convert from Unicode to ASCII", which makes no sense.
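
If you want to check a particular word list yourself, here is a quick
sketch. The function name non_ascii_words and the sample list are made up;
the sample stands in for the lines of a real dictionary file.

```python
def non_ascii_words(lines):
    """Return the words (one per line) containing any non-ASCII character."""
    return [w for w in (line.strip() for line in lines) if w and not w.isascii()]

# Made-up sample standing in for a word-list file.
sample = ["apple\n", "caf\u00e9\n", "\u00e9clair\n", "zebra\n"]
print(non_ascii_words(sample))
```

To run it on a real list, pass an open file object instead of the sample,
opened with encoding="utf-8" (the exact file name under /usr/share/dict
varies by system, so that part is an assumption).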

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fwd: IDLE: clearing the screen

2024-06-04 Thread Cameron Simpson via Python-list

On 04Jun2024 22:43, Rob Cliffe wrote:

import os
def cls(): x=os.system("cls")

Now whenever you type
cls()
it will clear the screen and show the prompt at the top of the screen.

(The reason for the "x=" is: os.system returns a result, in this case 
0.  When you evaluate an expression in the IDE, the IDE prints the 
result.  So without the "x=" you get an extra line at the top of the 
screen containing "0".)


Not if it's in a function, because IDLE prints the result only if it 
isn't None, and your function returns None. So:


def cls():
    os.system("cls")

should be just fine.
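
A slightly more portable sketch of the same idea, for prompts running in a
console or terminal that honours these commands. The helper name
clear_command is made up for illustration.

```python
import os

def clear_command(os_name=os.name):
    """Pick the shell command that clears a terminal for the given os.name."""
    return "cls" if os_name == "nt" else "clear"

def cls():
    # No explicit return, so the function returns None and nothing is echoed.
    os.system(clear_command())
```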
--
https://mail.python.org/mailman/listinfo/python-list


Re: From JoyceUlysses.txt -- words occurring exactly once

2024-06-04 Thread dn via Python-list

On 31/05/24 14:26, HenHanna via Python-list wrote:

On 5/30/2024 2:18 PM, dn wrote:

On 31/05/24 08:03, HenHanna via Python-list wrote:


Given a text file of a novel (JoyceUlysses.txt) ...

could someone give me a pretty fast (and simple) Python program 
that'd give me a list of all words occurring exactly once?


   -- Also, a list of words occurring once, twice or 3 times



re: hyphenated words    (you can treat it any way you like)

    but ideally, i'd treat  [editor-in-chief]
    [go-ahead]  [pen-knife]
    [know-how]  [far-fetched] ...
    as one unit.





Split into words - defined as you will.
Use Counter.

Show some (of your) code and we'll be happy to critique...



hard to decide what to do with hyphens
    and apostrophes
  (I'd,  he's,  can't, haven't,  A's  and  B's)


2-step-Process

   1. make a file listing all words (one word per line)

   2.  then, doing the counting.  using
   from collections import Counter



Apologies for lateness - only just able to come back to this.

This issue is not about Python, and is not solved by code!

If you/your teacher can't define a "word", the code, any code, will 
almost certainly be wrong!
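
To make that concrete, here is a small sketch with two plausible
definitions of "word" applied to the same made-up sentence. Neither is
claimed to be the right one; the point is only that they disagree.

```python
import re

text = "The editor-in-chief can't find his pen-knife."

# Definition A: a word is any run of \w characters.
plain = re.findall(r"\w+", text)

# Definition B: runs joined by internal hyphens or apostrophes stay together.
# Assumption: only the straight apostrophe (') is handled, not typographic ones.
joined = re.findall(r"\w+(?:[-']\w+)*", text)

print(len(plain), plain)
print(len(joined), joined)
```

Definition A splits "editor-in-chief" into three words and "can't" into
two; definition B keeps each as one unit, so every count downstream differs.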



One of the interesting aspects of our work is that we can write all 
manner of tests to try to ensure that the code is correct: unit tests, 
integration tests, system tests, acceptance tests, eye-tests, ...


However, there is no such thing as a test (or proof) that statements of 
requirements are complete or correct!

(nor for any other previous stages of the full project life-cycle)

As coders we need to learn to require clear specifications and not 
attempt to read-between-the-lines, use our initiative, or otherwise 'not 
bother the ...'. When there is ambiguity, we should go back to the 
user/client/boss and seek clarification. They are the 
domain/subject-matter experts...


I'm reminded of a cartoon, possibly from some IBM source, first seen in 
black-and-white but here in living-color: 
https://www.monolithic.org/blogs/presidents-sphere/what-the-customer-really-wants


That has been the sad history of programming and dev.projects - wherein 
we are blamed for every short-coming, because no-one else understands 
the nuances of development projects.


If we don't insist on clarity, are we our own worst enemy?


--
Regards,
=dn
--
https://mail.python.org/mailman/listinfo/python-list