Re: From JoyceUlysses.txt -- words occurring exactly once

2024-06-01 Thread Peter J. Holzer via Python-list
On 2024-05-30 19:26:37 -0700, HenHanna via Python-list wrote:
> hard to decide what to do with hyphens
> and apostrophes
>  (I'd,  he's,  can't, haven't,  A's  and  B's)

Especially since the same character is used as both an apostrophe and a
closing quotation mark. And while that's pretty unambiguous between two
characters it isn't at the end of a word:

This is Alex’ house.
This type of building is called an ‘Alex’ house.
The sentence ‘We are meeting at Alex’ house’ contains an apostrophe.

(using proper Unicode quotation marks. It gets worse if you stick to
ASCII.)

Personally I like to use U+0027 APOSTROPHE as an apostrophe and U+2018
LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK as
single quotation marks[1], but despite the suggestive names, this is not
the common typographical convention, so your texts are unlikely to make
this distinction.

hp

[1] Which I use rarely, anyway.
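
One way to act on the observation that the character is fairly
unambiguous between two letters is to accept ' or ’ as part of a word
only when a word character sits on both sides. The following is a
minimal sketch under that assumption; WORD and tokenize are hypothetical
names, not something from the thread. It deliberately drops the
word-final mark in cases like Alex’, which is exactly the ambiguous
situation described above.

```python
import re

# A word is a run of word characters, optionally continued by an
# apostrophe (ASCII ' or U+2019 ’) that is immediately followed by
# more word characters. A trailing ’ is never included.
WORD = re.compile(r"\w+(?:['’]\w+)*")


def tokenize(text):
    """Return the words in text, keeping word-internal apostrophes."""
    return WORD.findall(text)


print(tokenize("I can’t find Alex’ house"))
print(tokenize("an ‘Alex’ house"))
```

Whether the lost final apostrophe matters depends on the corpus; a
possessive like Alex’ and the quoted name ‘Alex’ both come out as Alex.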

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Lprint = ( Lisp-style printing ( of lists and strings (etc.) ) in Python )

2024-06-01 Thread Peter J. Holzer via Python-list
On 2024-05-30 21:47:14 -0700, HenHanna via Python-list wrote:
> [('the', 36225), ('and', 17551), ('of', 16759), ('i', 16696), ('a', 15816),
> ('to', 15722), ('that', 11252), ('in', 10743), ('it', 10687)]
> 
> ((the 36225) (and 17551) (of 16759) (i 16696) (a 15816) (to 15722) (that
> 11252) (in 10743) (it 10687))
> 
> 
> i think the latter is easier-to-read, so i use this code
> (by Peter Norvig)

This doesn't work well if your strings contain spaces:

    Lprint(
        [
            ["Just", "three", "words"],
            ["Just", "three words"],
            ["Just three", "words"],
            ["Just three words"],
        ]
    )

prints:

((Just three words) (Just three words) (Just three words) (Just three words))

Output is often a compromise between readability and precision.


> def lispstr(exp):
>     # "Convert a Python object back into a Lisp-readable string."
>     if isinstance(exp, list):

This won't work for your example, since you have a list of tuples, not a
list of lists and a tuple is not an instance of a list.

>         return '(' + ' '.join(map(lispstr, exp)) + ')'
>     else:
>         return str(exp)
> 
> def Lprint(x): print(lispstr(x))
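
Both points raised about the quoted code (tuples are not lists, and
strings containing spaces become ambiguous) could be handled along the
following lines. This is only a sketch: lispstr_quoted is a hypothetical
name, and the choice to quote a string only when it contains whitespace,
parentheses or a double quote is an assumption, not Norvig's code.

```python
import re

# Characters that would make a bare string ambiguous in the output.
_NEEDS_QUOTES = re.compile(r'[\s()"]')


def lispstr_quoted(exp):
    """Render nested lists and tuples Lisp-style, quoting strings when needed."""
    if isinstance(exp, (list, tuple)):
        return "(" + " ".join(map(lispstr_quoted, exp)) + ")"
    if isinstance(exp, str) and (exp == "" or _NEEDS_QUOTES.search(exp)):
        # Escape backslashes first, then double quotes, then wrap in quotes.
        escaped = exp.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    return str(exp)


print(lispstr_quoted([("the", 36225), ("and", 17551)]))
print(lispstr_quoted([["Just", "three", "words"], ["Just", "three words"]]))
```

Plain words still print without quotes, so the word-frequency list keeps
its compact form, while "three words" stays distinguishable from two
separate strings.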

I like to use pprint, but it's lacking support for user-defined types. I
should be able to add a method (maybe __pprint__?) to my classes that
handles proper formatting (with line breaks and indentation).

hp


Re: From JoyceUlysses.txt -- words occurring exactly once

2024-06-01 Thread Thomas Passin via Python-list

On 6/1/2024 4:04 AM, Peter J. Holzer via Python-list wrote:
> On 2024-05-30 19:26:37 -0700, HenHanna via Python-list wrote:
> > hard to decide what to do with hyphens
> > and apostrophes
> >   (I'd,  he's,  can't, haven't,  A's  and  B's)
>
> Especially since the same character is used as both an apostrophe and a
> closing quotation mark. And while that's pretty unambiguous between two
> characters it isn't at the end of a word:
>
> This is Alex’ house.
> This type of building is called an ‘Alex’ house.
> The sentence ‘We are meeting at Alex’ house’ contains an apostrophe.
>
> (using proper Unicode quotation marks. It gets worse if you stick to
> ASCII.)
>
> Personally I like to use U+0027 APOSTROPHE as an apostrophe and U+2018
> LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK as
> single quotation marks[1], but despite the suggestive names, this is not
> the common typographical convention, so your texts are unlikely to make
> this distinction.
>
> hp
>
> [1] Which I use rarely, anyway.

My usual approach is to replace punctuation by spaces and then to 
discard anything remaining that is only one character long (or sometimes 
two, depending on what I'm working on).  Yes, OK, I will miss words like 
"I". Usually I don't care about them. Make exceptions to the policy if 
you like.
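
The approach described above might look roughly like this. It is a
sketch only: words_once and min_len are hypothetical names, lowercasing
is an added assumption, and the four typographic quote characters are
appended because string.punctuation covers ASCII punctuation only.

```python
import string
from collections import Counter

# ASCII punctuation plus the typographic single and double quotes.
PUNCTUATION = string.punctuation + "‘’“”"
# Translation table that maps every punctuation character to a space.
TO_SPACES = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))


def words_once(text, min_len=2):
    """Return the sorted words of at least min_len characters that occur once."""
    tokens = text.lower().translate(TO_SPACES).split()
    counts = Counter(tok for tok in tokens if len(tok) >= min_len)
    return sorted(word for word, n in counts.items() if n == 1)


print(words_once("I can't stop; can't stop now. A's and B's"))
```

Fragments such as the "t" of "can't" and the "s" of "A's" are discarded
by the length filter, and so is "I", as noted above.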




Re: From JoyceUlysses.txt -- words occurring exactly once

2024-06-01 Thread Mats Wichmann via Python-list

hmmm, I "sent" this but there was some problem and it remained unsent.
Just in case it hasn't All Been Said Already, here's the retry:

On 5/31/24 11:59, Dieter Maurer via Python-list wrote:
> HenHanna wrote at 2024-5-30 13:03 -0700:
> > Given a text file of a novel (JoyceUlysses.txt) ...
> >
> > could someone give me a pretty fast (and simple) Python program that'd
> > give me a list of all words occurring exactly once?
>
> Your task can be split into several subtasks:
>   * parse the text into words
>
> This depends on your notion of "word".
> In the simplest case, a word is any maximal sequence of non-whitespace
> characters. In this case, you can use `split` for this task


This piece is by far "the hard part", because of the ambiguity. For
example, if I just say non-whitespace, then words followed by
punctuation count as distinct words. What about hyphenation - there are
both the compound-word forms and the ones at the end of lines if the
source text has been formatted that way.  Are all-lowercase words
different from the same word starting with a capital?  What about
non-initial capitals, as happens a fair bit in modern usage with
acronyms, trademarks (perhaps not in Ulysses? :-) ), etc.? What about
accented letters?
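
To make the punctuation and capitalization problem concrete, here is a
small sketch comparing a bare split with a lightly normalized version.
The sample sentence and the names naive and normalized are made up for
illustration, and string.punctuation only covers ASCII punctuation.

```python
import string
from collections import Counter

sample = "Stately, plump Buck Mulligan. Stately? stately!"

# Plain whitespace splitting: attached punctuation and case make
# 'Stately,', 'Stately?' and 'stately!' three different keys.
naive = Counter(sample.split())

# Strip punctuation from both ends of each token and lowercase it,
# so the three variants collapse into the single key 'stately'.
normalized = Counter(
    token.strip(string.punctuation).lower() for token in sample.split()
)

print(naive)
print(normalized)
```

Hyphenated and accented words are untouched by this; they need a
separate decision.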


If you want what's at least a quick starting point to play with, you 
could use a very simple regex - a fair amount of thought has gone into 
what a "word character" is (\w), so it deals with excluding both 
punctuation and whitespace.


import re
from collections import Counter

with open("JoyceUlysses.txt", "r") as f:
    wordcount = Counter(re.findall(r'\w+', f.read().lower()))

Now you have a Counter object counting all the "words" with their 
occurrence counts (by this definition) in the document. You can fish 
through that to answer the questions asked (find entries with a count of 
1, 2, 3, etc.)
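
Fishing through the Counter could be done along these lines. This is a
sketch on a short made-up sentence rather than the novel, and
words_with_count is a hypothetical helper name.

```python
import re
from collections import Counter


def words_with_count(wordcount, n=1):
    """Return, sorted, the words that occur exactly n times."""
    return sorted(word for word, count in wordcount.items() if count == n)


sample = "the cat saw the dog and the dog saw a bird"
wordcount = Counter(re.findall(r"\w+", sample.lower()))

print(words_with_count(wordcount))      # words occurring exactly once
print(words_with_count(wordcount, 2))   # words occurring exactly twice
```

The same helper answers the original question when wordcount is built
from the full text.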


Some people Go Big and use something that actually tries to recognize
the language, as opposed to making assumptions from ranges of
characters.  nltk is a choice there.  But at this point it's not really
"simple" any longer (though nltk experts might end up disagreeing with
that).


