Re: From JoyceUlysses.txt -- words occurring exactly once
On 2024-05-30 19:26:37 -0700, HenHanna via Python-list wrote:
> hard to decide what to do with hyphens
> and apostrophes
> (I'd, he's, can't, haven't, A's and B's)

Especially since the same character is used as both an apostrophe and a
closing quotation mark. And while that's pretty unambiguous between two
characters, it isn't at the end of a word:

    This is Alex’ house.
    This type of building is called an ‘Alex’ house.
    The sentence ‘We are meeting at Alex’ house’ contains an apostrophe.

(Using proper Unicode quotation marks. It gets worse if you stick to
ASCII.)

Personally I like to use U+0027 APOSTROPHE as an apostrophe and U+2018
LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK as
single quotation marks[1], but despite the suggestive names, this is not
the common typographical convention, so your texts are unlikely to make
this distinction.

        hp

[1] Which I use rarely, anyway.

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
-- 
https://mail.python.org/mailman/listinfo/python-list
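A minimal sketch of one way to act on this, under an assumed heuristic that is not from the post itself: treat U+0027 or U+2019 as an apostrophe only when it sits between two letters, and drop it everywhere else. The name `words_with_inner_apostrophes` is hypothetical. As the examples above show, the word-final case ("Alex’ house") cannot be decided this way, so the possessive mark is simply lost there.

```python
import re

# Assumed heuristic: an apostrophe-like character (U+0027 or U+2019) is part
# of a word only when it has a letter on both sides (can't, he’s).
# [^\W\d_] matches any Unicode letter.
WORD_RE = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")

def words_with_inner_apostrophes(text):
    """Return lowercase tokens, keeping only word-internal apostrophes."""
    return [m.group(0).lower() for m in WORD_RE.finditer(text)]

print(words_with_inner_apostrophes("I can’t find ‘Alex’ or Alex’ house."))
# ['i', 'can’t', 'find', 'alex', 'or', 'alex', 'house']
```

Both the quoted ‘Alex’ and the possessive Alex’ come out as plain "alex", which is exactly the ambiguity described above.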
Re: Lprint = ( Lisp-style printing ( of lists and strings (etc.) ) in Python )
On 2024-05-30 21:47:14 -0700, HenHanna via Python-list wrote:
> [('the', 36225), ('and', 17551), ('of', 16759), ('i', 16696), ('a', 15816),
> ('to', 15722), ('that', 11252), ('in', 10743), ('it', 10687)]
>
> ((the 36225) (and 17551) (of 16759) (i 16696) (a 15816) (to 15722) (that
> 11252) (in 10743) (it 10687))
>
> i think the latter is easier-to-read, so i use this code
> (by Peter Norvig)

This doesn't work well if your strings contain spaces:

    Lprint(
        [
            ["Just", "three", "words"],
            ["Just", "three words"],
            ["Just three", "words"],
            ["Just three words"],
        ]
    )

prints:

    ((Just three words) (Just three words) (Just three words) (Just three words))

Output is often a compromise between readability and precision.

> def lispstr(exp):
>     # "Convert a Python object back into a Lisp-readable string."
>     if isinstance(exp, list):

This won't work for your example, since you have a list of tuples, not a
list of lists, and a tuple is not an instance of list.

>         return '(' + ' '.join(map(lispstr, exp)) + ')'
>     else:
>         return str(exp)
>
> def Lprint(x): print(lispstr(x))

I like to use pprint, but it's lacking support for user-defined types. I
should be able to add a method (maybe __pprint__?) to my classes which
handles proper formatting (with line breaks and indentation).

        hp
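Both objections can be addressed with a small variation on the quoted code. This is a sketch rather than Norvig's original: it accepts tuples as well as lists, and, as an assumption of this example, wraps strings that are empty or contain whitespace in double quotes via `json.dumps`. Strings containing parentheses or quote characters would still need more care.

```python
import json

def lispstr(exp):
    """Render nested lists/tuples as a Lisp-style string."""
    if isinstance(exp, (list, tuple)):
        return "(" + " ".join(map(lispstr, exp)) + ")"
    if isinstance(exp, str) and (exp == "" or any(c.isspace() for c in exp)):
        # Quote strings that would otherwise blur into their neighbours.
        return json.dumps(exp, ensure_ascii=False)
    return str(exp)

def Lprint(x):
    print(lispstr(x))

Lprint([("the", 36225), ("and", 17551)])
# ((the 36225) (and 17551))
Lprint([["Just", "three", "words"], ["Just", "three words"]])
# ((Just three words) (Just "three words"))
```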
Re: From JoyceUlysses.txt -- words occurring exactly once
On 6/1/2024 4:04 AM, Peter J. Holzer via Python-list wrote:
> On 2024-05-30 19:26:37 -0700, HenHanna via Python-list wrote:
>> hard to decide what to do with hyphens
>> and apostrophes
>> (I'd, he's, can't, haven't, A's and B's)
>
> Especially since the same character is used as both an apostrophe and a
> closing quotation mark. And while that's pretty unambiguous between two
> characters, it isn't at the end of a word:
>
>     This is Alex’ house.
>     This type of building is called an ‘Alex’ house.
>     The sentence ‘We are meeting at Alex’ house’ contains an apostrophe.
>
> (Using proper Unicode quotation marks. It gets worse if you stick to
> ASCII.)
>
> Personally I like to use U+0027 APOSTROPHE as an apostrophe and U+2018
> LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK as
> single quotation marks[1], but despite the suggestive names, this is not
> the common typographical convention, so your texts are unlikely to make
> this distinction.
>
>         hp
>
> [1] Which I use rarely, anyway.

My usual approach is to replace punctuation by spaces and then to discard
anything remaining that is only one character long (or sometimes two,
depending on what I'm working on). Yes, OK, I will miss words like "I".
Usually I don't care about them. Make exceptions to the policy if you
like.
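The approach described above might look roughly like this. It is a sketch under stated assumptions: the function name `rough_words` and its `min_len` parameter are hypothetical, and "punctuation" here means only the ASCII characters in `string.punctuation`, so typographic quotes such as U+2019 would need to be added to the table separately.

```python
import string

# Translation table mapping every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def rough_words(text, min_len=2):
    """Replace punctuation by spaces, then drop tokens shorter than min_len."""
    cleaned = text.translate(_PUNCT_TO_SPACE)
    return [w for w in cleaned.split() if len(w) >= min_len]

print(rough_words("I'd say he's right; A's and B's can't both win."))
# ['say', 'he', 'right', 'and', 'can', 'both', 'win']
```

Note that "can't" is reduced to "can", and "I" disappears entirely, which is the trade-off mentioned above. Passing `min_len=3` would also discard two-character leftovers such as "he".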
Re: From JoyceUlysses.txt -- words occurring exactly once
On 5/31/24 11:59, Dieter Maurer via Python-list wrote:

hmmm, I "sent" this but there was some problem and it remained unsent.
Just in case it hasn't All Been Said Already, here's the retry:

> HenHanna wrote at 2024-5-30 13:03 -0700:
>> Given a text file of a novel (JoyceUlysses.txt) ...
>>
>> could someone give me a pretty fast (and simple) Python program that'd
>> give me a list of all words occurring exactly once?
>
> Your task can be split into several subtasks:
>  * parse the text into words
>
>    This depends on your notion of "word".
>    In the simplest case, a word is any maximal sequence of
>    non-whitespace characters. In this case, you can use `split`
>    for this task

This piece is by far "the hard part", because of the ambiguity. For
example, if I just say non-whitespace, then words followed by
punctuation count as distinct words. What about hyphenation - of which
there are both the compound-word forms and the ones at the end of lines,
if the source text has been formatted that way? Are all-lowercase words
different from the same word starting with a capital? What about
non-initial capitals, as happens a fair bit in modern usage with
acronyms, trademarks (perhaps not in Ulysses? :-) ), etc.? What about
accented letters?

If you want what's at least a quick starting point to play with, you
could use a very simple regex - a fair amount of thought has gone into
what a "word character" is (\w), so it deals with excluding both
punctuation and whitespace.

    import re
    from collections import Counter

    with open("JoyceUlysses.txt", "r") as f:
        wordcount = Counter(re.findall(r'\w+', f.read().lower()))

Now you have a Counter object counting all the "words" with their
occurrence counts (by this definition) in the document. You can fish
through that to answer the questions asked (find entries with a count of
1, 2, 3, etc.)

Some people Go Big and use something that actually tries to recognize
the language, as opposed to making assumptions from ranges of
characters. nltk is a choice there. But at this point it's not really
"simple" any longer (though nltk experts might end up disagreeing with
that).
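"Fishing through" the Counter for the original question (words occurring exactly once) could be sketched as follows. The function name `words_occurring_once` and the sample sentence are made up for illustration; for the real task the text would come from reading JoyceUlysses.txt as in the snippet above.

```python
import re
from collections import Counter

def words_occurring_once(text):
    """Return the sorted words that appear exactly once, using \\w+ as 'word'."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return sorted(word for word, n in counts.items() if n == 1)

sample = "The cat saw the dog; the dog ran."
print(words_occurring_once(sample))
# ['cat', 'ran', 'saw']
```

Changing the `n == 1` test to another number, or using `counts.most_common()`, answers the related "count of 2, 3, etc." questions in the same way.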