Re: Printing UTF-8 mail to terminal

2024-11-05 Thread Cameron Simpson via Python-list
On 04Nov2024 13:02, Loris Bennett wrote:
OK, so I can do:
##
if args.verbose:
    for k in mail.keys():
        print(f"{k}: {mail.get(k)}")
    print('')
    print(mail.get_content())
##

Re: Printing UTF-8 mail to terminal

2024-11-05 Thread Peter J. Holzer via Python-list
On 2024-11-04 13:02:21 +0100, Loris Bennett via Python-list wrote: > "Loris Bennett" writes: > > "Loris Bennett" writes: > >> Cameron Simpson writes: > >>> On 01Nov2024 10:10, Loris Bennett wrote: > >>>>as expected. The n

Re: Printing UTF-8 mail to terminal

2024-11-04 Thread Loris Bennett via Python-list
"Loris Bennett" writes: > "Loris Bennett" writes: > >> Cameron Simpson writes: >> >>> On 01Nov2024 10:10, Loris Bennett wrote: >>>>as expected. The non-UTF-8 text occurs when I do >>>> >>>> mail = EmailMes

Re: Printing UTF-8 mail to terminal

2024-11-04 Thread Loris Bennett via Python-list
"Loris Bennett" writes: > Cameron Simpson writes: > >> On 01Nov2024 10:10, Loris Bennett wrote: >>>as expected. The non-UTF-8 text occurs when I do >>> >>> mail = EmailMessage() >>> mail.set_content(body, cte="quoted-pri

Re: Printing UTF-8 mail to terminal

2024-11-04 Thread Loris Bennett via Python-list
ill create a process that displays a graphical >> > console. The console uses an encoding scheme to represent the text >> > output. I believe that the default on MS Windows is to use some >> > single-byte encoding. This answer from SE family site tells you how to >>

Re: Printing UTF-8 mail to terminal

2024-11-04 Thread Loris Bennett via Python-list
Cameron Simpson writes: > On 01Nov2024 10:10, Loris Bennett wrote: >>as expected. The non-UTF-8 text occurs when I do >> >> mail = EmailMessage() >> mail.set_content(body, cte="quoted-printable") >> ... >> >> if args.verbose: >

Re: Printing UTF-8 mail to terminal

2024-11-02 Thread Inada Naoki via Python-list
console. The console uses an encoding scheme to represent the text > > output. I believe that the default on MS Windows is to use some > > single-byte encoding. This answer from SE family site tells you how to > > set the console encoding to UTF-8 permanently: > > > http

Re: Printing UTF-8 mail to terminal

2024-11-02 Thread Eli the Bearded via Python-list
In comp.lang.python, Gilmeh Serda wrote: > Python 3.12.6 (main, Sep 8 2024, 13:18:56) [GCC 14.2.1 20240805] on linux > Type "help", "copyright", "credits" or "license" for more information. > >>> help('modules') > > Please wait a moment while I gather a list of all available modules... > > Ass

Re: Printing UTF-8 mail to terminal

2024-11-02 Thread Jon Ribbens via Python-list
On 2024-11-01, Eli the Bearded <*@eli.users.panix.com> wrote: > In comp.lang.python, Gilmeh Serda wrote: >> Python 3.12.6 (main, Sep 8 2024, 13:18:56) [GCC 14.2.1 20240805] on linux >> Type "help", "copyright", "credits" or "license" for more information. >> >>> help('modules') >> >> Please wai

Re: Printing UTF-8 mail to terminal

2024-11-02 Thread Barry via Python-list
> On 1 Nov 2024, at 22:57, Left Right wrote: > > Does this Windows Terminal support the use > of programs like tmux? I have not tried, but should work. Best to install the terminal app from the MS app store. Most use I make is to ssh into linux systems and stuff like editors. Colour output a

Re: Printing UTF-8 mail to terminal

2024-11-01 Thread Left Right via Python-list
> Windows does now. They implemented this feature over the last few years. > Indeed they took inspiration from how linux does this. > > You might find https://devblogs.microsoft.com/commandline/ has interesting > articles about this. I don't have MS Windows. My wife does, but I don't want to both

Re: Printing UTF-8 mail to terminal

2024-11-01 Thread Barry via Python-list
nux does this. You might find https://devblogs.microsoft.com/commandline/ has interesting articles about this. They also have implemented utf-8 as code page 65001. Barry -- https://mail.python.org/mailman/listinfo/python-list

Re: Printing UTF-8 mail to terminal

2024-11-01 Thread Cameron Simpson via Python-list
On 31Oct2024 21:53, alan.ga...@yahoo.co.uk wrote: On 31/10/2024 20:50, Cameron Simpson via Python-list wrote: If you're just dealing with this directly, use the `quopri` stdlib module: https://docs.python.org/3/library/quopri.html One of the things I love about this list are these little feat

Re: Printing UTF-8 mail to terminal

2024-11-01 Thread Cameron Simpson via Python-list
On 01Nov2024 10:10, Loris Bennett wrote: as expected. The non-UTF-8 text occurs when I do mail = EmailMessage() mail.set_content(body, cte="quoted-printable") ... if args.verbose: print(mail) which is presumably also correct. The question is: What conversion is necessar

Re: Printing UTF-8 mail to terminal

2024-11-01 Thread Cameron Simpson via Python-list
pproach to me. And you are right that encoding for the actual mail which is received is automatically sorted out. If I display the raw email in my client I get the following: Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable ... Su

Re: Printing UTF-8 mail to terminal

2024-11-01 Thread Dieter Maurer via Python-list
Loris Bennett wrote at 2024-11-1 10:10 +0100: > ... > mail.set_content(body, cte="quoted-printable") In the line above, you request the content to use the "cte" (= "Content-Transfer-Encoding") "quoted-printable" and consequently, the content is encoded with `quoted-printable`. Maybe, you do not n
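
The effect described in this message can be sketched as follows. This is a minimal illustration, assuming a made-up body and subject taken from the sample mail in the thread; the variable names are hypothetical. It shows that serialising the message applies the transfer encoding, while `get_content()` returns the decoded text that is suitable for a UTF-8 terminal.

```python
from email.message import EmailMessage

# Illustrative data only; the real program's body and headers are not shown in the thread.
body = "Sehr geehrter Herr Dr. Bennett,\n\nDies ist eine Übung.\n"

mail = EmailMessage()
mail["Subject"] = "Übung"
mail.set_content(body, cte="quoted-printable")

# Serialising the whole message applies the transfer encoding to the body,
# which is why print(mail) shows "=C3=9Cbung".
wire_form = str(mail)

# get_content() undoes both the transfer encoding and the charset encoding.
decoded_body = mail.get_content()

# Print the headers and the decoded body, as in the approach quoted in the thread.
for name in mail.keys():
    print(f"{name}: {mail.get(name)}")
print()
print(decoded_body)
```

The quoted-printable form is still what gets sent; only the verbose terminal output uses the decoded text.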

Re: Printing UTF-8 mail to terminal

2024-11-01 Thread Loris Bennett via Python-list
=9Cbung. >> >>What do I need to do to prevent the body from getting mangled? > > That looks to me like quoted-printable. This is an encoding for binary > transport of text to make it robust against not 8-bit clean > transports. So your Unicode text is encoded as UTF-8, and

Re: Printing UTF-8 mail to terminal

2024-11-01 Thread Loris Bennett via Python-list
gle-byte encoding. This answer from SE family site tells you how to > set the console encoding to UTF-8 permanently: > https://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8 > , which, I believe, will solve your problem with how the text is > di

Re: Printing UTF-8 mail to terminal

2024-11-01 Thread Loris Bennett via Python-list
bungsbetreff >>> >>> Sehr geehrter Herr Dr. Bennett, >>> >>> Dies ist eine =C3=9Cbung. >>> >>>What do I need to do to prevent the body from getting mangled? >> >> That looks to me like quoted-printable. This is an encoding for bi

Re: Printing UTF-8 mail to terminal

2024-10-31 Thread Alan Gauld via Python-list
On 31/10/2024 20:50, Cameron Simpson via Python-list wrote: > That looks to me like quoted-printable. This is an encoding for binary > transport of text to make it robust against not 8-bit clean ... > If you're just dealing with this directly, use the `quopri` stdlib > module: https://docs.py

Re: Printing UTF-8 mail to terminal

2024-10-31 Thread Cameron Simpson via Python-list
make it robust against not 8-bit clean transports. So your Unicode text is encoded as UTF-8, and then that is encoded in quoted-printable for transport through the email system. Your terminal probably accepts UTF-8 - I imagine other German text renders correctly? You need to get the text
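
The two layers described here (UTF-8 first, then quoted-printable) can be undone in reverse order with the `quopri` module suggested in the thread. A minimal sketch, assuming the sample body line from the original question as input; the variable names are illustrative:

```python
import quopri

# One line of the body as it appears in the serialised mail.
raw = b"Dies ist eine =C3=9Cbung."

# Step 1: undo quoted-printable, which yields the UTF-8 bytes.
utf8_bytes = quopri.decodestring(raw)

# Step 2: decode the UTF-8 bytes into a str that can be printed.
text = utf8_bytes.decode("utf-8")
```

This only handles a bare body; for a full message object the `email` package does the same work itself.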

Re: Printing UTF-8 mail to terminal

2024-10-31 Thread Left Right via Python-list
run, eg. cmd.exe, it will create a process that displays a graphical console. The console uses an encoding scheme to represent the text output. I believe that the default on MS Windows is to use some single-byte encoding. This answer from SE family site tells you how to set the console e

Printing UTF-8 mail to terminal

2024-10-31 Thread Loris Bennett via Python-list
Hi, I have a command-line program which creates an email containing German umlauts. On receiving the mail, my mail client displays the subject and body correctly: Subject: Übung Sehr geehrter Herr Dr. Bennett, Dies ist eine Übung. So far, so good. However, when I use the --verbose opti

Re: What do these '=?utf-8?' sequences mean in python?

2023-05-09 Thread Cameron Simpson
On 08May2023 12:19, jak wrote: In reality you should also take into account the fact that if the header contains a 'b' instead of a 'q' as a penultimate character, then the rest of the package is converted on the basis64 "=?utf-8?Q?" --> "=?utf-

Re: What do these '=?utf-8?' sequences mean in python?

2023-05-08 Thread Dieter Maurer
Chris Green wrote at 2023-5-6 15:58 +0100: >Chris Green wrote: >> I'm having a real hard time trying to do anything to a string (?) >> returned by mailbox.MaildirMessage.get(). >> >What a twit I am :-) > >Strings are immutable, I have to do:- > >newstring = oldstring.replace("_", " ") The sol

Re: What do these '=?utf-8?' sequences mean in python?

2023-05-08 Thread Keith Thompson
ring.replace("_", " ") > > Job done! Not necessarily. The subject in the original article was: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?= That's some kind of MIME encoding. Just replacing underscores by spaces won't necessarily give yo
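
As a rough illustration of the point made here, the subject quoted in this thread can be decoded with the standard library instead of replacing underscores by hand. This is a sketch, not necessarily what any poster used; `raw_subject` and `subject` are illustrative names.

```python
from email.header import decode_header, make_header

raw_subject = (
    "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_"
    "(Waterways_Continental_Europe)?="
)

# decode_header() splits the value into (bytes, charset) parts and undoes
# the Q encoding, including the underscore-for-space convention.
parts = decode_header(raw_subject)

# make_header() reassembles the parts; str() gives the readable text.
subject = str(make_header(parts))
```

Replacing underscores alone would leave the `=C3=A0` and `=C3=B4` escapes and the `=?utf-8?Q?` wrapper in place.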

Re: What do these '=?utf-8?' sequences mean in python?

2023-05-08 Thread jak
Peter Pearson ha scritto: On Sat, 6 May 2023 14:50:40 +0100, Chris Green wrote: [snip] So, what do those =?utf-8? and ?= sequences mean? Are they part of the string or are they wrapped around the string on output as a way to show that it's utf-8 encoded? Yes, "=?utf-8?" signa

Re: What do these '=?utf-8?' sequences mean in python?

2023-05-08 Thread Peter Pearson
On Sat, 6 May 2023 14:50:40 +0100, Chris Green wrote: [snip] > So, what do those =?utf-8? and ?= sequences mean? Are they part of > the string or are they wrapped around the string on output as a way to > show that it's utf-8 encoded? Yes, "=?utf-8?" signals "MIME

Re: What do these '=?utf-8?' sequences mean in python?

2023-05-08 Thread jak
place("_", " ") Job done! Not necessarily. The subject in the original article was: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?= That's some kind of MIME encoding. Just replacing underscores by spaces won't necessarily give you anyth

What do these '=?utf-8?' sequences mean in python?

2023-05-08 Thread Chris Green
non-ASCII characters in it) is:- =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?= Whatever I try I am unable to change the underscore characters in the above string back to spaces. So, what do those =?utf-8? and ?= sequences mean? Are they part of the string o

Re: What do these '=?utf-8?' sequences mean in python?

2023-05-08 Thread Chris Green
Strings are immutable, I have to do:- > > > > newstring = oldstring.replace("_", " ") > > > > Job done! > > Not necessarily. > > The subject in the original article was: > =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)

Re: What do these '=?utf-8?' sequences mean in python?

2023-05-08 Thread Chris Green
Chris Green wrote: > I'm having a real hard time trying to do anything to a string (?) > returned by mailbox.MaildirMessage.get(). > What a twit I am :-) Strings are immutable, I have to do:- newstring = oldstring.replace("_", " ") Job done! -- Chris Green · -- https://mail.python.org/m

Re: UTF-8 and latin1

2022-10-25 Thread Chris Angelico
>> path = pathlib.Path( name ) > >> for encoding in( "utf_8", "cp1252", "latin_1" ): > >> try: > >> with path.open( encoding=encoding, errors="strict" )as file: > > > > I also read a book which claimed that the tkinter.T
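
The quoted loop tries a list of encodings in order with strict error handling. A minimal sketch of the same idea applied to bytes rather than a file path; the function name `decode_with_fallback` is hypothetical and the rest of the quoted code is not shown in this listing:

```python
def decode_with_fallback(data, encodings=("utf_8", "cp1252", "latin_1")):
    """Return (text, encoding) for the first encoding that decodes data strictly."""
    for encoding in encodings:
        try:
            return data.decode(encoding, errors="strict"), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")


utf8_result = decode_with_fallback(b"Montr\xc3\xa9al")
cp1252_result = decode_with_fallback(b"Montr\xe9al")
latin1_result = decode_with_fallback(b"\x81")
```

Note that `latin_1` accepts every byte sequence, so it only makes sense as the last candidate, and a successful decode is a guess rather than proof of the original encoding.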

Re: UTF-8 and latin1

2022-10-25 Thread Barry Scott
", "latin_1" ): >> try: >> with path.open( encoding=encoding, errors="strict" )as file: > > I also read a book which claimed that the tkinter.Text > widget would accept bytes and guess whether these are > encoded in UTF-8 or "ISO 8859-1"

Re: UTF-8 and latin1

2022-08-19 Thread Dennis Lee Bieber
On Thu, 18 Aug 2022 11:33:59 -0700, Tobiah declaimed the following: > >So how does this break down? When a person enters >Montréal, Quebéc into a form field, what are they >doing on the keyboard to make that happen? As the >string sits there in the text box, is it latin

Re: UTF-8 and latin1

2022-08-19 Thread Daniel Lee
Thanks! From: Stefan Ram<mailto:r...@zedat.fu-berlin.de> Sent: 19 August 2022 6:23 To: python-list@python.org<mailto:python-list@python.org> Subject: Re: UTF-8 and latin1 Tobiah writes: > When a person enters >Montréal, Quebéc into a form field, what are

Re: UTF-8 and latin1

2022-08-18 Thread Chris Angelico
ntréal, Quebéc into a form field, what are they > doing on the keyboard to make that happen? As the > string sits there in the text box, is it latin1, or utf-8 > or something else? How does the browser know what > sort of data it has in that text box? > As it sits there in the text box

Re: UTF-8 and latin1

2022-08-18 Thread Jon Ribbens via Python-list
'e'. If they're using a French ("azerty") keyboard then I think they can enter it by holding 'shift' and typing '2'. > As the string sits there in the text box, is it latin1, or utf-8 > or something else? That depends on which browser you're

Re: UTF-8 and latin1

2022-08-18 Thread Tobiah
there in the text box, is it latin1, or utf-8 or something else? How does the browser know what sort of data it has in that text box? -- https://mail.python.org/mailman/listinfo/python-list

Re: UTF-8 and latin1

2022-08-18 Thread Jon Ribbens via Python-list
On 2022-08-18, Tobiah wrote: >> Generally speaking browser submissions were/are supposed to be sent >> using the same encoding as the page, so if you're sending the page >> as "latin1" then you'll see that a fair amount I should think. If you >> send i

Re: UTF-8 and latin1

2022-08-18 Thread Jon Ribbens via Python-list
>>> some_string.decode('latin1') >>> to get unicode that I can use with xlsxwriter, >>> or put in the header of a web page to display >>> European characters correctly. But normally UTF-8 is recommended as >>> the encoding to use today. latin1

Re: UTF-8 and latin1

2022-08-18 Thread Tobiah
Generally speaking browser submissions were/are supposed to be sent using the same encoding as the page, so if you're sending the page as "latin1" then you'll see that a fair amount I should think. If you send it as "utf-8" then you'll get 100% utf-8 back

Re: UTF-8 and latin1

2022-08-18 Thread Jon Ribbens via Python-list
On 2022-08-17, Tobiah wrote: >> That has already been decided, as much as it ever can be. UTF-8 is >> essentially always the correct encoding to use on output, and almost >> always the correct encoding to assume on input absent any explicit >> indication of another

Re: UTF-8 and latin1

2022-08-17 Thread dn
hod. ("bytes" objects do.) > >> to get unicode that I can use with xlsxwriter, >> or put in the header of a web page to display >> European characters correctly. > > |You should always use the UTF-8 character encoding. (Remember > |that this means you also

Re: UTF-8 and latin1

2022-08-17 Thread Barry
') >> to get unicode that I can use with xlsxwriter, >> or put in the header of a web page to display >> European characters correctly. But normally UTF-8 is recommended as >> the encoding to use today. latin1 works correctly more often when I >> am using data from

Re: UTF-8 and latin1

2022-08-17 Thread Tobiah
That has already been decided, as much as it ever can be. UTF-8 is essentially always the correct encoding to use on output, and almost always the correct encoding to assume on input absent any explicit indication of another encoding. (e.g. the HTML "standard" says that all HTML files m

Re: UTF-8 and latin1

2022-08-17 Thread Tobiah
On 8/17/22 08:33, Stefan Ram wrote: Tobiah writes: I get data from various sources; client emails, spreadsheets, and data from web applications. I find that I can do some_string.decode('latin1') Strings have no "decode" method. ("bytes" objects do.) I'm using 2.7. Maybe that's why.

UTF-8 and latin1

2022-08-17 Thread Tobiah
I get data from various sources; client emails, spreadsheets, and data from web applications. I find that I can do some_string.decode('latin1') to get unicode that I can use with xlsxwriter, or put in the header of a web page to display European characters correctly. But normall

Re: UTF-8 and latin1

2022-08-17 Thread Jon Ribbens via Python-list
> European characters correctly. But normally UTF-8 is recommended as > the encoding to use today. latin1 works correctly more often when I > am using data from the wild. It's frustrating that I have to play > a guessing game to figure out how to use incoming text. I'

Re: 'äÄöÖüÜ' in Unicode (utf-8)

2022-04-07 Thread Anssi Saari
Dennis Lee Bieber writes: > On Fri, 1 Apr 2022 03:59:32 +1100, Chris Angelico > declaimed the following: > > >>That's jmf. Ignore him. He knows nothing about Unicode and is >>determined to make everyone aware of that fact. >> >>He got blocked from the mailing list ages ago, and I don't think >>a

Re: 'äÄöÖüÜ' in Unicode (utf-8)

2022-04-01 Thread Chris Angelico
On Fri, 1 Apr 2022 at 11:16, Dennis Lee Bieber wrote: > > On Fri, 1 Apr 2022 03:59:32 +1100, Chris Angelico > declaimed the following: > > > >That's jmf. Ignore him. He knows nothing about Unicode and is > >determined to make everyone aware of that fact. > > > >He got blocked from the mailing lis

Re: 'äÄöÖüÜ' in Unicode (utf-8)

2022-03-31 Thread Dennis Lee Bieber
On Fri, 1 Apr 2022 03:59:32 +1100, Chris Angelico declaimed the following: >That's jmf. Ignore him. He knows nothing about Unicode and is >determined to make everyone aware of that fact. > >He got blocked from the mailing list ages ago, and I don't think >anyone's regretted it. > Ah yes.

Re: 'äÄöÖüÜ' in Unicode (utf-8)

2022-03-31 Thread Chris Angelico
On Fri, 1 Apr 2022 at 03:45, Dennis Lee Bieber wrote: > > On Thu, 31 Mar 2022 00:36:10 -0700 (PDT), moi > declaimed the following: > > >>>> 'äÄöÖüÜ'.encode('utf-8') > >b'\xc3\xa4\xc3\x84\xc3\xb6\xc3\x96\xc3\xbc\xc3\x9c' > >

Re: 'äÄöÖüÜ' in Unicode (utf-8)

2022-03-31 Thread Dennis Lee Bieber
On Thu, 31 Mar 2022 00:36:10 -0700 (PDT), moi declaimed the following: >>>> 'äÄöÖüÜ'.encode('utf-8') >b'\xc3\xa4\xc3\x84\xc3\xb6\xc3\x96\xc3\xbc\xc3\x9c' >>>> len('äÄöÖüÜ'.encode('utf-8')) >12 >>>
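
The interpreter session quoted here can be reproduced with a short sketch that contrasts the number of code points with the number of UTF-8 bytes; the variable names are illustrative:

```python
text = "äÄöÖüÜ"
encoded = text.encode("utf-8")

# Six code points, each needing two bytes in UTF-8.
code_points = len(text)
byte_count = len(encoded)
```

Both numbers are correct; they simply count different things.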

Re: Reversible malformed UTF-8 to malformed UTF-16 encoding

2019-03-19 Thread Florian Weimer
* MRAB: > On 2019-03-19 20:32, Florian Weimer wrote: >> I've seen occasional proposals like this one coming up: >> >> | I therefore suggested 1999-11-02 on the unic...@unicode.org mailing >> | list the following approach. Instead of using U+FFFD, simply encode

Re: Reversible malformed UTF-8 to malformed UTF-16 encoding

2019-03-19 Thread MRAB
On 2019-03-19 20:32, Florian Weimer wrote: I've seen occasional proposals like this one coming up: | I therefore suggested 1999-11-02 on the unic...@unicode.org mailing | list the following approach. Instead of using U+FFFD, simply encode | malformed UTF-8 sequences as malformed U

Reversible malformed UTF-8 to malformed UTF-16 encoding

2019-03-19 Thread Florian Weimer
I've seen occasional proposals like this one coming up: | I therefore suggested 1999-11-02 on the unic...@unicode.org mailing | list the following approach. Instead of using U+FFFD, simply encode | malformed UTF-8 sequences as malformed UTF-16 sequences. Malformed | UTF-8 sequences co

curses textpad.py UTF-8 support

2019-01-09 Thread elig0n
is non existent. Unicode input text won't show up. It probably needs to be rewritten with get_wch() as was suggested in the following SO question before get_wch() was implemented, together with proper key code parsing (in do_command()) and probably more as to prevent breakage [ https://stackove

help! PyQt4 and UTF-8

2018-08-14 Thread inhahe
I can display UTF-8 when I use wxPython: -- import wx app = wx.App() s = 'testing\xf0\x9f\x98\x80' frame = wx.Frame(None, wx.ID_ANY) font = wx.Font("Arial") textbox = wx.TextCtrl(frame, id=wx.ID_ANY) textbox.SetFont(font) textbox.WriteText(s) frame.Show() app.MainLoop() -

Re: Can utf-8 encoded character contain a byte of TAB?

2018-01-15 Thread Chris Angelico
On Tue, Jan 16, 2018 at 8:29 AM, Peng Yu wrote: >> Just to be clear, TAB *only* appears in utf-8 as the encoding for the actual >> TAB character, not as a part of any other character's encoding. The only >> bytes that can appear in the utf-8 encoding of non-ascii char

Re: Can utf-8 encoded character contain a byte of TAB?

2018-01-15 Thread Peng Yu
> Just to be clear, TAB *only* appears in utf-8 as the encoding for the actual > TAB character, not as a part of any other character's encoding. The only > bytes that can appear in the utf-8 encoding of non-ascii characters are > starting with 0xC2 through 0xF4, followed by on
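
The property stated here, that the TAB byte only ever encodes an actual TAB, is what makes it safe to split UTF-8 bytes on `b"\t"`. A small sketch with made-up field values:

```python
# Illustrative TSV line with non-ASCII fields, encoded as UTF-8.
line = "Montréal\tфxx\tÜbung".encode("utf-8")

# Splitting the raw bytes on TAB cannot cut a multi-byte character in half.
fields = [field.decode("utf-8") for field in line.split(b"\t")]

# Every byte in the encoding of a non-ASCII character is 0x80 or above,
# so none of them can equal TAB (0x09).
non_ascii = "éфÜ"
all_high = all(
    byte >= 0x80 for ch in non_ascii for byte in ch.encode("utf-8")
)
```

This says nothing about fields that legitimately contain a TAB character; those still need quoting or escaping.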

Re: Can utf-8 encoded character contain a byte of TAB?

2018-01-15 Thread Random832
On Mon, Jan 15, 2018, at 09:35, Peter Otten wrote: > Peng Yu wrote: > > > Can utf-8 encoded character contain a byte of TAB? > > Yes; ascii is a subset of utf8. > > If you want to allow fields containing TABs in a file where TAB is also the > field separator you need

Re: Can utf-8 encoded character contain a byte of TAB?

2018-01-15 Thread Peter Otten
Peng Yu wrote: > Can utf-8 encoded character contain a byte of TAB? Yes; ascii is a subset of utf8. Python 2.7.6 (default, Nov 23 2017, 15:49:48) [GCC 4.8.4] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>

Can utf-8 encoded character contain a byte of TAB?

2018-01-15 Thread Peng Yu
Hi, I use the following code to process TSV input. $ printf '%s\t%s\n' {1..10} | ./main.py ['1', '2'] ['3', '4'] ['5', '6'] ['7', '8'] ['9', '10'] $ cat main.py #!/usr/bin/env p

Re: PEP 393 vs UTF-8 Everywhere

2017-01-22 Thread Steve D'Aprano
>> I'm afraid Python's choice may lead to exploitable security holes in >>> Python programs. >> >> Feel free to back up that with an actual demonstration of an exploit, >> rather than just FUD. > > It might come as a surprise to programmers that pathnames can

Re: PEP 393 vs UTF-8 Everywhere

2017-01-22 Thread Marko Rauhamaa
Steve D'Aprano : > On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote: > >> Steve D'Aprano : >> >>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote: Also, [surrogates] don't exist as Unicode code points. Python shouldn't allow surrogate characters in strings. >>> >>> Not quite. This

Re: PEP 393 vs UTF-8 Everywhere

2017-01-22 Thread Steve D'Aprano
On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote: > Steve D'Aprano : > >> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote: >>> Also, [surrogates] don't exist as Unicode code points. Python >>> shouldn't allow surrogate characters in strings. >> >> Not quite. This is where it gets a bit messy

Re: PEP 393 vs UTF-8 Everywhere

2017-01-22 Thread Marko Rauhamaa
Steve D'Aprano : > On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote: >> Also, [surrogates] don't exist as Unicode code points. Python >> shouldn't allow surrogate characters in strings. > > Not quite. This is where it gets a bit messy and confusing. The bottom > line is: surrogates *are* code po

Re: PEP 393 vs UTF-8 Everywhere

2017-01-22 Thread Marko Rauhamaa
eryk sun : > On Sat, Jan 21, 2017 at 8:21 PM, Pete Forman wrote: >> Marko Rauhamaa writes: >> py> low = '\uDC37' >>> >>> That should raise a SyntaxError exception. >> >> Quite. [...] > > CPython allows surrogate codes for use with the "surrogateescape" and > "surrogatepass" error handlers,

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steven D'Aprano
ng only makes sense (for every > use-case I've been able to come up with) in the context of known > offsets like you describe with tell(). I'm sorry, I find it hard to believe that you've never needed to add or subtract 1 from a given offset returned by find() or equiv

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steve D'Aprano
On Sun, 22 Jan 2017 07:21 am, Pete Forman wrote: > Marko Rauhamaa writes: > >>> py> low = '\uDC37' >> >> That should raise a SyntaxError exception. > > Quite. My point was that with older Python on a narrow build (Windows > and Mac) you need to understand that you are using UTF-16 rather than >

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steve D'Aprano
On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote: > Pete Forman : > >> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8 >> and UTF-32. > > Also, they don't exist as Unicode code points. Python shouldn't allow > surrogate characters

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Tim Chase
Right, so here, you've done a (likely linear, but however you get here) search, which then makes sense to use this opaque "offset" token for slicing purposes: > py> stuff = text[offset:] > py> assert stuff == "фxx" > That works fine whether indexing refers t

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Matt Ruffalo
On 2017-01-21 10:50, Pete Forman wrote: > Thanks for a very thorough reply, most useful. I'm going to pick you up > on the above, though. > > Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8 > and UTF-32. The rules for UTF-8 were tightened up in Unicode 4

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread eryk sun
'ascii', 'surrogateescape') b'\x81' This error handler is required by CPython on POSIX to handle arbitrary bytes in file-system paths. For example, when running with LANG=C: >>> sys.getfilesystemencoding() 'ascii' >>> os.listdir(b'.')
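
The round trip that this message relies on can be sketched as below. The byte string is a made-up example, not a real file name from the thread:

```python
# A byte that is not valid ASCII, as might occur in a file name.
raw = b"caf\x81"

# "surrogateescape" maps the undecodable byte 0x81 to the lone surrogate U+DC81.
name = raw.decode("ascii", "surrogateescape")

# Encoding with the same handler restores the original byte exactly.
round_trip = name.encode("ascii", "surrogateescape")
```

The resulting str cannot be encoded with a strict handler, so it is best kept for passing back to the operating system rather than for display.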

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Pete Forman
Marko Rauhamaa writes: >> py> low = '\uDC37' > > That should raise a SyntaxError exception. Quite. My point was that with older Python on a narrow build (Windows and Mac) you need to understand that you are using UTF-16 rather than Unicode. On a wide build or Python 3.3+ then all is rosy. (At th

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Marko Rauhamaa
Pete Forman : > Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8 > and UTF-32. Also, they don't exist as Unicode code points. Python shouldn't allow surrogate characters in strings. Thus the range of code points that are available for use as charac

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Jussi Piitulainen
Chris Angelico writes: > On Sun, Jan 22, 2017 at 2:56 AM, Jussi Piitulainen wrote: >> Steve D'Aprano writes: >> >> [snip] >> >>> You could avoid that error by increasing the offset by the right >>> amount: >>> >>> stuff = text[

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Chris Angelico
On Sun, Jan 22, 2017 at 2:56 AM, Jussi Piitulainen wrote: > Steve D'Aprano writes: > > [snip] > >> You could avoid that error by increasing the offset by the right >> amount: >> >> stuff = text[offset + len("ф".encode('utf-8')):] >>

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Jussi Piitulainen
Steve D'Aprano writes: [snip] > You could avoid that error by increasing the offset by the right > amount: > > stuff = text[offset + len("ф".encode('utf-8')):] > > which is awful. I believe that's what Go and Julia expect you to do. Julia provides

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Pete Forman
Steve D'Aprano writes: > [...] > Another factor which I didn't see discussed anywhere is that Python > strings treat surrogates as normal code points. I believe that would > be troublesome for a UTF-8 implementation: > > py> '\uDC37'.encode('utf-8'
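
The behaviour being quoted can be checked with a short sketch. It assumes CPython 3 and uses the "surrogatepass" handler mentioned elsewhere in the thread; the variable names are illustrative:

```python
low = "\udc37"  # a lone low surrogate held in a str

# Strict UTF-8 encoding refuses lone surrogates.
try:
    low.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" encodes the surrogate as if it were an ordinary code point.
passed = low.encode("utf-8", "surrogatepass")
```

The bytes produced by "surrogatepass" are not valid UTF-8 for other consumers, so this handler is only suitable where both ends agree on it.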

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steve D'Aprano
de points or bytes. py> "αβγдлфxx".find("ф") 5 py> "αβγдлфxx".encode('utf-8').find("ф".encode('utf-8')) 10 Either way, you get the expected result. However: py> stuff = text[offset + 1:] py> assert stuff == "xx" Tha
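
The difference between code-point offsets and byte offsets in this message can be made concrete with a small sketch that reuses the sample string from the post; the other names are illustrative:

```python
text = "αβγдлфxx"
needle = "ф"

# Code-point indexing: stepping past the match is always "+ 1".
offset = text.find(needle)
after = text[offset + 1:]

# Byte indexing: stepping past the match needs the encoded length.
data = text.encode("utf-8")
byte_needle = needle.encode("utf-8")
byte_offset = data.find(byte_needle)
byte_after = data[byte_offset + len(byte_needle):].decode("utf-8")

# Adding 1 to a byte offset lands inside the two-byte character.
try:
    data[byte_offset + 1:].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
```

Both searches find the same character; only the arithmetic on the returned offset differs.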

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steve D'Aprano
On Sat, 21 Jan 2017 09:35 am, Pete Forman wrote: > Can anyone point me at a rationale for PEP 393 being incorporated in > Python 3.3 over using UTF-8 as an internal string representation? I've read over the PEP, and the email discussion, and there is very little mention of UTF-8, and

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Tim Chase
On 2017-01-21 11:58, Chris Angelico wrote: > So, how could you implement this function? The current > implementation maintains an index - an integer position through the > string. It repeatedly requests the next character as string[idx], > and can also slice the string (to check for keywords like "

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Paul Rubin
Chris Angelico writes: > You can't do a look-ahead with a vanilla string iterator. That's > necessary for a lot of parsers. For JSON? For other parsers you usually have a tokenizer that reads characters with maybe 1 char of lookahead. > Yes, which gives a two-level indexing (first find the stra

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Jussi Piitulainen
Chris Angelico writes: > On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman wrote: >> I was asserting that most useful operations on strings start from >> index 0. The r* operations would not be slowed down that much as >> UTF-8 has the useful property that attempting to interpre

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Angelico
On Sat, Jan 21, 2017 at 5:01 PM, Paul Rubin wrote: > Chris Angelico writes: >> decoding JSON... the scanner, which steps through the string and >> does the actual parsing. ... >> The only way for it to be fast enough would be to have some sort of >> retainable string iterator, which means exposin

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Paul Rubin
I'm missing something. Of course a json parser should use it, though who uses the non-C json parser anyway these days? [Chris Kaynor writes:] > rfind/rsplit/rindex/rstrip and the other related reverse > functions would require walking the string from start to end, rather > than short-circu

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread MRAB
'm not getting paid for the work, it's purely voluntary. PEP 393 / Python 3.3 required extension writers to revisit their access to strings. My explicit question was about why PEP 393 was adopted to replace the deficient old implementations rather than another approach. The implicit questio

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Angelico
of astral characters (plus *maybe* a faster encode-to-UTF-8; you wouldn't get a faster decode-from-UTF-8, because you still need to check that the byte sequence is valid). Can you show a use-case that would be materially improved by UTF-8? ChrisA -- https://mail.python.org/mailman/listinfo/python-list

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Angelico
On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman wrote: > I was asserting that most useful operations on strings start from index > 0. The r* operations would not be slowed down that much as UTF-8 has the > useful property that attempting to interpret from a byte that is not at > th

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Pete Forman
ace the deficient old implementations rather than another approach. The implicit question is whether a UTF-8 internal representation should replace that of PEP 393. -- Pete Forman -- https://mail.python.org/mailman/listinfo/python-list

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Pete Forman
Chris Kaynor writes: > On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman wrote: >> Can anyone point me at a rationale for PEP 393 being incorporated in >> Python 3.3 over using UTF-8 as an internal string representation? >> I've found good articles by Nick Coghlan, Armin

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread MRAB
On 2017-01-20 23:06, Chris Kaynor wrote: On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman wrote: Can anyone point me at a rationale for PEP 393 being incorporated in Python 3.3 over using UTF-8 as an internal string representation? I've found good articles by Nick Coghlan, Armin Ronache

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Kaynor
. On Fri, Jan 20, 2017 at 3:15 PM, Thomas Nyberg wrote: > On 01/20/2017 03:06 PM, Chris Kaynor wrote: >> >> >> [...snip...] >> >> -- >> Chris Kaynor >> > > I was able to delete my response which was a wholly contained subset of this > one. :) > > > But I have one extra question. Is string

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Angelico
ally > change if it weren't for all the reasons you mentioned.) I found this which > at details (if not explicitly "guarantees") the complexity properties of > other datatypes: > No, it isn't; this question came up in the context of MicroPython, which chose

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Thomas Nyberg
On 01/20/2017 03:06 PM, Chris Kaynor wrote: [...snip...] -- Chris Kaynor I was able to delete my response which was a wholly contained subset of this one. :) But I have one extra question. Is string indexing guaranteed to be constant-time for python? I thought so, but I couldn't

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Kaynor
On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman wrote: > Can anyone point me at a rationale for PEP 393 being incorporated in > Python 3.3 over using UTF-8 as an internal string representation? I've > found good articles by Nick Coghlan, Armin Ronacher and others on the > matter

PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Pete Forman
Can anyone point me at a rationale for PEP 393 being incorporated in Python 3.3 over using UTF-8 as an internal string representation? I've found good articles by Nick Coghlan, Armin Ronacher and others on the matter. What I have not found is discussion of pros and cons of alternatives to th

Re: UTF-8 Encoding Error

2016-12-29 Thread subhabangalore
CII, you are > now running a broken system with subtle bugs, including in data structures > as fundamental as dicts. > > The standard behaviour: > > py> d = {u'café': 1} > py> for key in d: > ... print key == 'caf\xc3\xa9' > ... > False >
