Re: [Python-Dev] Bytes path support
On Fri, Aug 22, 2014 at 04:42:29AM +0200, Oleg Broytman wrote:
> On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal
> wrote:
> > This brings up the other key problem. If file names are (almost)
> > arbitrary bytes, how do you write one to/read one from a text file
> > with a particular encoding? ( or for that matter display it on a
> > terminal)
>
> There is no such thing as an encoding of text files.

I don't understand this comment. It seems to me that *text* files have
to have an encoding, otherwise you can't interpret the contents as text.
Files, of course, only contain bytes, but to be treated as text you
need some way of transforming byte N to char C (or multiple bytes to C),
which is an encoding.

Perhaps you just mean that encodings are not recorded in the text file
itself?

To answer Chris' question, you typically cannot include arbitrary
bytes in text files, and displaying them to the user is likewise
problematic. The usual solution is to support some form of escaping,
like \t, &#x0A; or %0D, to give a few examples.

--
Steven
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
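A minimal sketch of the escaping idea mentioned above, using the %XX style. The sample file name is made up for illustration; `quote_from_bytes` and `unquote_to_bytes` are standard-library functions in `urllib.parse`. This is one possible escaping scheme, not the only one.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Hypothetical file name: four cp1251 letters plus an ASCII tail.
# These bytes are not valid UTF-8.
raw_name = b"\xf4\xe0\xe9\xeb 1.mp3"

# Percent-escape every byte outside the unreserved ASCII set.
escaped = quote_from_bytes(raw_name)

# The escaped form is pure ASCII, so it can be stored in a text file
# written in any ASCII-compatible encoding.
escaped.encode("ascii")

# Unescaping restores the original bytes exactly.
restored = unquote_to_bytes(escaped)
assert restored == raw_name
```

The cost is that the stored form is no longer human-readable; the benefit is that it survives any text encoding that preserves ASCII.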
Re: [Python-Dev] Bytes path support
On 22.08.14 01:56, Glenn Linderman wrote:
> 0 and 47 are certainly originally derived from ASCII. However, there
> could be lots of encodings that are not ASCII compatible (but in
> practice, probably very few, since most encodings _are_ ASCII
> compatible) that could fit those constraints.
>
> So while as a technical matter, Cameron is correct that Unix only treats
> 0 & 47 as special, and that is insufficient to declare that encodings
> must be ASCII compatible, as a practical matter, since most encodings
> are ASCII compatible anyway, it would be hard to find very many that
> could be used successfully with Unix file names that are not ASCII
> compatible, that could comply with the 0 & 47 requirements.

More importantly, existing encodings that are distinctively *not* ASCII
compatible (e.g. the EBCDIC ones) do not put the slash at 47 (instead,
it is at 97 in EBCDIC, where 47 is the BEL control character).

There are boundary cases, of course. VISCII is "mostly ASCII
compatible", putting graphic characters into some of the control
characters, but using those that aren't used in ASCII, anyway. And then
there is the YUSCII family of encodings, which definitely is not ASCII
compatible, as it does not contain Latin characters, but still puts
the / at 47 (and also keeps the ASCII digits and special characters
in their positions).

There is also SI 960, which has the slash, the ASCII uppercase letters,
digits and special characters, but replaces the lower-case characters
with Hebrew.

So yes, Unix doesn't mandate ASCII-compatible encodings; but it still
mandates ASCII-inspired encodings.

I wonder how you would run "gcc", though, on an SI 960 system; you'd
have to type חדד.

Regards,
Martin
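The point about EBCDIC can be checked from Python. This is a small sketch that assumes cp037, one of the EBCDIC code pages shipped with CPython, as a representative; other EBCDIC variants are not examined here.

```python
# cp037 is used as a representative EBCDIC code page (an assumption).
slash_in_ebcdic = "/".encode("cp037")[0]
byte_47_in_ebcdic = bytes([47]).decode("cp037")

print(slash_in_ebcdic)          # byte value that encodes "/" in cp037
print(repr(byte_47_in_ebcdic))  # character that byte 47 decodes to in cp037

# For comparison, ASCII-compatible encodings all keep "/" at byte 47.
for codec in ("ascii", "utf-8", "koi8-r", "cp1251", "latin-1"):
    assert "/".encode(codec) == bytes([47])
```

A file name encoded in cp037 would therefore contain no byte 47 where the user typed a slash, and the kernel would not see a path separator there.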
Re: [Python-Dev] Bytes path support
Hi!

On Sat, Aug 23, 2014 at 01:19:14AM +1000, Steven D'Aprano wrote:
> On Fri, Aug 22, 2014 at 04:42:29AM +0200, Oleg Broytman wrote:
> > On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal
> > wrote:
> > > This brings up the other key problem. If file names are (almost)
> > > arbitrary bytes, how do you write one to/read one from a text file
> > > with a particular encoding? ( or for that matter display it on a
> > > terminal)
> >
> > There is no such thing as an encoding of text files.
>
> I don't understand this comment. It seems to me that *text* files have
> to have an encoding, otherwise you can't interpret the contents as text.

   What encoding does have a text file (an HTML, to be precise) with
text in utf-8, ads in cp1251 (ad blocks were included from different
files) and comments in koi8-r?
   Well, I must admit the HTML was rather an exception, but having a
text file with some strange characters (binary strings, or paragraphs
in different encodings) is not that exceptional.

> Files, of course, only contain bytes, but to be treated as text you
> need some way of transforming byte N to char C (or multiple bytes to C),
> which is an encoding.

   But you don't need to treat the entire file in one encoding. Strange
characters are clearly visible so you can interpret them differently. I
am very much trained to distinguish koi8, cp1251 and utf-8 texts; I
cannot translate them mentally but I can recognize them.

> Perhaps you just mean that encodings are not recorded in the text file
> itself?

   Yes, that too.

> To answer Chris' question, you typically cannot include arbitrary
> bytes in text files, and displaying them to the user is likewise
> problematic

   As a person who views utf-8 files in koi8 fonts (and vice versa)
every day I'd argue. (-:

Oleg.
--
     Oleg Broytman            http://phdru.name/            [email protected]
           Programmers don't die, they just GOSUB without RETURN.
[Python-Dev] Summary of Python tracker Issues
ACTIVITY SUMMARY (2014-08-15 - 2014-08-22)
Python tracker at http://bugs.python.org/

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issues counts and deltas:
  open    4621 (+19)
  closed 29399 (+28)
  total  34020 (+47)

Open issues with patches: 2179

Issues opened (41)
==================

#22207: Test for integer overflow on Py_ssize_t: explicitly cast to si
http://bugs.python.org/issue22207  opened by haypo

#22208: tarfile can't add in memory files (reopened)
http://bugs.python.org/issue22208  opened by markgrandi

#22209: Idle: add better access to extension information
http://bugs.python.org/issue22209  opened by terry.reedy

#22210: pdb-run-restarting-a-pdb-session
http://bugs.python.org/issue22210  opened by zhengxiexie

#22211: Remove VMS specific code in expat.h & xmlrole.h
http://bugs.python.org/issue22211  opened by John.Malmberg

#22212: zipfile.py fails if zlib.so module fails to build.
http://bugs.python.org/issue22212  opened by John.Malmberg

#22213: pyvenv style virtual environments unusable in an embedded syst
http://bugs.python.org/issue22213  opened by grahamd

#22214: Tkinter: Don't stringify callbacks arguments
http://bugs.python.org/issue22214  opened by serhiy.storchaka

#22215: "embedded NUL character" exceptions
http://bugs.python.org/issue22215  opened by serhiy.storchaka

#22216: smtplip STARTTLS fails at second attampt due to unsufficiant q
http://bugs.python.org/issue22216  opened by zvyn

#22217: Reprs for zipfile classes
http://bugs.python.org/issue22217  opened by serhiy.storchaka

#22218: Fix more compiler warnings "comparison between signed and unsi
http://bugs.python.org/issue22218  opened by haypo

#22219: python -mzipfile fails to add empty folders to created zip
http://bugs.python.org/issue22219  opened by Antony.Lee

#22220: Ttk extensions test failure
http://bugs.python.org/issue22220  opened by serhiy.storchaka

#22221: ast.literal_eval confused by coding declarations
http://bugs.python.org/issue22221  opened by jorgenschaefer

#22222: dtoa.c: remove custom memory allocator
http://bugs.python.org/issue22222  opened by haypo

#22223: argparse not including '--' arguments in previous optional REM
http://bugs.python.org/issue22223  opened by Jurko.Gospodnetić

#22225: Add SQLite support to http.cookiejar
http://bugs.python.org/issue22225  opened by demian.brecht

#22226: Refactor dict result handling in Tkinter
http://bugs.python.org/issue22226  opened by serhiy.storchaka

#22227: Simplify tarfile iterator
http://bugs.python.org/issue22227  opened by serhiy.storchaka

#22228: Adapt bash readline operate-and-get-next function
http://bugs.python.org/issue22228  opened by lelit

#22229: wsgiref doesn't appear to ever set REMOTE_HOST in the environ
http://bugs.python.org/issue22229  opened by alex

#22231: httplib: unicode url will cause an ascii codec error when comb
http://bugs.python.org/issue22231  opened by Bob.Chen

#22232: str.splitlines splitting on none-\r\n characters
http://bugs.python.org/issue22232  opened by scharron

#22233: http.client splits headers on none-\r\n characters
http://bugs.python.org/issue22233  opened by scharron

#22234: urllib.parse.urlparse accepts any falsy value as an url
http://bugs.python.org/issue22234  opened by Ztane

#22235: httplib: TypeError with file() object in ssl.py
http://bugs.python.org/issue22235  opened by erob

#22236: Do not use _default_root in Tkinter tests
http://bugs.python.org/issue22236  opened by serhiy.storchaka

#22237: sorted() docs should state that the sort is stable
http://bugs.python.org/issue22237  opened by Wilfred.Hughes

#22239: asyncio: nested event loop
http://bugs.python.org/issue22239  opened by djarb

#22240: argparse support for "python -m module" in help
http://bugs.python.org/issue22240  opened by tebeka

#22241: strftime/strptime round trip fails even for UTC datetime objec
http://bugs.python.org/issue22241  opened by akira

#22242: Doc fix in the Import section in language reference.
http://bugs.python.org/issue22242  opened by jon.poler

#22243: Documentation on try statement incorrectly implies target of e
http://bugs.python.org/issue22243  opened by mwilliamson

#22244: load_verify_locations fails to handle unicode paths on Python
http://bugs.python.org/issue22244  opened by alex

#22246: add strptime(s, '%s')
http://bugs.python.org/issue22246  opened by akira

#22247: More incomplete module.__all__ lists
http://bugs.python.org/issue22247  opened by vadmium

#22248: urllib.request.urlopen raises exception when 30X-redirect url
http://bugs.python.org/issue22248  opened by tomasgroth

#22249: Possibly incorrect example is given for socket.getaddrinfo()
http://bugs.python.org/issue22249  opened by Alexander.Patrakov

#22250: unittest lowercase methods
http://bugs.python.org/issue22250  opened by simonzack

#22251: Various markup errors in documentation
http://bugs.python.org/issue22251  opened by berker.peksag

Most rece
Re: [Python-Dev] Bytes path support
On 8/22/2014 8:51 AM, Oleg Broytman wrote:
> What encoding does have a text file (an HTML, to be precise) with
> text in utf-8, ads in cp1251 (ad blocks were included from different
> files) and comments in koi8-r?
> Well, I must admit the HTML was rather an exception, but having a
> text file with some strange characters (binary strings, or paragraphs
> in different encodings) is not that exceptional.

That's not a text file. That's a binary file containing (hopefully
delimited, and documented) sections of encoded text in different
encodings. If it is named .html and served by the server as UTF-8, then
the server is misconfigured, or the file is incorrectly populated.
Re: [Python-Dev] Bytes path support
On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman wrote:
> On 8/22/2014 8:51 AM, Oleg Broytman wrote:
> > What encoding does have a text file (an HTML, to be precise) with
> > text in utf-8, ads in cp1251 (ad blocks were included from different
> > files) and comments in koi8-r?
> > Well, I must admit the HTML was rather an exception, but having a
> > text file with some strange characters (binary strings, or paragraphs
> > in different encodings) is not that exceptional.
> That's not a text file. That's a binary file containing (hopefully
> delimited, and documented) sections of encoded text in different
> encodings.

   Allow me to disagree. For me, this is a text file which I can (and
do) view with a pager, edit with a text editor, list on a console,
search with grep and so on. If it is not a text file by strict Python3
standards then these standards are too strict for me. Either I find a
simple workaround in Python3 to work with such texts or find a different
tool. I cannot avoid such files because my reality is much more complex
than strict text/binary dichotomy in Python3.

Oleg.
Re: [Python-Dev] Bytes path support
On 8/22/2014 9:52 AM, Oleg Broytman wrote:
> On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman wrote:
> > On 8/22/2014 8:51 AM, Oleg Broytman wrote:
> > > What encoding does have a text file (an HTML, to be precise) with
> > > text in utf-8, ads in cp1251 (ad blocks were included from different
> > > files) and comments in koi8-r?
> > > Well, I must admit the HTML was rather an exception, but having a
> > > text file with some strange characters (binary strings, or paragraphs
> > > in different encodings) is not that exceptional.
> > That's not a text file. That's a binary file containing (hopefully
> > delimited, and documented) sections of encoded text in different
> > encodings.
> Allow me to disagree. For me, this is a text file which I can (and
> do) view with a pager, edit with a text editor, list on a console,
> search with grep and so on. If it is not a text file by strict Python3
> standards then these standards are too strict for me. Either I find a
> simple workaround in Python3 to work with such texts or find a different
> tool. I cannot avoid such files because my reality is much more complex
> than strict text/binary dichotomy in Python3.
>
> Oleg.

I was not declaring your file not to be a "text file" from any
definition obtained from Python3 documentation, just from a common
sense definition of "text file".

Looking at it from Python3, though, it is clear that when opening a
file in "text" mode, an encoding may be specified or will be assumed.
That is one encoding, applying to the whole file, not 3 encodings, with
declarations on when to switch between them. So I think, in general,
Python3 assumes or defines a definition of text file that matches my
"common sense" definition.

Also, if it is an HTML file, I doubt the browser will use multiple
different encodings when interpreting it, so it is not clear that the
file is of practical use for its intended purpose if it contains text
in multiple different encodings, but is served using only a single
encoding, unless there is javascript or some programming in the browser
that reencodes the data.

On the other hand, Python3 provides various facilities for working with
such files.

The first I'll mention is the one that follows from my description of
what your file really is: Python3 allows opening files in binary mode,
and then decoding various sections of it using whatever encoding you
like, using the bytes.decode() operation on various sections of the
file. Determination of which sections are in which encodings is beyond
the scope of this description of the technique, and is application
dependent.

The second is to specify an error handler, that, like you, is trained
to recognize the other encodings and convert them appropriately. I'm
not aware that such an error handler has been or could be written,
myself not having your training.

The third is to specify the UTF-8 with the surrogate escape error
handler. This allows non-UTF-8 codes to be loaded into memory. You, or
algorithms as smart as you, could perhaps be developed to detect and
manipulate the resulting "lone surrogate" codes in meaningful ways, or
could simply allow them to ride along without interpretation, and be
emitted as the original, into other files.

There may be other techniques that I am not aware of.

Glenn
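The first technique (binary mode plus per-section `bytes.decode()`) might look like the sketch below. The sample data is invented, and the list saying which line uses which encoding stands in for the application-dependent knowledge the message says is out of scope; nothing here detects encodings automatically.

```python
import os
import tempfile

# Made-up sample: one line per encoding, loosely modelled on the
# mixed HTML file described in the thread.
sample = [("utf-8", "текст"), ("cp1251", "реклама"), ("koi8-r", "комментарий")]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mixed.txt")
    with open(path, "wb") as f:
        for encoding, text in sample:
            f.write(text.encode(encoding) + b"\n")

    # Binary mode: no decoding happens on read, so reading cannot fail
    # on "invalid" bytes.
    with open(path, "rb") as f:
        raw_lines = f.read().split(b"\n")[:-1]

# Hypothetical application knowledge: the encoding of each line.
encodings = [encoding for encoding, _ in sample]

# Decode each section separately with its own encoding.
decoded = [raw.decode(enc) for raw, enc in zip(raw_lines, encodings)]
```

Splitting on the newline byte is safe here because none of the three encodings uses byte 10 inside a multi-byte character; that would not hold for encodings such as UTF-16.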
Re: [Python-Dev] Bytes path support
On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman wrote:
> On 8/22/2014 9:52 AM, Oleg Broytman wrote:
> >On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman wrote:
> >>On 8/22/2014 8:51 AM, Oleg Broytman wrote:
> >>>What encoding does have a text file (an HTML, to be precise) with
> >>>text in utf-8, ads in cp1251 (ad blocks were included from different
> >>>files) and comments in koi8-r?
> >>>Well, I must admit the HTML was rather an exception, but having a
> >>>text file with some strange characters (binary strings, or paragraphs
> >>>in different encodings) is not that exceptional.
> >>That's not a text file. That's a binary file containing (hopefully
> >>delimited, and documented) sections of encoded text in different
> >>encodings.
> >Allow me to disagree. For me, this is a text file which I can (and
> >do) view with a pager, edit with a text editor, list on a console,
> >search with grep and so on. If it is not a text file by strict Python3
> >standards then these standards are too strict for me. Either I find a
> >simple workaround in Python3 to work with such texts or find a different
> >tool. I cannot avoid such files because my reality is much more complex
> >than strict text/binary dichotomy in Python3.
>
> I was not declaring your file not to be a "text file" from any
> definition obtained from Python3 documentation, just from a common
> sense definition of "text file".
And in my opinion those files are perfect text. The files consist of
lines separated by EOL characters (not necessarily EOL characters of my OS
because it could be a text file produced in a different OS), lines
consist of words and words of characters.
> Looking at it from Python3, though, it is clear that when opening a
> file in "text" mode, an encoding may be specified or will be
> assumed. That is one encoding, applying to the whole file, not 3
> encodings, with declarations on when to switch between them. So I
> think, in general, Python3 assumes or defines a definition of text
> file that matches my "common sense" definition.
I don't have problems with Python3 text. I have problems with Python3
trying to get rid of byte strings and treating bytes as strict non-text.
> On the other hand, Python3 provides various facilities for working
> with such files.
>
> The first I'll mention is the one that follows from my description
> of what your file really is: Python3 allows opening files in binary
> mode, and then decoding various sections of it using whatever
> encoding you like, using the bytes.decode() operation on various
> sections of the file. Determination of which sections are in which
> encodings is beyond the scope of this description of the technique,
> and is application dependent.
This is perhaps the most promising approach. If I can open a text
file in binary mode, iterate it line by line, split every line of
non-ascii bytes with .split() and process them that'd satisfy my needs.
But still there are dragons. If I read a filename from such a file I
read it as bytes, not str, so I can only use low-level APIs to
manipulate those filenames. Pity.
Let's see a perfectly normal situation I am quite often in. A person
sent me a directory full of MP3 files. The transport doesn't matter; it
could be FTP, or rsync, or a zip file sent by email, or bittorrent. What
matters is that filenames and content are in alien encodings. Most often
it's cp1251 (the encoding used in Russian Windows) but can be koi8 or
utf8. There is a playlist among the files -- a text file that lists MP3
files, every file on a single line; usually with full paths
("C:\Audio\some.mp3").
Now I want to read filenames from the file and process the filenames
(strip paths) and files (verify existence of files, or renumber the files
or extract ID3 tags [Russian ID3 tags, whatever ID3 standard says, are
also in cp1251 of utf-8 encoding]...whatever). I don't know the encoding
of the playlist but I know it corresponds to the encoding of filenames
so I can expect those files to exist on my filesystem; they have
strange-looking unreadable names but they exist.
Just a small example of why I do want to process filenames from a
text file in an alien encoding. Without knowing the encoding in advance.
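The playlist scenario can be handled without ever knowing the encoding, as long as everything stays in bytes. The sketch below is one way to do it; the helper names `playlist_basenames` and `existing` are hypothetical, and it assumes an ASCII-compatible encoding so that the backslash, CR and LF bytes mean what they do in ASCII.

```python
import ntpath
import os

def playlist_basenames(playlist_bytes):
    """Return the file-name part of each playlist line, as bytes."""
    names = []
    for line in playlist_bytes.splitlines():
        line = line.strip()
        if line:
            # ntpath understands "C:\..." paths and accepts bytes.
            names.append(ntpath.basename(line))
    return names

def existing(directory, names):
    """Keep the names found in directory; bytes in, bytes out."""
    directory = os.fsencode(directory)
    return [n for n in names if os.path.exists(os.path.join(directory, n))]

# Made-up playlist content in cp1251 with Windows paths and line endings.
playlist = b"C:\\Audio\\\xef\xe5\xf1\xed\xff.mp3\r\nC:\\Audio\\b.mp3\r\n"
names = playlist_basenames(playlist)
```

On a POSIX system the resulting bytes can be passed straight to `os.path.exists`, `open` or `os.rename`. Displaying them to a user is a separate problem that still needs a guess at the encoding.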
> The second is to specify an error handler, that, like you, is
> trained to recognize the other encodings and convert them
> appropriately. I'm not aware that such an error handler has been or
> could be written, myself not having your training.
>
> The third is to specify the UTF-8 with the surrogate escape error
> handler. This allows non-UTF-8 codes to be loaded into memory. You,
> or algorithms as smart as you, could perhaps be developed to detect
> and manipulate the resulting "lone surrogate" codes in meaningful
> ways, or could simply allow them to ride along without
> interpretation, and be emitted as the original, into other files.
Yes, these are different workarounds.
Oleg.
--
Oleg Broytman
Re: [Python-Dev] Bytes path support
On 8/22/2014 11:50 AM, Oleg Broytman wrote:
> On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman wrote:
> > On 8/22/2014 9:52 AM, Oleg Broytman wrote:
> > > On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman wrote:
> > > > On 8/22/2014 8:51 AM, Oleg Broytman wrote:
> > > > > What encoding does have a text file (an HTML, to be precise)
> > > > > with text in utf-8, ads in cp1251 (ad blocks were included
> > > > > from different files) and comments in koi8-r?
> > > > > Well, I must admit the HTML was rather an exception, but
> > > > > having a text file with some strange characters (binary
> > > > > strings, or paragraphs in different encodings) is not that
> > > > > exceptional.
> > > > That's not a text file. That's a binary file containing
> > > > (hopefully delimited, and documented) sections of encoded text
> > > > in different encodings.
> > > Allow me to disagree. For me, this is a text file which I can (and
> > > do) view with a pager, edit with a text editor, list on a console,
> > > search with grep and so on. If it is not a text file by strict
> > > Python3 standards then these standards are too strict for me.
> > > Either I find a simple workaround in Python3 to work with such
> > > texts or find a different tool. I cannot avoid such files because
> > > my reality is much more complex than strict text/binary dichotomy
> > > in Python3.
> > I was not declaring your file not to be a "text file" from any
> > definition obtained from Python3 documentation, just from a common
> > sense definition of "text file".
> And in my opinion those files are perfect text. The files consist of
> lines separated by EOL characters (not necessarily EOL characters of my
> OS because it could be a text file produced in a different OS), lines
> consist of words and words of characters.

Until you know or can deduce the encoding of a file, it is binary. If
it has multiple, different, embedded encodings of text, it is still
binary. In my opinion. So these are just opinions, and naming
conventions. If you call it text, you have a different definition of
text file than I do.

> > Looking at it from Python3, though, it is clear that when opening a
> > file in "text" mode, an encoding may be specified or will be
> > assumed. That is one encoding, applying to the whole file, not 3
> > encodings, with declarations on when to switch between them. So I
> > think, in general, Python3 assumes or defines a definition of text
> > file that matches my "common sense" definition.
> I don't have problems with Python3 text. I have problems with Python3
> trying to get rid of byte strings and treating bytes as strict non-text.

Python3 is not trying to get rid of byte strings. But to some extent,
it is wanting to treat bytes as non-text... bytes can be encoded text,
but is not text until it is decoded. There is some processing that can
be done on encoded text, but it has to be done differently (in many
cases) than processing done on (non-encoded) text.

One difference is that the interpretation of what character is what
varies from encoding to encoding, so if the processing requires
understanding the characters, then the character code must be known.

On the other hand, if it suffices to detect blocks of opaque text
delimited by a known set of delimiter codes (EOL: CR, LF, combinations
thereof) then that can be done relatively easily on binary, as long as
the encoding doesn't have data puns where a multibyte encoded character
might contain the code for the delimiter as one of the bytes of the
code for the character.

> > On the other hand, Python3 provides various facilities for working
> > with such files.
> >
> > The first I'll mention is the one that follows from my description
> > of what your file really is: Python3 allows opening files in binary
> > mode, and then decoding various sections of it using whatever
> > encoding you like, using the bytes.decode() operation on various
> > sections of the file. Determination of which sections are in which
> > encodings is beyond the scope of this description of the technique,
> > and is application dependent.
> This is perhaps the most promising approach. If I can open a text
> file in binary mode, iterate it line by line, split every line of
> non-ascii bytes with .split() and process them that'd satisfy my needs.
> But still there are dragons. If I read a filename from such a file I
> read it as bytes, not str, so I can only use low-level APIs to
> manipulate those filenames. Pity.

If the file names are in an unknown encoding, both in the directory and
in the encoded text in the file listing, then unless you can deduce the
encoding, you would be limited to doing manipulations with file APIs
that support bytes, the low-level ones, yes. If you can deduce the
encoding, then you are freed from that limitation.

> Let's see a perfectly normal situation I am quite often in. A person
> sent me a directory full of MP3 files. The transport doesn't matter; it
> could be FTP, or rsync, or a zip file sent by email, or bittorrent. What
> matters is that filenames and content are in alien encodings. Most often
> it's cp1251 (the encoding used in Russian Windows) but can be koi8
Re: [Python-Dev] Bytes path support
On Fri, Aug 22, 2014 at 10:09 AM, Glenn Linderman wrote:
> > > What encoding does have a text file (an HTML, to be precise) with
> > > text in utf-8, ads in cp1251 (ad blocks were included from different
> > > files) and comments in koi8-r?
> > > Well, I must admit the HTML was rather an exception, but having a
> > > text file with some strange characters (binary strings, or paragraphs
> > > in different encodings) is not that exceptional.
> >
> > That's not a text file. That's a binary file containing (hopefully
> > delimited, and documented) sections of encoded text in different
> > encodings.
>
> Allow me to disagree. For me, this is a text file which I can (and
> do) view with a pager, edit with a text editor, list on a console,
> search with grep and so on. If it is not a text file by strict Python3
> standards then these standards are too strict for me. Either I find a
> simple workaround in Python3 to work with such texts or find a different
> tool. I cannot avoid such files because my reality is much more complex
> than strict text/binary dichotomy in Python3.

First -- we're getting OT here -- this thread was about file and path
names, not the contents of files. But I suppose I brought that in when
I talked about writing file names to files...

> The first I'll mention is the one that follows from my description of what
> your file really is: Python3 allows opening files in binary mode, and then
> decoding various sections of it using whatever encoding you like, using the
> bytes.decode() operation on various sections of the file. Determination of
> which sections are in which encodings is beyond the scope of this
> description of the technique, and is application dependent.

right -- and you would have wanted to open such a file in binary mode
with py2 as well, but in that case, you'd have the contents in a py2
string object, which has a few more convenient ways to work with text
(at least ascii-compatible) than the py3 bytes object does.

> The third is to specify the UTF-8 with the surrogate escape error handler.
> This allows non-UTF-8 codes to be loaded into memory. You, or algorithms as
> smart as you, could perhaps be developed to detect and manipulate the
> resulting "lone surrogate" codes in meaningful ways, or could simply allow
> them to ride along without interpretation, and be emitted as the original,
> into other files.

Just so I'm clear here -- if you write that back out, encoded as utf-8
-- you'll get the exact same binary blob out as came in?

I wonder if this would make it hard to preserve byte boundaries, though.

By the way, IIUC, you can also use the python latin-1 decoder --
anything latin-1 will come through correctly, anything that isn't
really latin-1 will come in as garbage, but if you re-encode with
latin-1 the original bytes will be preserved. I think this will also
preserve a 1:1 relationship between character count and byte count,
which could be handy.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[email protected]
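Both round-trip questions raised above can be checked directly. This sketch uses only built-in codecs; the sample bytes are arbitrary. It shows the behaviour when the same error handler is used for decoding and encoding, which is an assumption about how the data would be written back out.

```python
blob = bytes(range(256))  # every byte value once; not valid UTF-8 as a whole

# surrogateescape: undecodable bytes become lone surrogates (U+DC80..U+DCFF)
# and are turned back into the same bytes on encoding.
as_utf8 = blob.decode("utf-8", errors="surrogateescape")
utf8_round_trip = as_utf8.encode("utf-8", errors="surrogateescape")
assert utf8_round_trip == blob

# latin-1: every byte maps to the code point with the same number,
# so the character count always equals the byte count.
as_latin1 = blob.decode("latin-1")
assert as_latin1.encode("latin-1") == blob
assert len(as_latin1) == len(blob)

# With surrogateescape the counts differ as soon as valid multi-byte
# UTF-8 is present: here 3 bytes decode to 2 characters.
mixed = b"\xc3\xa9\xff"
text = mixed.decode("utf-8", errors="surrogateescape")
assert (len(mixed), len(text)) == (3, 2)

# Encoding without the handler refuses the lone surrogate.
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False
```

So the blob does come back unchanged, but only if the writer also passes `errors="surrogateescape"`; a plain `encode("utf-8")` raises instead of silently writing something different.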
Re: [Python-Dev] Bytes path support
On Thu, Aug 21, 2014 at 7:42 PM, Oleg Broytman wrote: > On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal < > [email protected]> wrote: > > This brings up the other key problem. If file names are (almost) > > arbitrary bytes, how do you write one to/read one from a text file > > with a particular encoding? ( or for that matter display it on a > > terminal) > >There is no such thing as an encoding of text files. So we just > write those bytes to the file So I write bytes that are encoded one way into a text file that's encoded another way, and expect to be abel to read that later? you're kidding, right? Only if that's he only thing in the file -- usually not the case with my text files. or output them to the terminal. I often do > that. My filesystems are full of files with names and content in > at least 3 different encodings - koi8-r, utf-8 and cp1251. So I open a > terminal with koi8 or utf-8 locale and fonts and some file always look > weird. But however weird they are it's possible to work with them. > Not for me (or many other users) -- terminals are sometimes set with ascii-only encoding, so non-ascii barfs -- or you get some weird control characters that mess up your terminal -- dumping arbitrary bytes to a terminal does not always "just work". > > And people still want to say posix isn't broken in this regard? > >Not at all! And broken or not broken it's what I (for many different > reasons) prefer to use for my desktops, servers, notebooks, routers and > smartphones, Sorry -- that's a Red Herring -- I agree, "broken" or "simple and consistent" is irrelevant, we all want Python to work as well as it can on such systems. The point is that if you are reading a file name from the system, and then passing it back to the system, then you can treat it as just bytes -- who cares? And if you add the byte value of 47 thing, then you can even do basic path manipulations. 
But once you want to do other things with your file name, then you need to know the encoding. And it is very, very common for users to need to do other things with filenames, and they almost always want them as text that they can read and understand. Python3 supports this case very well. But it does indeed make it hard to work with filenames when you don't know the encoding they are in. And apparently that's pretty common -- or common enough that it would be nice for Python to support it well. This trick is how -- we'd like the "just pass it around and do path manipulations" case to work with (almost) arbitrary bytes, but everything else to work naturally with text (unicode text). Which brings us to the "what APIs should accept bytes" question. I think that's been pretty much answered: All the low-level ones, so that protocol and library programmers can write code that works on systems with undefined filename encodings. But: casual users still need to do the normal things with file names and paths, and ideally those should work the same way on all systems. I think the way to do this is to abstract the path concept, like pathlib does. Back in the day, paths were "just strings", and that worked OK with py2 strings, because you could put arbitrary bytes in them. But the "py2 strings were perfect" folks seem to not acknowledge that while they are nice for matching the posix filename model, they were a pain in the neck when you needed to do somethign else like write them in to a JSON file or something. From my personal experience, non-ascii filenames are much easier to deal with if I use unicode for filenames everywhere (py2). Somehow, I have yet to be bitten by mixed encoding in filenames. So will using a surrogate-escape error handling with pathlib make all this just work? -Chris -- Christopher Barker, Ph.D. 
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

[email protected]
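A minimal sketch of the combination asked about above (surrogate-escape decoding plus pathlib), assuming a POSIX-style path and a UTF-8 file name encoding. The file name is made up for illustration, and the decoding is done explicitly with the `surrogateescape` handler so the snippet behaves the same on any platform. It also shows where the JSON case still needs care: a strict UTF-8 encode of such a name fails, while `json.dumps` with its default `ensure_ascii=True` writes the lone surrogate as a `\udcXX` escape.

```python
import json
from pathlib import PurePosixPath

# Hypothetical file name: valid UTF-8 ("café") followed by one stray
# 8-bit byte (0xE9) that is not valid UTF-8 in this position.
raw = b"/home/user/caf\xc3\xa9-\xe9.txt"

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of
# raising UnicodeDecodeError.
name = raw.decode("utf-8", "surrogateescape")

# Path manipulation works on the str form, surrogates included.
renamed = PurePosixPath(name).with_suffix(".bak")

# Encoding with the same handler recovers the original bytes.
round_tripped = str(renamed).encode("utf-8", "surrogateescape")

# A strict encode (what a plain text-mode write would do) fails.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# json.dumps escapes the lone surrogate; json.loads restores it.
as_json = json.dumps({"path": name})
restored = json.loads(as_json)["path"]
```

Whether other JSON consumers accept lone-surrogate escapes is a separate question; the sketch only shows the round trip within Python.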
Re: [Python-Dev] Bytes path support
On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman wrote:
> "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is
> utf-8, but it is not both. Maybe you meant "or" instead of "of".

I'd assume "or" was meant there, rather than "of"; it's a common typo.

Not sure why 1251, specifically, but it's not uncommon for boundary code to attempt a decode that consists of something like "attempt UTF-8 decode, and if that fails, attempt an eight-bit decode". For my MUD clients, that's pretty much required; one of the servers I frequent is completely bytes-oriented, so whatever encoding one client uses will be dutifully echoed to every other client. There are some that correctly use UTF-8, but others use whatever they feel like; and since those naughty clients are mainly on Windows, I can reasonably guess that they'll be using CP-1252. So that's what I do: UTF-8, fall back on 1252. (It's also possible some clients will be using Latin-1, but 1252 is a superset of that as far as printable characters go.)

But it's important to note that this is a method of handling junk. It's not a design intention; this is for a situation where I really want to cope with any byte stream and attempt to display it as text. And if I get something that's neither UTF-8 nor CP-1252, I will display it wrongly, and there's nothing that can be done about that.

ChrisA
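The "UTF-8, fall back on 1252" approach described above might look roughly like the sketch below. The function name `decode_best_effort` is hypothetical, not taken from any actual client. One assumption worth flagging: Python's `cp1252` codec leaves five byte values undefined, so the fallback uses `errors="replace"` to guarantee that it never raises.

```python
def decode_best_effort(data: bytes) -> str:
    """Decode as UTF-8 if possible, otherwise guess CP-1252."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in Python's
        # cp1252 codec; "replace" turns them into U+FFFD instead of raising.
        return data.decode("cp1252", errors="replace")


utf8_line = decode_best_effort("naïve".encode("utf-8"))
cp1252_line = decode_best_effort(b"na\xefve")
quoted_line = decode_best_effort(b"\x93hi\x94")
```

As the message says, this is junk handling: input in any third encoding still decodes without error, just to the wrong characters.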
Re: [Python-Dev] Bytes path support
On Fri, Aug 22, 2014 at 01:17:44PM -0700, Glenn Linderman wrote:
> > in cp1251 of utf-8 encoding
>
> "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or
> it is utf-8, but it is not both. Maybe you meant "or" instead of
> "of".

But of course!

Oleg.
--
Oleg Broytman    http://phdru.name/    [email protected]
Programmers don't die, they just GOSUB without RETURN.
Re: [Python-Dev] Bytes path support
On Fri, Aug 22, 2014 at 11:53:01AM -0700, Chris Barker wrote:
> Back in the day, paths were "just strings", and that worked OK with
> py2 strings, because you could put arbitrary bytes in them. But the "py2
> strings were perfect" folks seem to not acknowledge that while they are
> nice for matching the posix filename model, they were a pain in the neck
> when you needed to do somethign else like write them in to a JSON file or
> something.

This is the core of the problem. Python2 favors the Unix model, but Windows people pay the price. Python3 reverses that, and I'm still deciding whether I want to pay the new price.

> So will using a surrogate-escape error handling with pathlib make all this
> just work?

I'm involved in developing and maintaining a few big commercial projects that are unlikely to be ported to Python3. So I'm stuck with Python2 for many years and I haven't tried Python3. Maybe I should try a small personal project, but certainly not this year. Maybe the next one...

Oleg.
--
Oleg Broytman    http://phdru.name/    [email protected]
Programmers don't die, they just GOSUB without RETURN.
Re: [Python-Dev] Bytes path support
On Sat, Aug 23, 2014 at 07:04:20AM +1000, Chris Angelico wrote:
> On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman wrote:
> > "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is
> > utf-8, but it is not both. Maybe you meant "or" instead of "of".
>
> I'd assume "or" meant there, rather than "of", it's a common typo.
>
> Not sure why 1251, specifically

This is the encoding of Russian Windows. Files and emails in Russia are mostly in cp1251 encoding; something like 60-70%, I think. The second most popular encoding is cp866 (Russian DOS); it's used by Windows as the OEM encoding.

Oleg.
--
Oleg Broytman    http://phdru.name/    [email protected]
Programmers don't die, they just GOSUB without RETURN.
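To illustrate why the choice among these encodings matters, here is a small sketch that decodes the same bytes under the three Cyrillic 8-bit encodings mentioned in this thread. The sample word is arbitrary. All three codecs accept every byte value here, so a wrong guess raises no error; it just produces different characters.

```python
text = "Привет"

# The bytes as a Russian Windows system would typically store them.
raw = text.encode("cp1251")

# The same six bytes read under three different 8-bit encodings.
readings = {enc: raw.decode(enc) for enc in ("cp1251", "cp866", "koi8_r")}

for enc, shown in readings.items():
    print(f"{enc:>7}: {shown}")
```

Only the `cp1251` reading matches the original text; the other two are readable Cyrillic or box-drawing characters that mean nothing.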
Re: [Python-Dev] Bytes path support
On Sat, Aug 23, 2014 at 8:26 AM, Oleg Broytman wrote:
> On Sat, Aug 23, 2014 at 07:04:20AM +1000, Chris Angelico wrote:
>> On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman wrote:
>> > "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is
>> > utf-8, but it is not both. Maybe you meant "or" instead of "of".
>>
>> I'd assume "or" meant there, rather than "of", it's a common typo.
>>
>> Not sure why 1251, specifically
>
> This is the encoding of Russian Windows. Files and emails in Russia
> are mostly in cp1251 encoding; something like 60-70%, I think. The
> second popular encoding is cp866 (Russian DOS); it's used by Windows as
> OEM encoding.

Yeah, that makes sense. In any case, you pick one "most likely" 8-bit encoding and go with it.

ChrisA
Re: [Python-Dev] Bytes path support
On Sat, 23 Aug 2014 00:21:18 +0200, Oleg Broytman wrote:
> I'm involved in developing and maintaining a few big commercial
> projects that will hardly be ported to Python3. So I'm stuck with
> Python2 for many years and I haven't tried Python3. May be I should try
> a small personal project, but certainly not this year. May be the next
> one...

Yes, you should try it. Really, it's not the monster you are constructing in your mind. The functions that read filenames and return them as text use surrogate escape to preserve the bytes, and the functions that accept filenames use surrogate escape to recover those bytes before passing them back to the OS. So posix binary filenames just work, as long as the only thing you depend on is being able to split and join them on the / character (and possibly the . character) and otherwise treat the names as black boxes...which is exactly the same situation you are in with python2.

If you need to read filenames out of a file, you'll need to specify the surrogate escape error handler so that the bytes will be there to be recovered when you pass them to the file system functions, but it will work. Or, as discussed, you can treat them as binary and use the os level functions that accept binary input (which are exactly the ones you are used to using in python2). This includes os.path.split and os.path.join, which as noted are the only things you can depend on working correctly when you don't know the encoding of the filenames.

So, the way to look at this is that python3 is no worse[1] than python2 for handling posix binary filenames, and also provides additional features if you *do* know the correct encoding of the filenames.

--David

[1] modulo any remaining API bugs, which is exactly where this thread started: trying to figure out which APIs need to be able to handle binary paths and/or surrogate escaped paths so that posix filenames consistently work as well in python3 as they did in python2.
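The two routes described above can be sketched as follows, under some stated assumptions: the listing contents are made up, an in-memory buffer stands in for a real file opened with `open(..., encoding="utf-8", errors="surrogateescape")`, and `posixpath` is used directly (it is what `os.path` resolves to on POSIX) so the snippet behaves the same on any platform.

```python
import io
import posixpath

# Hypothetical listing, one file name per line; the second name contains
# 0xE9 bytes that are not valid UTF-8.
listing = b"docs/readme.txt\ndocs/r\xe9sum\xe9.txt\n"

# Route 1: read as text, with surrogateescape preserving the odd bytes.
reader = io.TextIOWrapper(
    io.BytesIO(listing),
    encoding="utf-8",
    errors="surrogateescape",
    newline="\n",
)
names = [line.rstrip("\n") for line in reader]

# Splitting on "/" works on the str form, surrogates and all.
split_text = [posixpath.split(name) for name in names]

# Encoding with the same handler yields the exact bytes to hand to the OS.
recovered = [name.encode("utf-8", "surrogateescape") for name in names]

# Route 2: stay in bytes; split and join accept bytes directly.
head, tail = posixpath.split(b"docs/r\xe9sum\xe9.txt")
rejoined = posixpath.join(head, tail)
```

On a POSIX system, `os.fsdecode` and `os.fsencode` apply the same error handler with the file system encoding, so names obtained through Route 1 can be passed to functions such as `open` or `os.stat` unchanged.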
