Re: [Python-Dev] Bytes path support

2014-08-22 Thread Steven D'Aprano
On Fri, Aug 22, 2014 at 04:42:29AM +0200, Oleg Broytman wrote:
> On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal 
>  wrote:
> > This brings up the other key problem. If file names are (almost)
> > arbitrary bytes, how do you write one to/read one from a text file
> > with a particular encoding? ( or for that matter display it on a
> > terminal)
> 
>There is no such thing as an encoding of text files.

I don't understand this comment. It seems to me that *text* files have 
to have an encoding, otherwise you can't interpret the contents as text. 
Files, of course, only contain bytes, but to be treated as bytes you 
need some way of transforming byte N to char C (or multiple bytes to C), 
which is an encoding.

Perhaps you just mean that encodings are not recorded in the text file 
itself?

To answer Chris' question, you typically cannot include arbitrary 
bytes in text files, and displaying them to the user is likewise 
problematic. The usual solution is to support some form of 
escaping, like \t #x0A; or %0D, to give a few examples.


-- 
Steven
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Martin v. Löwis
Am 22.08.14 01:56, schrieb Glenn Linderman:
> 0 and 47 are certainly originally derived from ASCII.  However, there
> could be lots of encodings that are not ASCII compatible (but in
> practice, probably very few, since most encodings _are_ ASCII
> compatible) that could be fit those constraints.
> 
> So while as a technical matter, Cameron is correct that Unix only treats
> 0 & 47 as special, and that is insufficient to declare that encodings
> must be ASCII compatible, as a practical matter, since most encodings
> are ASCII compatible anyway, it would be hard to find very many that
> could be used successfully with Unix file names that are not ASCII
> compatible, that could comply with the 0 & 47 requirements.

More importantly, existing encodings that are distinctively *not*
ASCII compatible (e.g. the EBCDIC ones) do not put the slash into 47
(instead, it is at 91 at EBCDIC, 47 is the BEL control character).

There are boundary cases, of course. VISCII is "mostly ASCII
compatible", putting graphic characters into some of the control
characters, but using those that aren't used in ASCII, anyway.

And then there is the YUSCII family of encodings, which definitely
is not ASCII compatible, as it does not contain Latin characters,
but still puts the / into 47 (and also keeps the ASCII digits and
special characters in their positions). There is also SI 960, which
has the slash, the ASCII uppercase letters, digits and special
characters, but replaces the lower-case characters with Hebrew.

So yes, Unix doesn't mandate ASCII-compatible encodings; but it
still mandates ASCII-inspired encodings. I wonder how you would
run "gcc", though, on an SI 960 system; you'ld have to type
חדד.

Regards,
Martin

___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Oleg Broytman
Hi!

On Sat, Aug 23, 2014 at 01:19:14AM +1000, Steven D'Aprano  
wrote:
> On Fri, Aug 22, 2014 at 04:42:29AM +0200, Oleg Broytman wrote:
> > On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal 
> >  wrote:
> > > This brings up the other key problem. If file names are (almost)
> > > arbitrary bytes, how do you write one to/read one from a text file
> > > with a particular encoding? ( or for that matter display it on a
> > > terminal)
> > 
> >There is no such thing as an encoding of text files.
> 
> I don't understand this comment. It seems to me that *text* files have 
> to have an encoding, otherwise you can't interpret the contents as text. 

   What encoding does have a text file (an HTML, to be precise) with
text in utf-8, ads in cp1251 (ad blocks were included from different
files) and comments in koi8-r?
   Well, I must admit the HTML was rather an exception, but having a
text file with some strange characters (binary strings, or paragraphs
in different encodings) is not that exceptional.

> Files, of course, only contain bytes, but to be treated as bytes you 
> need some way of transforming byte N to char C (or multiple bytes to C), 
> which is an encoding.

   But you don't need to treat the entire file in one encoding. Strange
characters are clearly visible so you can interpret them differently. I
am very much trained to distinguish koi8, cp1251 and utf-8 texts; I
cannot translate them mentally but I can recognize them.

> Perhaps you just mean that encodings are not recorded in the text file 
> itself?

   Yes, that too.

> To answer Chris' question, you typically cannot include arbitrary 
> bytes in text files, and displaying them to the user is likewise 
> problematic

   As a person who view utf-8 files in koi8 fonts (and vice versa) every
day I'd argue. (-:

Oleg.
-- 
 Oleg Broytmanhttp://phdru.name/[email protected]
   Programmers don't die, they just GOSUB without RETURN.
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] Summary of Python tracker Issues

2014-08-22 Thread Python tracker

ACTIVITY SUMMARY (2014-08-15 - 2014-08-22)
Python tracker at http://bugs.python.org/

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issues counts and deltas:
  open4621 (+19)
  closed 29399 (+28)
  total  34020 (+47)

Open issues with patches: 2179 


Issues opened (41)
==

#22207: Test for integer overflow on Py_ssize_t: explicitly cast to si
http://bugs.python.org/issue22207  opened by haypo

#22208: tarfile can't add in memory files (reopened)
http://bugs.python.org/issue22208  opened by markgrandi

#22209: Idle: add better access to extension information
http://bugs.python.org/issue22209  opened by terry.reedy

#22210: pdb-run-restarting-a-pdb-session
http://bugs.python.org/issue22210  opened by zhengxiexie

#22211: Remove VMS specific code in expat.h  & xmlrole.h
http://bugs.python.org/issue22211  opened by John.Malmberg

#22212: zipfile.py fails if  zlib.so module fails to build.
http://bugs.python.org/issue22212  opened by John.Malmberg

#22213: pyvenv style virtual environments unusable in an embedded syst
http://bugs.python.org/issue22213  opened by grahamd

#22214: Tkinter: Don't stringify callbacks arguments
http://bugs.python.org/issue22214  opened by serhiy.storchaka

#22215: "embedded NUL character" exceptions
http://bugs.python.org/issue22215  opened by serhiy.storchaka

#22216: smtplip STARTTLS fails at second attampt due to unsufficiant q
http://bugs.python.org/issue22216  opened by zvyn

#22217: Reprs for zipfile classes
http://bugs.python.org/issue22217  opened by serhiy.storchaka

#22218: Fix more compiler warnings "comparison between signed and unsi
http://bugs.python.org/issue22218  opened by haypo

#22219: python -mzipfile fails to add empty folders to created zip
http://bugs.python.org/issue22219  opened by Antony.Lee

#0: Ttk extensions test failure
http://bugs.python.org/issue0  opened by serhiy.storchaka

#1: ast.literal_eval confused by coding declarations
http://bugs.python.org/issue1  opened by jorgenschaefer

#2: dtoa.c: remove custom memory allocator
http://bugs.python.org/issue2  opened by haypo

#3: argparse not including '--' arguments in previous optional REM
http://bugs.python.org/issue3  opened by Jurko.Gospodnetić

#5: Add SQLite support to http.cookiejar
http://bugs.python.org/issue5  opened by demian.brecht

#6: Refactor dict result handling in Tkinter
http://bugs.python.org/issue6  opened by serhiy.storchaka

#7: Simplify tarfile iterator
http://bugs.python.org/issue7  opened by serhiy.storchaka

#8: Adapt bash readline operate-and-get-next function
http://bugs.python.org/issue8  opened by lelit

#9: wsgiref doesn't appear to ever set REMOTE_HOST in the environ
http://bugs.python.org/issue9  opened by alex

#22231: httplib: unicode url will cause an ascii codec error when comb
http://bugs.python.org/issue22231  opened by Bob.Chen

#22232: str.splitlines splitting on none-\r\n characters
http://bugs.python.org/issue22232  opened by scharron

#22233: http.client splits headers on none-\r\n characters
http://bugs.python.org/issue22233  opened by scharron

#22234: urllib.parse.urlparse accepts any falsy value as an url
http://bugs.python.org/issue22234  opened by Ztane

#22235: httplib: TypeError with file() object in ssl.py
http://bugs.python.org/issue22235  opened by erob

#22236: Do not use _default_root in Tkinter tests
http://bugs.python.org/issue22236  opened by serhiy.storchaka

#22237: sorted() docs should state that the sort is stable
http://bugs.python.org/issue22237  opened by Wilfred.Hughes

#22239: asyncio: nested event loop
http://bugs.python.org/issue22239  opened by djarb

#22240: argparse support for "python -m module" in help
http://bugs.python.org/issue22240  opened by tebeka

#22241: strftime/strptime round trip fails even for UTC datetime objec
http://bugs.python.org/issue22241  opened by akira

#22242: Doc fix in the Import section in language reference.
http://bugs.python.org/issue22242  opened by jon.poler

#22243: Documentation on try statement incorrectly implies target of e
http://bugs.python.org/issue22243  opened by mwilliamson

#22244: load_verify_locations fails to handle unicode paths on Python 
http://bugs.python.org/issue22244  opened by alex

#22246: add strptime(s, '%s')
http://bugs.python.org/issue22246  opened by akira

#22247: More incomplete module.__all__ lists
http://bugs.python.org/issue22247  opened by vadmium

#22248: urllib.request.urlopen raises exception when 30X-redirect url 
http://bugs.python.org/issue22248  opened by tomasgroth

#22249: Possibly incorrect example is given for socket.getaddrinfo()
http://bugs.python.org/issue22249  opened by Alexander.Patrakov

#22250: unittest lowercase methods
http://bugs.python.org/issue22250  opened by simonzack

#22251: Various markup errors in documentation
http://bugs.python.org/issue22251  opened by berker.peksag



Most rece

Re: [Python-Dev] Bytes path support

2014-08-22 Thread Glenn Linderman

On 8/22/2014 8:51 AM, Oleg Broytman wrote:

What encoding does have a text file (an HTML, to be precise) with
text in utf-8, ads in cp1251 (ad blocks were included from different
files) and comments in koi8-r?
Well, I must admit the HTML was rather an exception, but having a
text file with some strange characters (binary strings, or paragraphs
in different encodings) is not that exceptional.
That's not a text file. That's a binary file containing (hopefully 
delimited, and documented) sections of encoded text in different encodings.


If it is named .html and served by the server as UTF-8, then the server 
is misconfigured, or the file is incorrectly populated.
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Oleg Broytman
On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman 
 wrote:
> On 8/22/2014 8:51 AM, Oleg Broytman wrote:
> >What encoding does have a text file (an HTML, to be precise) with
> >text in utf-8, ads in cp1251 (ad blocks were included from different
> >files) and comments in koi8-r?
> >Well, I must admit the HTML was rather an exception, but having a
> >text file with some strange characters (binary strings, or paragraphs
> >in different encodings) is not that exceptional.
> That's not a text file. That's a binary file containing (hopefully
> delimited, and documented) sections of encoded text in different
> encodings.

   Allow me to disagree. For me, this is a text file which I can (and
do) view with a pager, edit with a text editor, list on a console,
search with grep and so on. If it is not a text file by strict Python3
standards then these standards are too strict for me. Either I find a
simple workaround in Python3 to work with such texts or find a different
tool. I cannot avoid such files because my reality is much more complex
than strict text/binary dichotomy in Python3.

Oleg.
-- 
 Oleg Broytmanhttp://phdru.name/[email protected]
   Programmers don't die, they just GOSUB without RETURN.
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Glenn Linderman

On 8/22/2014 9:52 AM, Oleg Broytman wrote:

On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman 
 wrote:

On 8/22/2014 8:51 AM, Oleg Broytman wrote:

What encoding does have a text file (an HTML, to be precise) with
text in utf-8, ads in cp1251 (ad blocks were included from different
files) and comments in koi8-r?
Well, I must admit the HTML was rather an exception, but having a
text file with some strange characters (binary strings, or paragraphs
in different encodings) is not that exceptional.

That's not a text file. That's a binary file containing (hopefully
delimited, and documented) sections of encoded text in different
encodings.

Allow me to disagree. For me, this is a text file which I can (and
do) view with a pager, edit with a text editor, list on a console,
search with grep and so on. If it is not a text file by strict Python3
standards then these standards are too strict for me. Either I find a
simple workaround in Python3 to work with such texts or find a different
tool. I cannot avoid such files because my reality is much more complex
than strict text/binary dichotomy in Python3.

Oleg.


I was not declaring your file not to be a "text file" from any 
definition obtained from Python3 documentation, just from a common sense 
definition of "text file".


Looking at it from Python3, though, it is clear that when opening a file 
in "text" mode, an encoding may be specified or will be assumed.  That 
is one encoding, applying to the whole file, not 3 encodings, with 
declarations on when to switch between them. So I think, in general, 
Python3 assumes or defines a definition of text file that matches my 
"common sense" definition. Also, if it is an HTML file, I doubt the 
browser will use multiple different encodings when interpreting it, so 
it is not clear that the file is of practical use for its intended 
purpose if it contains text in multiple different encodings, but is 
served using only a single encoding, unless there is javascript or some 
programming in the browser that reencodes the data.


On the other hand, Python3 provides various facilities for working with 
such files.


The first I'll mention is the one that follows from my description of 
what your file really is: Python3 allows opening files in binary mode, 
and then decoding various sections of it using whatever encoding you 
like, using the bytes.decode() operation on various sections of the 
file. Determination of which sections are in which encodings is beyond 
the scope of this description of the technique, and is application 
dependent.


The second is to specify an error handler, that, like you, is trained to 
recognize the other encodings and convert them appropriately. I'm not 
aware that such an error handler has been or could be written, myself 
not having your training.


The third is to specify the UTF-8 with the surrogate escape error 
handler. This allows non-UTF-8 codes to be loaded into memory. You, or 
algorithms as smart as you, could perhaps be developed to detect and 
manipulate the resulting "lone surrogate" codes in meaningful ways, or 
could simply allow them to ride along without interpretation, and be 
emitted as the original, into other files.


There may be other technique that I am not aware of.

Glenn
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Oleg Broytman
On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman 
 wrote:
> On 8/22/2014 9:52 AM, Oleg Broytman wrote:
> >On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman 
> > wrote:
> >>On 8/22/2014 8:51 AM, Oleg Broytman wrote:
> >>>What encoding does have a text file (an HTML, to be precise) with
> >>>text in utf-8, ads in cp1251 (ad blocks were included from different
> >>>files) and comments in koi8-r?
> >>>Well, I must admit the HTML was rather an exception, but having a
> >>>text file with some strange characters (binary strings, or paragraphs
> >>>in different encodings) is not that exceptional.
> >>That's not a text file. That's a binary file containing (hopefully
> >>delimited, and documented) sections of encoded text in different
> >>encodings.
> >Allow me to disagree. For me, this is a text file which I can (and
> >do) view with a pager, edit with a text editor, list on a console,
> >search with grep and so on. If it is not a text file by strict Python3
> >standards then these standards are too strict for me. Either I find a
> >simple workaround in Python3 to work with such texts or find a different
> >tool. I cannot avoid such files because my reality is much more complex
> >than strict text/binary dichotomy in Python3.
> 
> I was not declaring your file not to be a "text file" from any
> definition obtained from Python3 documentation, just from a common
> sense definition of "text file".

   And in my opinion those files are perfect text. The files consist of
lines separated by EOL characters (not necessary EOL characters of my OS
because it could be a text file produced in a different OS), lines
consist of words and words of characters.

> Looking at it from Python3, though, it is clear that when opening a
> file in "text" mode, an encoding may be specified or will be
> assumed.  That is one encoding, applying to the whole file, not 3
> encodings, with declarations on when to switch between them. So I
> think, in general, Python3 assumes or defines a definition of text
> file that matches my "common sense" definition.

   I don't have problems with Python3 text. I have problems with Python3
trying to get rid of byte strings and treating bytes as strict non-text.

> On the other hand, Python3 provides various facilities for working
> with such files.
> 
> The first I'll mention is the one that follows from my description
> of what your file really is: Python3 allows opening files in binary
> mode, and then decoding various sections of it using whatever
> encoding you like, using the bytes.decode() operation on various
> sections of the file. Determination of which sections are in which
> encodings is beyond the scope of this description of the technique,
> and is application dependent.

   This is perhaps the most promising approach. If I can open a text
file in binary mode, iterate it line by line, split every line of
non-ascii bytes with .split() and process them that'd satisfy my needs.
   But still there are dragons. If I read a filename from such file I
read it as bytes, not str, so I can only use low-level APIs to
manipulate with those filenames. Pity.

   Let see a perfectly normal situation I am quite often in. A person
sent me a directory full of MP3 files. The transport doesn't matter; it
could be FTP, or rsync, or a zip file sent by email, or bittorrent. What
matters is that filenames and content are in alien encodings. Most often
it's cp1251 (the encoding used in Russian Windows) but can be koi8 or
utf8. There is a playlist among the files -- a text file that lists MP3
files, every file on a single line; usually with full paths
("C:\Audio\some.mp3").
   Now I want to read filenames from the file and process the filenames
(strip paths) and files (verify existing of files, or renumber the files
or extract ID3 tags [Russian ID3 tags, whatever ID3 standard says, are
also in cp1251 of utf-8 encoding]...whatever). I don't know the encoding
of the playlist but I know it corresponds to the encoding of filenames
so I can expect those files exist on my filesystem; they have strangely
looking unreadable names but they exist.
   Just a small example of why I do want to process filenames from a
text file in an alien encoding. Without knowing the encoding in advance.

> The second is to specify an error handler, that, like you, is
> trained to recognize the other encodings and convert them
> appropriately. I'm not aware that such an error handler has been or
> could be written, myself not having your training.
> 
> The third is to specify the UTF-8 with the surrogate escape error
> handler. This allows non-UTF-8 codes to be loaded into memory. You,
> or algorithms as smart as you, could perhaps be developed to detect
> and manipulate the resulting "lone surrogate" codes in meaningful
> ways, or could simply allow them to ride along without
> interpretation, and be emitted as the original, into other files.

   Yes, these are different workarounds.

Oleg.
-- 
 Oleg Broytman

Re: [Python-Dev] Bytes path support

2014-08-22 Thread Glenn Linderman

On 8/22/2014 11:50 AM, Oleg Broytman wrote:

On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman 
 wrote:

On 8/22/2014 9:52 AM, Oleg Broytman wrote:

On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman 
 wrote:

On 8/22/2014 8:51 AM, Oleg Broytman wrote:

What encoding does have a text file (an HTML, to be precise) with
text in utf-8, ads in cp1251 (ad blocks were included from different
files) and comments in koi8-r?
Well, I must admit the HTML was rather an exception, but having a
text file with some strange characters (binary strings, or paragraphs
in different encodings) is not that exceptional.

That's not a text file. That's a binary file containing (hopefully
delimited, and documented) sections of encoded text in different
encodings.

Allow me to disagree. For me, this is a text file which I can (and
do) view with a pager, edit with a text editor, list on a console,
search with grep and so on. If it is not a text file by strict Python3
standards then these standards are too strict for me. Either I find a
simple workaround in Python3 to work with such texts or find a different
tool. I cannot avoid such files because my reality is much more complex
than strict text/binary dichotomy in Python3.

I was not declaring your file not to be a "text file" from any
definition obtained from Python3 documentation, just from a common
sense definition of "text file".

And in my opinion those files are perfect text. The files consist of
lines separated by EOL characters (not necessary EOL characters of my OS
because it could be a text file produced in a different OS), lines
consist of words and words of characters.


Until you know or can deduce the encoding of a file, it is binary. If it 
has multiple, different, embedded encodings of text, it is still binary. 
In my opinion. So these are just opinions, and naming conventions. If 
you call it text, you have a different definition of text file than I do.





Looking at it from Python3, though, it is clear that when opening a
file in "text" mode, an encoding may be specified or will be
assumed.  That is one encoding, applying to the whole file, not 3
encodings, with declarations on when to switch between them. So I
think, in general, Python3 assumes or defines a definition of text
file that matches my "common sense" definition.

I don't have problems with Python3 text. I have problems with Python3
trying to get rid of byte strings and treating bytes as strict non-text.


Python3 is not trying to get rid of byte strings. But to some extent, it 
is wanting to treat bytes as non-text... bytes can be encoded text, but 
is not text until it is decoded. There is some processing that can be 
done on encoded text, but it has to be done differently (in many cases) 
than processing done on (non-encoded) text.


One difference is the interpretation of what character is what varies 
from encoding to encoding, so if the processing requires understanding 
the characters, then the character code must be known.


On the other hand, if it suffices to detect blocks of opaque text 
delimited by a known set of delimiters codes (EOL: CR, LF, combinations 
thereof) then that can be done relatively easily on binary, as long as 
the encoding doesn't have data puns where a multibyte encoded character 
might contain the code for the delimiter as one of the bytes of the code 
for the character.



On the other hand, Python3 provides various facilities for working
with such files.

The first I'll mention is the one that follows from my description
of what your file really is: Python3 allows opening files in binary
mode, and then decoding various sections of it using whatever
encoding you like, using the bytes.decode() operation on various
sections of the file. Determination of which sections are in which
encodings is beyond the scope of this description of the technique,
and is application dependent.

This is perhaps the most promising approach. If I can open a text
file in binary mode, iterate it line by line, split every line of
non-ascii bytes with .split() and process them that'd satisfy my needs.
But still there are dragons. If I read a filename from such file I
read it as bytes, not str, so I can only use low-level APIs to
manipulate with those filenames. Pity.


If the file names are in an unknown encoding, both in the directory and 
in the encoded text in the file listing, then unless you can deduce the 
encoding, you would be limited to doing manipulations with file APIs 
that support bytes, the low-level ones, yes.  If you can deduce the 
encoding, then you are freed from that limitation.



Let see a perfectly normal situation I am quite often in. A person
sent me a directory full of MP3 files. The transport doesn't matter; it
could be FTP, or rsync, or a zip file sent by email, or bittorrent. What
matters is that filenames and content are in alien encodings. Most often
it's cp1251 (the encoding used in Russian Windows) but can be koi8 

Re: [Python-Dev] Bytes path support

2014-08-22 Thread Chris Barker
On Fri, Aug 22, 2014 at 10:09 AM, Glenn Linderman 
wrote:

> What encoding does have a text file (an HTML, to be precise) with
> text in utf-8, ads in cp1251 (ad blocks were included from different
> files) and comments in koi8-r?
>Well, I must admit the HTML was rather an exception, but having a
> text file with some strange characters (binary strings, or paragraphs
> in different encodings) is not that exceptional.
>
>  That's not a text file. That's a binary file containing (hopefully
> delimited, and documented) sections of encoded text in different
> encodings.
>
> Allow me to disagree. For me, this is a text file which I can (and
> do) view with a pager, edit with a text editor, list on a console,
> search with grep and so on. If it is not a text file by strict Python3
> standards then these standards are too strict for me. Either I find a
> simple workaround in Python3 to work with such texts or find a different
> tool. I cannot avoid such files because my reality is much more complex
> than strict text/binary dichotomy in Python3.
>
> First -- we're getting OT here -- this thread was about file and path
names, not the contents of files. But I suppose I brought that in when I
talked about writing file names to files...

The first I'll mention is the one that follows from my description of what
> your file really is: Python3 allows opening files in binary mode, and then
> decoding various sections of it using whatever encoding you like, using the
> bytes.decode() operation on various sections of the file. Determination of
> which sections are in which encodings is beyond the scope of this
> description of the technique, and is application dependent.
>

right -- and you would have wanted to open such file in binary mode with
py2 as well, but in that case, you's have the contents in py2 string
object, which has a few more convenient ways to work with text (at least
ascii-compatible) than the py3 bytes object does.

The third is to specify the UTF-8 with the surrogate escape error handler.
> This allows non-UTF-8 codes to be loaded into memory. You, or algorithms as
> smart as you, could perhaps be developed to detect and manipulate the
> resulting "lone surrogate" codes in meaningful ways, or could simply allow
> them to ride along without interpretation, and be emitted as the original,
> into other files.
>

Just so I'm clear here -- if you write that back out, encoded as utf-8 --
you'll get the exact same binary blob out as came in?

I wonder if this would make it hard to preserve byte boundaries, though.

By the way, IIUC correctly, you can also use the python latin-1 decoder --
anything latin-1 will come through correctly, anything not valid latin-1
will come in as garbage, but if you re-encode with latin-1 the original
bytes will be preserved. I think this will also preserve a 1:1 relationship
between character count and byte count, which could be handy.

-Chris



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

[email protected]
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Chris Barker
On Thu, Aug 21, 2014 at 7:42 PM, Oleg Broytman  wrote:

> On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal <
> [email protected]> wrote:
> > This brings up the other key problem. If file names are (almost)
> > arbitrary bytes, how do you write one to/read one from a text file
> > with a particular encoding? ( or for that matter display it on a
> > terminal)
>
>There is no such thing as an encoding of text files. So we just
> write those bytes to the file


So I write bytes that are encoded one way into a text file that's encoded
another way, and expect to be abel to read that later? you're kidding,
right? Only if that's  he only thing in the file -- usually not the case
with my text files.

or output them to the terminal. I often do
> that. My filesystems are full of files with names and content in
> at least 3 different encodings - koi8-r, utf-8 and cp1251. So I open a
> terminal with koi8 or utf-8 locale and fonts and some file always look
> weird. But however weird they are it's possible to work with them.
>

Not for me (or many other users) -- terminals are sometimes set with
ascii-only encoding, so non-ascii barfs -- or you get some weird control
characters that mess up your terminal -- dumping arbitrary bytes to a
terminal does not always "just work".


> > And people still want to say posix isn't broken in this regard?
>
>Not at all! And broken or not broken it's what I (for many different
> reasons) prefer to use for my desktops, servers, notebooks, routers and
> smartphones,


Sorry -- that's a Red Herring -- I agree, "broken" or "simple and
consistent" is irrelevant, we all want Python to work as well as it can on
such systems.

The point is that if you are reading a file name from the system, and then
passing it back to the system, then you can treat it as just bytes -- who
cares? And if you add the byte value of 47 thing, then you can even do
basic path manipulations. But once you want to do other things with your
file name, then you need to know the encoding. And it is very, very common
for users to need to do other things with filenames, and they almost always
want them as text that they can read and understand.

Python3 supports this case very well. But it does indeed make it hard to
work with filenames when you don't know the encoding they are in. And
apparently that's pretty common -- or common enough that it would be nice
for Python to support it well. This trick is how -- we'd like the "just
pass it around and do path manipulations" case to work with (almost)
arbitrary bytes, but everything else to work naturally with text (unicode
text).

Which brings us to the "what APIs should accept bytes" question. I think
that's been pretty much answered: All the low-level ones, so that protocol
and library programmers can write code that works on systems with undefined
filename encodings.

But: casual users still need to do the normal things with file names and
paths, and ideally those should work the same way on all systems.

I think the way to do this is to abstract the path concept, like pathlib
does. Back in the day, paths were "just strings", and that worked OK with
py2 strings, because you could put arbitrary bytes in them. But the "py2
strings were perfect" folks seem to not acknowledge that while they are
nice for matching the posix filename model, they were a pain in the neck
when you needed to do somethign else like write them in to a JSON file or
something. From my personal experience, non-ascii filenames are much easier
to deal with if I use unicode for filenames everywhere (py2). Somehow, I
have yet to be bitten by mixed encoding in filenames.

So will using a surrogate-escape error handling with pathlib make all this
just work?

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

[email protected]
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Chris Angelico
On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman  wrote:
> "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is
> utf-8, but it is not both. Maybe you meant "or" instead of "of".

I'd assume "or" meant there, rather than "of", it's a common typo.

Not sure why 1251, specifically, but it's not uncommon for boundary
code to attempt a decode that consists of something like "attempt
UTF-8 decode, and if that fails, attempt an eight-bit decode". For my
MUD clients, that's pretty much required; one of the servers I
frequent is completely bytes-oriented, so whatever encoding one client
uses will be dutifully echoed to every other client. There are some
that correctly use UTF-8, but others use whatever they feel like; and
since those naughty clients are mainly on Windows, I can reasonably
guess that they'll be using CP-1252. So that's what I do: UTF-8,
fall-back on 1252. (It's also possible some clients will be using
Latin-1, but 1252 is a superset of that.)

But it's important to note that this is a method of handling junk.
It's not a design intention; this is for a situation where I really
want to cope with any byte stream and attempt to display it as text.
And if I get something that's neither UTF-8 nor CP-1252, I will
display it wrongly, and there's nothing can be done about that.

ChrisA
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Oleg Broytman
On Fri, Aug 22, 2014 at 01:17:44PM -0700, Glenn Linderman 
 wrote:
> >in cp1251 of utf-8 encoding
> 
> "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or
> it is utf-8, but it is not both. Maybe you meant "or" instead of
> "of".

   But of course!

Oleg.
-- 
 Oleg Broytmanhttp://phdru.name/[email protected]
   Programmers don't die, they just GOSUB without RETURN.
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Oleg Broytman
On Fri, Aug 22, 2014 at 11:53:01AM -0700, Chris Barker  
wrote:
> Back in the day, paths were "just strings", and that worked OK with
> py2 strings, because you could put arbitrary bytes in them. But the "py2
> strings were perfect" folks seem to not acknowledge that while they are
> nice for matching the posix filename model, they were a pain in the neck
> when you needed to do somethign else like write them in to a JSON file or
> something.

   This is the core of the problem. Python2 favors Unix model but
Windows people pays the price. Python3 reverses that and I'm still
thinking if I want to pay the new price.

> So will using a surrogate-escape error handling with pathlib make all this
> just work?

   I'm involved in developing and maintaining a few big commercial
projects that will hardly be ported to Python3. So I'm stuck with
Python2 for many years and I haven't tried Python3. May be I should try
a small personal project, but certainly not this year. May be the next
one...

Oleg.
-- 
 Oleg Broytmanhttp://phdru.name/[email protected]
   Programmers don't die, they just GOSUB without RETURN.
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Oleg Broytman
On Sat, Aug 23, 2014 at 07:04:20AM +1000, Chris Angelico  
wrote:
> On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman  
> wrote:
> > "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is
> > utf-8, but it is not both. Maybe you meant "or" instead of "of".
> 
> I'd assume "or" meant there, rather than "of", it's a common typo.
> 
> Not sure why 1251, specifically

   This is the encoding of Russian Windows. Files and emails in Russia
are mostly in cp1251 encoding; something like 60-70%, I think. The
second popular encoding is cp866 (Russian DOS); it's used by Windows as
OEM encoding.

Oleg.
-- 
 Oleg Broytmanhttp://phdru.name/[email protected]
   Programmers don't die, they just GOSUB without RETURN.
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Chris Angelico
On Sat, Aug 23, 2014 at 8:26 AM, Oleg Broytman  wrote:
> On Sat, Aug 23, 2014 at 07:04:20AM +1000, Chris Angelico  
> wrote:
>> On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman  
>> wrote:
>> > "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is
>> > utf-8, but it is not both. Maybe you meant "or" instead of "of".
>>
>> I'd assume "or" meant there, rather than "of", it's a common typo.
>>
>> Not sure why 1251, specifically
>
>This is the encoding of Russian Windows. Files and emails in Russia
> are mostly in cp1251 encoding; something like 60-70%, I think. The
> second popular encoding is cp866 (Russian DOS); it's used by Windows as
> OEM encoding.

Yeah, that makes sense. In any case, you pick one "most likely" 8-bit
encoding and go with it.

ChrisA
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-22 Thread R. David Murray
On Sat, 23 Aug 2014 00:21:18 +0200, Oleg Broytman  wrote:
>I'm involved in developing and maintaining a few big commercial
> projects that will hardly be ported to Python3. So I'm stuck with
> Python2 for many years and I haven't tried Python3. May be I should try
> a small personal project, but certainly not this year. May be the next
> one...

Yes, you should try it.  Really, it's not the monster you are
constructing in your mind.  The functions that read filenames and return
them as text use surrogate escape to preserve the bytes, and the
functions that accept filenames use surrogate escape to recover those
bytes before passing them back to the OS.  So posix binary filenames
just work, as long as the only thing you depend on is being able to
split and join them on the / character (and possibly the . character)
and otherwise treat the names as black boxes...which is exactly the same
situation you are in in python2.

If you need to read filenames out of a file, you'll need to specify the
surrogate escape error handler so that the bytes will be there to be
recovered when you pass them to the file system functions, but it will
work.

Or, as discussed, you can treat them as binary and use the os level
functions that accept binary input (which are exactly the ones you are
used to using in python2).  This includes os.path.split and
os.path.join, which as noted are the only things you can depend on
working correctly when you don't know the encoding of the filenames.

So, the way to look at this is that python3 is no worse[1] than python2 for
handling posix binary filenames, and also provides additional features
if you *do* know the correct encoding of the filenames.

--David

[1] modulo any remaining API bugs, which is exactly where this thread
started: trying to figure out which APIs need to be able to handle
binary paths and/or surrogate escaped paths so that posix filenames
consistently work as well in python3 as they did in python2).
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com