Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

norseman Thu, 30 Apr 2009 16:03:47 -0700

Martin v. Löwis wrote:

How do I get a printable unicode version of these path strings if they
contain non-unicode data?

Define "printable". One way would be to use a regular expression,
replacing all codes in a certain range with a question mark.
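
A minimal sketch of the regular-expression approach, assuming the file name was decoded with the PEP's "surrogateescape" error handler, which maps each undecodable byte 0x80-0xFF to a lone surrogate U+DC80-U+DCFF. The helper name printable_name is hypothetical, and the result is lossy: two different byte names can collapse into the same printable string.

```python
import re

# PEP 383 maps each undecodable byte 0x80-0xFF to a lone surrogate
# in the range U+DC80-U+DCFF; this pattern matches exactly that range.
_ESCAPED = re.compile("[\udc80-\udcff]")


def printable_name(name: str) -> str:
    """Replace every escaped (undecodable) byte with a question mark."""
    return _ESCAPED.sub("?", name)


raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8
name = raw.decode("utf-8", "surrogateescape")
print(printable_name(name))  # caf?.txt
```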

What I mean by printable is that the string must be valid unicode
that I can print to a UTF-8 console or place as text in a UTF-8
web page.


I think your PEP gives me a string that will not encode to
valid UTF-8 that the outside of python world likes. Did I get this
point wrong?


You are right. However, if your *only* requirement is that it should
be printable, then this is fairly underspecified. One way to get
a printable string would be this function

def printable_string(unprintable):
  return ""

This will always return a printable version of the input string...


No it will not.
It will return either nothing at all or a '\x00' depending on how a NULL
is treated. Neither prints on paper, screen or anywhere else. If you
get the cases where all bytes are not translating or printable locally
then you get nothing out. Duplicate file names usually abound too.

In our application we are running fedora with the assumption that the
filenames are UTF-8. When Windows systems FTP files to our system
the files are in CP-1251(?) and not valid UTF-8.


That would be a bug in your FTP server, no? If you want all file names
to be UTF-8, then your FTP server should arrange for that.


Which seems to be exactly what he's trying to do.

Having an algorithm that says if it's a string no problem, if it's
a byte deal with the exceptions seems simple.

How do I do this detection with the PEP proposal?


If no one has an 'elegant' solution, toss PEP and do what has to be
done.  I find the classroom is seldom related to reality.

Do I end up using the byte interface and doing the utf-8 decode
myself?


No, you should encode using the "strict" error handler, with the
locale encoding. If the file name encodes successfully, it's correct,
otherwise, it's broken.
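
The check described above can be sketched as follows, assuming the name came from an interface that decodes with the PEP's "surrogateescape" handler. The function name is_clean is hypothetical, and defaulting to sys.getfilesystemencoding() is an assumption standing in for "the locale encoding".

```python
import sys


def is_clean(name: str, encoding: str = sys.getfilesystemencoding()) -> bool:
    """Return True if the name encodes under the strict error handler.

    Names that carry escaped, undecodable bytes contain lone surrogates,
    which the strict handler refuses to encode.
    """
    try:
        name.encode(encoding, "strict")
    except UnicodeEncodeError:
        return False
    return True


good = b"report.txt".decode("utf-8", "surrogateescape")
bad = b"caf\xe9.txt".decode("utf-8", "surrogateescape")
print(is_clean(good, "utf-8"), is_clean(bad, "utf-8"))  # True False
```

A name that fails the check still round-trips: bad.encode("utf-8", "surrogateescape") gives back the original bytes, so the file can be opened even though the name cannot be printed as is.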


Exactly his problem to solve. How does he fix the broken????


Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list


====================================

Barry;
        First: See if the sender(s) will use a different "font". :)
        I would suggest you read raw bytes and handle the problem in
the usual logical way. (Translate what you can, if it looks readable
keep it otherwise send it back if possible.)  If you have to keep a
junked up name, try using a thesaurus or soundex (I know I spelled that
wrong) to help keep the meaning/sound of the file name.  If the name is
one of those computer generated gobbeldigoops - build a translation
table to use for incoming and for getting back to original bit patterns
later. Your name won't be the same but ... Plug it into that handy
utility you just wrote and you can talk much more effectively with sender.
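
One possible shape for such a translation table, as a sketch only. The class NameTable, its method names, the "_" placeholder and the "~2" suffix for collisions are all hypothetical choices; a real table would also have to be saved somewhere so the reverse lookup survives a restart.

```python
class NameTable:
    """Two-way table between original byte names and readable stand-ins."""

    def __init__(self) -> None:
        self._to_readable: dict[bytes, str] = {}
        self._to_original: dict[str, bytes] = {}

    def readable(self, original: bytes) -> str:
        """Return a printable stand-in, creating one on first sight."""
        if original in self._to_readable:
            return self._to_readable[original]
        # Keep the ASCII part; every other byte becomes an underscore.
        base = original.decode("ascii", "replace").replace("\ufffd", "_")
        candidate, n = base, 1
        while candidate in self._to_original:  # avoid duplicate names
            n += 1
            candidate = f"{base}~{n}"
        self._to_readable[original] = candidate
        self._to_original[candidate] = original
        return candidate

    def original(self, readable: str) -> bytes:
        """Return the exact bytes the stand-in was made from."""
        return self._to_original[readable]


table = NameTable()
first = table.readable(b"caf\xe9.txt")
second = table.readable(b"caf\xe8.txt")
print(first, second, table.original(second))
```

The collision suffix addresses the duplicate-name problem: two incoming names that differ only in their unreadable bytes still get distinct stand-ins, and each maps back to its own bit pattern.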


        If you can get the page-thingy (CP-1251 or whatever) specs you
can be well ahead of the game.  There are programs out there that will
convert (better or lesser) between page specs. Some work in-line.
Watch out for Python's print function not being completely compatible
with reality. The high bit bytes in ASCII have been in use for quite
some time and are (or at least supposed to be) part of the page to page
spec translations. You probably will need to know (or make a close
guess at) the 'from' language to get plausible results. If the files
are coming across the Pacific it might be a good time to form a
collaboration. (a case of: we agree that 'that' bit pattern in your
filename will become 'this' in ours. Reversal required, as in A becomes
C incoming and C becomes A outgoing.)
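
Python's own codecs can do this kind of page-to-page conversion. The sketch below assumes the sender's code page really is CP-1251, which the thread itself is unsure of, and the helper name to_utf8 is hypothetical. It is a heuristic: a few CP-1251 byte sequences also happen to be valid UTF-8 and would be passed through unchanged.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1251") -> bytes:
    """Return the name as UTF-8 bytes, converting only when needed."""
    try:
        raw.decode("utf-8")  # already valid UTF-8: leave it alone
        return raw
    except UnicodeDecodeError:
        # Assume the sender's code page; this raises UnicodeDecodeError
        # if a byte is undefined in that code page.
        return raw.decode(source_encoding).encode("utf-8")


incoming = b"\xcf\xf0\xe8\xe2\xe5\xf2.txt"  # a Russian word in CP-1251
converted = to_utf8(incoming)
print(converted.decode("utf-8"))

# The reversal for outgoing names goes through the same two codecs.
assert converted.decode("utf-8").encode("cp1251") == incoming
```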


Note:  Different machines store things differently. Intel stores High
byte last, Sun stores it first. It can be handy to know the machinery.
Net transport programs are supposed to send Sun order, not all do.
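
The byte-order point can be checked with the struct module. This is a small illustration, not something from the thread. Byte order matters for multi-byte integers and for encodings such as UTF-16, while UTF-8 and single-byte code pages like CP-1251 are plain byte sequences with no byte order to swap.

```python
import struct
import sys

value = 0x0102
little = struct.pack("<H", value)   # low byte first (Intel)
big = struct.pack(">H", value)      # high byte first
network = struct.pack("!H", value)  # network order is big-endian

print(sys.byteorder, little, big, network)

# UTF-16 depends on byte order; UTF-8 does not.
print("A".encode("utf-16-le"), "A".encode("utf-16-be"), "A".encode("utf-8"))
```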




Steve
