Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-09 Thread Walter Dörwald
On 09.01.10 01:47, Glenn Linderman wrote:

> On approximately 1/8/2010 3:59 PM, came the following characters from
> the keyboard of Victor Stinner:
>> Hi,
>>
>> Thanks for all the answers! I will try to sum up all ideas here.
> 
> One concern I have with this implementation encoding="BOM" is that if
> there is no BOM it assumes UTF-8.  That is probably a good assumption in
> some circumstances, but not in others.
> 
> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE
> encoded files include a BOM.  It is only required that UTF-16 and UTF-32
> (cases where the endianness is unspecified) contain a BOM.  Hence, it
> might be that someone would expect a UTF-16LE (or any of the formats
> that don't require a BOM, rather than UTF-8), but be willing to accept
> any BOM-discriminated format.
> 
> * Potentially, this could be expanded beyond the various Unicode
> encodings... one could envision that a program whose data files
> historically were in any particular national language locale, could want
> to be enhanced to accept Unicode, and could declare that they will accept
> any BOM-discriminated format, but want to default, in the absence of a
> BOM, to the original national language locale that they historically
> accepted.  That would provide a migration path for their old data files.
> 
> So the point is, that it might be nice to have
> "BOM-otherEncodingForDefault" for each other encoding that Python
> supports.  Not sure that is the right API, but I think it is expressive
> enough to handle the cases above.  Whether the cases solve actual
> problems or not, I couldn't say, but they seem like reasonable cases.

This is doable with the current API. Simply define a codec search
function that handles all encoding names that start with "BOM-" and pass
the "otherEncodingForDefault" part along to the codec.
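
A minimal sketch of such a search function, under stated assumptions: the
"BOM-<fallback>" naming, the helper sniff_bom() and the table _BOMS are
made up for illustration and are not part of any patch in this thread. It
only provides a stateless decode(), so it works with bytes.decode() but not
with open(), which would also need an incremental decoder.

```python
import codecs

# UTF-32 is tested before UTF-16 because the UTF-32-LE BOM
# starts with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data, default):
    """Return (encoding, bom_length) for data, or (default, 0) without a BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def _search(name):
    # Depending on the Python version the name arrives with hyphens or
    # underscores, so normalize before testing the hypothetical prefix.
    normalized = name.lower().replace("-", "_")
    if not normalized.startswith("bom_"):
        return None
    fallback = normalized[len("bom_"):]
    try:
        codecs.lookup(fallback)  # reject unknown fallback encodings
    except LookupError:
        return None

    def decode(data, errors="strict"):
        data = bytes(data)
        encoding, skip = sniff_bom(data, fallback)
        return codecs.decode(data[skip:], encoding, errors), len(data)

    def encode(text, errors="strict"):
        raise UnicodeError("BOM sniffing is only meaningful when decoding")

    return codecs.CodecInfo(encode=encode, decode=decode, name=normalized)


codecs.register(_search)

# A BOM wins over the fallback; without a BOM the fallback is used.
with_bom = codecs.BOM_UTF16_LE + "h\u00e9".encode("utf-16-le")
print(with_bom.decode("BOM-latin-1"))
print(b"h\xe9".decode("BOM-latin-1"))
```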

> It would, of course, be nicest if OS metadata had been invented way back
> when, for all OSes, such that all text files were flagged with their
> encoding... then languages could just read the encoding and do the right
> thing! But we live in the real world, instead.

Servus,
   Walter
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Walter Dörwald

Victor Stinner wrote:
> On Friday 08 January 2010 10:10:23, Martin v. Löwis wrote:
>>> The builtin open() function is unable to open a UTF-16/32 file starting
>>> with a BOM if the encoding is not specified (it raises a Unicode error).
>>> For a UTF-8 file starting with a BOM, read()/readline() also returns the
>>> BOM, whereas the BOM should be "ignored".
>>
>> It depends. If you use the utf-8-sig encoding, it *will* ignore the
>> UTF-8 signature.
>
> Sure, but it means that you only use UTF-8+BOM files. If you get UTF-8 and
> UTF-8+BOM files, you have to detect the encoding (not an easy job) or
> remove the BOM after the first read (much harder if you use a module like
> ConfigParser or csv).
>
>>> Since my proposal changes the result of TextIOWrapper.read()/readline()
>>> for files starting with a BOM, we might introduce an option to open() to
>>> enable the new behaviour. But is it really needed to keep the backward
>>> compatibility?
>>
>> Absolutely. And there is no need to produce a new option, but instead
>> use the existing options: define an encoding that auto-detects the
>> encoding from the family of BOMs. Maybe you call it encoding="sniff".
>
> Good idea, I chose open(filename, encoding="BOM").

On the surface this looks like there's an encoding named "BOM", but
looking at your patch I found that the check is still done in
TextIOWrapper. IMHO the best approach would be to implement a *real*
codec named "BOM" (or "sniff"). This doesn't require *any* changes to
the IO library. It could even be developed as a standalone project and
published in the Cheeseshop.

To see how something like this can be done, take a look at the UTF-16
codec, which switches to big-endian or little-endian mode depending on the
first read/decode call.
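
The same trick can be sketched as an incremental decoder that postpones the
choice of the real decoder until the first bytes have arrived. This is only
an illustration: the class name BOMIncrementalDecoder, the BOMS table and
the UTF-8 default are assumptions, and getstate()/setstate() are left out,
so it would not support tell()/seek() inside TextIOWrapper.

```python
import codecs

# UTF-32 first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


class BOMIncrementalDecoder(codecs.IncrementalDecoder):
    """Hypothetical decoder that picks the real decoder on the first call(s)."""

    default = "utf-8"  # assumption: encoding used when no BOM is found

    def __init__(self, errors="strict"):
        super().__init__(errors)
        self._pending = b""
        self._decoder = None

    def decode(self, data, final=False):
        if self._decoder is not None:
            return self._decoder.decode(data, final)
        self._pending += data
        # Four bytes are enough to tell all the BOMs apart; wait for them
        # unless the caller says that no more input will come.
        if len(self._pending) < 4 and not final:
            return ""
        encoding, skip = self.default, 0
        for bom, name in BOMS:
            if self._pending.startswith(bom):
                encoding, skip = name, len(bom)
                break
        self._decoder = codecs.getincrementaldecoder(encoding)(self.errors)
        data, self._pending = self._pending[skip:], b""
        return self._decoder.decode(data, final)

    def reset(self):
        super().reset()
        self._pending = b""
        self._decoder = None


# Feed a UTF-16-BE file one byte at a time.
raw = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
decoder = BOMIncrementalDecoder()
chunks = [decoder.decode(raw[i:i + 1]) for i in range(len(raw))]
chunks.append(decoder.decode(b"", final=True))
print("".join(chunks))
```

One known ambiguity remains in any such scheme: a UTF-16-LE file whose text
starts with U+0000 looks exactly like a UTF-32-LE BOM.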


Servus,
   Walter







Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-09 Thread Victor Stinner
On Saturday 09 January 2010 02:23:07, Martin v. Löwis wrote:
> While I would support combining BOM detection in the case where a file
> is opened for reading and no encoding is specified, I see two problems:
> a) if a seek operation is performed before having looked at the BOM,
>    no determination would have been made

TextIOWrapper doesn't support seeking to an arbitrary byte offset. It uses a
"cookie", which is an opaque value. Reusing a cookie from another file, or an
old cookie, is forbidden (but it doesn't raise an error). This is not specific
to the BOM check: the problem already exists for encodings using a BOM (e.g.
UTF-16).
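
For illustration, here is the only kind of seek that is meant to work on a
text file: going back to a cookie obtained from tell() on the same stream.
The sample text and variable names are made up; the in-memory BytesIO stands
in for a real file.

```python
import io

raw = io.BytesIO("h\u00e9llo\nw\u00f6rld\n".encode("utf-16"))
text = io.TextIOWrapper(raw, encoding="utf-16", newline="")

first = text.readline()
cookie = text.tell()   # opaque value, not a plain character or byte index
rest = text.read()

text.seek(cookie)      # valid: the cookie came from tell() on this stream
assert text.read() == rest
print(repr(first), repr(rest))
```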

> b) what encoding should it use on writing?

Don't change anything to writing.

With Antoine's choice, open('file.txt', 'w', encoding=None) continues to use
the current heuristic (os.device_encoding() or the system locale).

With Guido's choice, encoding="BOM", it raises an error, because the BOM check
is not supported when writing to a file. How could the BOM be checked when
creating a new (empty) file!?

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-09 Thread M.-A. Lemburg
Victor Stinner wrote:
> (2) Check for a BOM while reading or detect it before?
> 
> Everybody agrees that checking the BOM is an interesting option and should not be 
> limited to open().
> 
> Marc-Andre proposed a codecs.guess_file_encoding() function accepting a file 
> name or a binary file-like object: it returns the encoding and seek to the 
> file start or just after the BOM.
> 
> I dislike this function because it requires extra file operations (open
> (optional), read() and seek()) and it doesn't work if the file is not
> seekable (eg. a pipe). I prefer to check for a BOM at first read in
> TextIOWrapper to avoid extra file operations.
> 
> Note: I implemented the BOM check in TextIOWrapper; so it's already usable
> for any file-like object.

Yes, but the implementation is limited to just BOM checking
and thus only supports UTF-8-SIG, UTF-16 and UTF-32.

With a codecs module function we could easily extend the
encoding detection to more file types, e.g. XML files,
Python source code files, etc. that use other mechanisms
for defining the encoding.

BTW: I haven't looked at your implementation, but what happens
when your BOM check fails ? Will the implementation add the
already read bytes back to a buffer ?

This rollback action is the only reason for needing a
seekable stream in codecs.guess_stream_encoding().
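
A rough sketch of what such a helper could look like, assuming a seekable
binary stream. The function is only a proposal in this thread and is not
provided by the codecs module; its signature, the BOMS table and the
default argument are assumptions made for the example.

```python
import codecs
import io

# UTF-32 first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_stream_encoding(stream, default="utf-8"):
    """Return an encoding name and leave the stream just after the BOM.

    If no BOM is found, the bytes already read are "rolled back" by
    seeking to the starting position, which is why seek() is needed.
    """
    start = stream.tell()
    head = stream.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            stream.seek(start + len(bom))
            return name
    stream.seek(start)
    return default


stream = io.BytesIO(codecs.BOM_UTF8 + b"spam")
encoding = guess_stream_encoding(stream)
print(encoding, stream.read().decode(encoding))
```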

Another point to consider:

AFAIK, we currently have a moratorium on changes to Python
builtins. How does that match up with the proposed changes ?

Using a new codec like Walter suggested would move the
implementation into the stdlib, to which the
moratorium doesn't apply.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-09 Thread Victor Stinner
On Saturday 09 January 2010 01:47:38, you wrote:
> One concern I have with this implementation encoding="BOM" is that if
> there is no BOM it assumes UTF-8.

If no BOM is found, it falls back to the current heuristic:
os.device_encoding() or the system locale.

> (...) Hence, it might be that someone would expect a UTF-16LE (or any of 
> the formats that don't require a BOM, rather than UTF-8), but be willing 
> to accept any BOM-discriminated format.
> (...) declare that they will accept
> any BOM-discriminated format, but want to default, in the absence of a
> BOM, to the original national language locale that they historically
> accepted

You mean "if there is a BOM, use it, otherwise fall back to a specific 
charset"? How could it be declared? Maybe:

   open("file.txt", check_bom=True, encoding="UTF16-LE")
   open("file.txt", check_bom=True, encoding="latin1")

About falling back to UTF-8, it would be written:

   open("file.txt", check_bom=True, encoding="UTF-8")

As explained before, check_bom=True is only accepted for read-only file modes.

Well, why not. This is a third choice for my point (1) :-) It's between Guido's
and Antoine's choices, and I like it because we can fall back to UTF-8 instead
of the dummy system locale: Windows users will be happy to be able to use UTF-8
:-) I prefer to fall back to a fixed encoding rather than depending on the
system locale.
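
Until something like check_bom exists, the behaviour can be approximated
with a small wrapper. This is only a sketch: open_check_bom() and the BOMS
table are hypothetical names, and the wrapper is read-only. It uses peek()
on the buffered reader, so it does not need seek().

```python
import codecs
import io

# UTF-32 first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def open_check_bom(path, encoding, errors=None, newline=None):
    """Open path for reading; a BOM, if present, overrides encoding."""
    buffered = open(path, "rb")
    try:
        head = buffered.peek(4)[:4]   # look ahead without consuming
        for bom, name in BOMS:
            if head.startswith(bom):
                buffered.read(len(bom))   # skip the BOM itself
                encoding = name
                break
        return io.TextIOWrapper(buffered, encoding=encoding,
                                errors=errors, newline=newline)
    except BaseException:
        buffered.close()
        raise
```

Usage would then mirror the proposal above, e.g.
open_check_bom("file.txt", encoding="latin1").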

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-09 Thread Victor Stinner
On Saturday 09 January 2010 02:12:28, MRAB wrote:
> What about listing the possible encodings? It would try each in turn
> until it found one where the BOM matched or had no BOM:
> 
>  my_file = open(filename, 'r', encoding='UTF-8-sig|UTF-16|UTF-8')
>
> or is that taking it too far?

Yes, you're taking it too far :-) Checking the BOM is reliable, whereas
*guessing* the charset using only the byte stream can only be a heuristic.
Guessing a charset is a complex problem; there are 3rd party libraries to do
that, like the chardet project.

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Victor Stinner
On Saturday 09 January 2010 12:18:33, Walter Dörwald wrote:
> > Good idea, I chose open(filename, encoding="BOM").
> 
> On the surface this looks like there's an encoding named "BOM", but
> looking at your patch I found that the check is still done in
> TextIOWrapper. IMHO the best approach would be to implement a *real*
> codec named "BOM" (or "sniff"). This doesn't require *any* changes to
> the IO library. It could even be developed as a standalone project and
> published in the Cheeseshop.

Why not, this is another solution to point (2) (Check for a BOM while
reading or detect it before?). Which encoding would be used if there is no
BOM? UTF-8 sounds like a good choice.

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-09 Thread Victor Stinner
Hi,

On Saturday 09 January 2010 13:45:58, you wrote:
> > Note: I implemented the BOM check in TextIOWrapper; so it's already
> > usable for any file-like object.
> 
> Yes, but the implementation is limited to just BOM checking
> and thus only supports UTF-8-SIG, UTF-16 and UTF-32.

Sure, but that's already better than no BOM check :-) It looks like many 
people would appreciate UTF-8-SIG detection, since this encoding is common on 
Windows.

> BTW: I haven't looked at your implementation, but what happens
> when your BOM check fails ? Will the implementation add the
> already read bytes back to a buffer ?

My implementation is done between buffer.read() and decoder.decode(data). If 
there is a BOM: set the encoding and remove the BOM bytes from the byte 
string. Otherwise, use another algorithm to choose the encoding and leave the 
byte string unchanged.

It can be seen as a codec: it works like UTF-16 and UTF-32 codecs ;-)

> AFAIK, we currently have a moratorium on changes to Python
> builtins. How does that match up with the proposed changes ?

Oh yes, I forgot the moratorium. Some of the proposed solutions don't change
the API. E.g. Antoine proposed to leave the API unchanged: open(file) =>
open(file) :-) I don't know if that's compatible with the moratorium or not.

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Antoine Pitrou
Walter Dörwald writes:
> 
> On the surface this looks like there's an encoding named "BOM", but 
> looking at your patch I found that the check is still done in 
> TextIOWrapper. IMHO the best approach would be to implement a *real* 
> codec named "BOM" (or "sniff"). This doesn't require *any* changes to 
> the IO library. It could even be developed as a standalone project and 
> published in the Cheeseshop.

Sorry but this is missing the point. The point here is to improve the open()
function. I'm sure people who know about encodings are able to install the
chardet library or even whip up their own BOM detection routine...


Antoine.




[Python-Dev] [RELEASED] Python 2.7 alpha 2

2010-01-09 Thread Benjamin Peterson
On behalf of the Python development team, I'm gleeful to announce the second
alpha release of Python 2.7.

Python 2.7 is scheduled to be the last major version in the 2.x series.  It
includes many features that were first released in Python 3.1.  The faster io
module, the new nested with statement syntax, improved float repr, and the
memoryview object have been backported from 3.1. Other features include an
ordered dictionary implementation, unittest improvements, and support for ttk
Tile in Tkinter.  For a more extensive list of changes in 2.7, see
http://doc.python.org/dev/whatsnew/2.7.html or Misc/NEWS in the Python
distribution.

To download Python 2.7 visit:

 http://www.python.org/download/releases/2.7/

Please note that this is a development release, intended as a preview of new
features for the community, and is thus not suitable for production use.

The 2.7 documentation can be found at:

 http://docs.python.org/2.7

Please consider trying Python 2.7 with your code and reporting any bugs you may
notice to:

 http://bugs.python.org


Have fun!

--
Benjamin Peterson
2.7 Release Manager
benjamin at python.org
(on behalf of the entire python-dev team and 2.7's contributors)


Re: [Python-Dev] [RELEASED] Python 2.7 alpha 2

2010-01-09 Thread Karen Tracey
On Sat, Jan 9, 2010 at 12:29 PM, Benjamin Peterson wrote:

> On behalf of the Python development team, I'm gleeful to announce the
> second
> alpha release of Python 2.7.
>
>
Well yay.  Django's test suite (1242 tests) runs with just one failure on
the 2.7 alpha 2 level, and that looks to be likely due to the improved
string/float rounding so not really a problem, just a difference.  That's
down from 104 failures and 40 errors with 2.7 alpha 1.

Note on the website page http://www.python.org/download/releases/2.7/ the
"Change log for this release" link is still pointing to the alpha 1
changelog.

Thanks,
Karen


Re: [Python-Dev] [RELEASED] Python 2.7 alpha 2

2010-01-09 Thread Benjamin Peterson
2010/1/9 Karen Tracey:
> On Sat, Jan 9, 2010 at 12:29 PM, Benjamin Peterson wrote:
>>
>> On behalf of the Python development team, I'm gleeful to announce the
>> second
>> alpha release of Python 2.7.
>>
>
> Well yay.  Django's test suite (1242 tests) runs with just one failure on
> the 2.7 alpha 2 level, and that looks to be likely due to the improved
> string/float rounding so not really a problem, just a difference.  That's
> down from 104 failures and 40 errors with 2.7 alpha 1.

Excellent!

>
> Note on the website page http://www.python.org/download/releases/2.7/ the
> "Change log for this release" link is still pointing to the alpha 1
> changelog.

Thanks. I'll fix that.

>



-- 
Regards,
Benjamin


[Python-Dev] Unladen cPickle speedups in 2.7 & 3.1

2010-01-09 Thread skip
How much of the Unladen Swallow cPickle speedups have been incorporated into
2.7 & 3.1?  I'm working on trying to develop patches for 2.4 and 2.6 (the
two versions I currently care about at work - we will skip 2.5 entirely).
It appears some of their speedups may have already been merged to trunk, but
I'm not sure how much.  If a patch to merge this to 2.7 is already under
consideration I won't look at it, but I interpreted Collin Winter's response
to my query on the u-s mailing list that not everything has been done yet.

Thx,

Skip



Re: [Python-Dev] Unladen cPickle speedups in 2.7 & 3.1

2010-01-09 Thread skip

Philip> They've documented their upstream patches here:

Philip> http://code.google.com/p/unladen-swallow/wiki/UpstreamPatches

Thanks.  That will help immensely.

Skip



Re: [Python-Dev] Unladen cPickle speedups in 2.7 & 3.1

2010-01-09 Thread Antoine Pitrou
skip writes:
> 
> If a patch to merge this to 2.7 is already under
> consideration I won't look at it,

Why won't you look at it? :)
Actually, if these patches are to be merged someone should certainly look at
them, and do the (possibly) remaining work.

http://bugs.python.org/issue5683
http://bugs.python.org/issue5671

Thank you

Antoine.




Re: [Python-Dev] Unladen cPickle speedups in 2.7 & 3.1

2010-01-09 Thread Philip Jenvey

On Jan 9, 2010, at 12:00 PM, [email protected] wrote:

> How much of the Unladen Swallow cPickle speedups have been incorporated into
> 2.7 & 3.1?  I'm working on trying to develop patches for 2.4 and 2.6 (the
> two versions I currently care about at work - we will skip 2.5 entirely).
> It appears some of their speedups may have already been merged to trunk, but
> I'm not sure how much.  If a patch to merge this to 2.7 is already under
> consideration I won't look at it, but I interpreted Collin Winter's response
> to my query on the u-s mailing list that not everything has been done yet.

They've documented their upstream patches here:

http://code.google.com/p/unladen-swallow/wiki/UpstreamPatches

--
Philip Jenvey



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Martin v. Löwis
Antoine Pitrou wrote:
> Walter Dörwald writes:
>> On the surface this looks like there's an encoding named "BOM", but 
>> looking at your patch I found that the check is still done in 
>> TextIOWrapper. IMHO the best approach would be to implement a *real* 
>> codec named "BOM" (or "sniff"). This doesn't require *any* changes to 
>> the IO library. It could even be developed as a standalone project and 
>> published in the Cheeseshop.
> 
> Sorry but this is missing the point. The point here is to improve the open()
> function. I'm sure people who know about encodings are able to install the
> chardet library or even whip up their own BOM detection routine...

How does the requirement that it be implemented as a codec miss the
point?

FWIW, I agree with Walter that if it is provided through the encoding=
argument, it should be a codec. If it is built into the open function
(for whatever reason), it must be provided by some other parameter.

I do see the point that it becomes available to end users only when
released as part of Python. However, this *also* means that applications
won't be using it for another three years or so, since they'll have to
support older Python versions as well (unless it is integrated in the
case where no encoding is specified).

Regards,
Martin


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Antoine Pitrou
Martin v. Löwis writes:
> 
> > Sorry but this is missing the point. The point here is to improve the open()
> > function. I'm sure people who know about encodings are able to install the
> > chardet library or even whip up their own BOM detection routine...
> 
> How does the requirement that it be implemented as a codec miss the
> point?

If we want it to be the default, it must be able to fall back on the current
locale-based algorithm if no BOM is found. I don't think it would be easy for a
codec to do that.

> FWIW, I agree with Walter that if it is provided through the encoding=
> argument, it should be a codec. If it is built into the open function
> (for whatever reason), it must be provided by some other parameter.

Why not simply encoding=None? The default value should provide the most useful
behaviour possible. Forcing users to choose between two different autodetection
strategies (encoding=None and another one) is a little insane IMO.

Regards

Antoine.




Re: [Python-Dev] Unladen cPickle speedups in 2.7 & 3.1

2010-01-09 Thread skip
> "Antoine" == Antoine Pitrou writes:

Antoine> skip writes:
>> 
>> If a patch to merge this to 2.7 is already under
>> consideration I won't look at it,

Antoine> Why won't you look at it? :)

I meant I wouldn't look at developing one.

Skip


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Lennart Regebro
On Sat, Jan 9, 2010 at 21:28, Antoine Pitrou wrote:
> If we want it to be the default, it must be able to fallback on the current
> locale-based algorithm if no BOM is found. I don't think it would be easy
> for a codec to do that.

Right. It seems like encoding=None is the right way to go there.
encoding='BOM' would probably only work if 'BOM' isn't an encoding but
a special tag, which is ugly.

-- 
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Michael Foord

On 09/01/2010 22:14, Lennart Regebro wrote:
> On Sat, Jan 9, 2010 at 21:28, Antoine Pitrou wrote:
>> If we want it to be the default, it must be able to fallback on the current
>> locale-based algorithm if no BOM is found. I don't think it would be easy
>> for a codec to do that.
>
> Right. It seems like encoding=None is the right way to go there.
> encoding='BOM' would probably only work if 'BOM' isn't an encoding but
> a special tag, which is ugly.

I would rather see it as the default behavior for open without an
encoding specified.

I know Guido has expressed a preference against this so I won't continue
to flog it.

The current behavior however is that we have a 'guessing' algorithm
based on the platform default. Currently if you open a text file in read
mode that has a UTF-8 signature, but the platform default is something
other than UTF-8, then we open the file using what is likely to be the
incorrect encoding. Looking for the signature seems to be better
behaviour in that case.
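
To make the failure mode concrete, here is a small illustration with
made-up sample data, using cp1252 as a stand-in for a non-UTF-8 platform
default. It also shows that the existing utf-8-sig codec accepts input both
with and without the signature.

```python
import codecs

data = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")

# Decoding with a non-UTF-8 default turns the signature and every
# non-ASCII character into mojibake instead of raising an error.
wrong = data.decode("cp1252")

# utf-8-sig strips the signature when present ...
right = data.decode("utf-8-sig")
# ... and behaves like plain UTF-8 when it is absent.
no_bom = "caf\u00e9".encode("utf-8").decode("utf-8-sig")

print(repr(wrong), repr(right), repr(no_bom))
```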


All the best,

Michael

--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog




Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Martin v. Löwis
>> How does the requirement that it be implemented as a codec miss the
>> point?
> 
> If we want it to be the default, it must be able to fallback on the current
> locale-based algorithm if no BOM is found. I don't think it would be easy
> for a codec to do that.

Yes - however, Victor currently apparently *doesn't* want it to be the
default, but wants the user to specify encoding="BOM". If so, it isn't
the default, and it is easy to implement as a codec.

>> FWIW, I agree with Walter that if it is provided through the encoding=
>> argument, it should be a codec. If it is built into the open function
>> (for whatever reason), it must be provided by some other parameter.
> 
> Why not simply encoding=None?

I don't mind. Please re-read Walter's message - it only said that
*if* this is activated through encoding="BOM", *then* it must be
a codec, and could be on PyPI. I don't think Walter was talking about
the case "it is not activated through encoding='BOM'" *at all*.

> The default value should provide the most useful
> behaviour possible. Forcing users to choose between two different
> autodetection strategies (encoding=None and another one) is a little
> insane IMO.

That wouldn't disturb me much. There are a lot of things in that area
that are a little insane, starting with Microsoft Windows :-)

Regards,
Martin