Re: [Python-Dev] Quick sum up about open() + BOM
On 09.01.10 01:47, Glenn Linderman wrote:
> On approximately 1/8/2010 3:59 PM, came the following characters from
> the keyboard of Victor Stinner:
>> Hi,
>>
>> Thanks for all the answers! I will try to sum up all ideas here.
>
> One concern I have with this implementation encoding="BOM" is that if
> there is no BOM it assumes UTF-8. That is probably a good assumption in
> some circumstances, but not in others.
>
> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE
> encoded files include a BOM. It is only required that UTF-16 and UTF-32
> (cases where the endianness is unspecified) contain a BOM. Hence, it
> might be that someone would expect a UTF-16LE (or any of the formats
> that don't require a BOM, rather than UTF-8), but be willing to accept
> any BOM-discriminated format.
>
> * Potentially, this could be expanded beyond the various Unicode
> encodings... one could envision that a program whose data files
> historically were in any particular national language locale, could want
> to be enhanced to accept Unicode, and could declare that they will accept
> any BOM-discriminated format, but want to default, in the absence of a
> BOM, to the original national language locale that they historically
> accepted. That would provide a migration path for their old data files.
>
> So the point is, that it might be nice to have
> "BOM-otherEncodingForDefault" for each other encoding that Python
> supports. Not sure that is the right API, but I think it is expressive
> enough to handle the cases above. Whether the cases solve actual
> problems or not, I couldn't say, but they seem like reasonable cases.
This is doable with the current API. Simply define a codec search
function that handles all encoding names that start with "BOM-" and pass
the "otherEncodingForDefault" part along to the codec.
> It would, of course, be nicest if OS metadata had been invented way back
> when, for all OSes, such that all text files were flagged with their
> encoding... then languages could just read the encoding and do the right
> thing! But we live in the real world, instead.
Servus,
   Walter
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
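A minimal sketch of the search-function idea Walter describes, not his implementation. The names `bom_search` and `_make_decode` are hypothetical, only stateless decoding is covered, and delegating `encode` to the fallback codec is an assumption made here for illustration.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM starts with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def _make_decode(fallback):
    """Build a stateless decode function that honours a BOM, else uses fallback."""
    def decode(data, errors="strict"):
        data = bytes(data)
        for bom, name in _BOMS:
            if data.startswith(bom):
                # Strip the BOM and decode the rest with the matching codec.
                return data[len(bom):].decode(name, errors), len(data)
        return data.decode(fallback, errors), len(data)
    return decode


def bom_search(name):
    """Hypothetical search function for names of the form 'BOM-<fallback>'."""
    # Depending on the Python version the name arrives with hyphens or
    # underscores, so normalise before testing the prefix.
    name = name.replace("-", "_")
    if not name.startswith("bom_"):
        return None
    fallback = name[len("bom_"):]
    fallback_info = codecs.lookup(fallback)  # LookupError if unknown
    return codecs.CodecInfo(
        name=name,
        encode=fallback_info.encode,  # assumption: writing is left unchanged
        decode=_make_decode(fallback),
    )


codecs.register(bom_search)

print(codecs.decode(b"\xff\xfeh\x00i\x00", "BOM-latin1"))
print(codecs.decode(b"caf\xe9", "BOM-latin1"))
```

Because no incremental decoder is supplied, this sketch works with `codecs.decode()` and `bytes.decode()` but would not be enough for `open(..., encoding="BOM-latin1")`.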
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Victor Stinner wrote:
> On Friday 08 January 2010 10:10:23, Martin v. Löwis wrote:
>>> Builtin open() function is unable to open an UTF-16/32 file starting
>>> with a BOM if the encoding is not specified (raise an unicode error).
>>> For an UTF-8 file starting with a BOM, read()/readline() returns also
>>> the BOM whereas the BOM should be "ignored".
>> It depends. If you use the utf-8-sig encoding, it *will* ignore the
>> UTF-8 signature.
> Sure, but it means that you only use UTF-8+BOM files. If you get UTF-8
> and UTF-8+BOM files, you have to detect the encoding (not an easy job)
> or to remove the BOM after the first read (much harder if you use a
> module like ConfigParser or csv).
>>> Since my proposition changes the result of
>>> TextIOWrapper.read()/readline() for files starting with a BOM, we
>>> might introduce an option to open() to enable the new behaviour. But
>>> is it really needed to keep the backward compatibility?
>> Absolutely. And there is no need to produce a new option, but instead
>> use the existing options: define an encoding that auto-detects the
>> encoding from the family of BOMs. Maybe you call it encoding="sniff".
> Good idea, I chose open(filename, encoding="BOM").
On the surface this looks like there's an encoding named "BOM", but
looking at your patch I found that the check is still done in
TextIOWrapper. IMHO the best approach would be to implement a *real*
codec named "BOM" (or "sniff"). This doesn't require *any* changes to
the IO library. It could even be developed as a standalone project and
published in the Cheeseshop.
To see how something like this can be done, take a look at the UTF-16
codec, which switches to big-endian or little-endian mode depending on
the first read/decode call.
Servus,
   Walter
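The codec behaviour under discussion can be checked directly with the standard library. A small sketch (variable names are illustrative only): plain utf-8 keeps the signature as U+FEFF, utf-8-sig drops it and also accepts input that has no signature, and the utf-16 codec picks its byte order from the BOM, which is the mechanism Walter points to.

```python
import codecs

text = "h\u00e9llo"
with_sig = codecs.BOM_UTF8 + text.encode("utf-8")
without_sig = text.encode("utf-8")

# Plain UTF-8 decoding keeps the signature as a leading U+FEFF character.
kept = with_sig.decode("utf-8")
# utf-8-sig strips the signature when present and tolerates its absence.
stripped = with_sig.decode("utf-8-sig")
plain = without_sig.decode("utf-8-sig")
# The utf-16 codec chooses its byte order from the BOM and consumes it.
big = (codecs.BOM_UTF16_BE + text.encode("utf-16-be")).decode("utf-16")
little = (codecs.BOM_UTF16_LE + text.encode("utf-16-le")).decode("utf-16")

print(repr(kept), repr(stripped), repr(plain), repr(big), repr(little))
```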
Re: [Python-Dev] Quick sum up about open() + BOM
On Saturday 09 January 2010 02:23:07, Martin v. Löwis wrote:
> While I would support combining BOM detection in the case where a file
> is opened for reading and no encoding is specified, I see two problems:
> a) if a seek operation is performed before having looked at the BOM,
>    no determination would have been made
TextIOWrapper doesn't support seeking to an arbitrary byte. It uses a
"cookie", which is an opaque value. Reusing a cookie from another file, or
an old cookie, is forbidden (but it doesn't raise an error). This is not
specific to the BOM check: the problem already exists for encodings using
a BOM (e.g. UTF-16).
> b) what encoding should it use on writing?
Don't change anything for writing.
With Antoine's choice: open('file.txt', 'w', encoding=None) continues to
use the current heuristic (os.device_encoding() or system locale).
With Guido's choice, encoding="BOM": it raises an error, because the BOM
check is not supported when writing into a file. How could the BOM be
checked when creating a new (empty) file!?
--
Victor Stinner
http://www.haypocalc.com/
Re: [Python-Dev] Quick sum up about open() + BOM
Victor Stinner wrote:
> (2) Check for a BOM while reading or detect it before?
>
> Everybody agrees that checking BOM is an interesting option and should
> not be limited to open().
>
> Marc-Andre proposed a codecs.guess_file_encoding() function accepting a
> file name or a binary file-like object: it returns the encoding and
> seeks to the file start or just after the BOM.
>
> I dislike this function because it requires extra file operations (open
> (optional), read() and seek()) and it doesn't work if the file is not
> seekable (e.g. a pipe). I prefer to check for a BOM at first read in
> TextIOWrapper to avoid extra file operations.
>
> Note: I implemented the BOM check in TextIOWrapper; so it's already
> usable for any file-like object.
Yes, but the implementation is limited to just BOM checking
and thus only supports UTF-8-SIG, UTF-16 and UTF-32.
With a codecs module function we could easily extend the encoding
detection to more file types, e.g. XML files, Python source code
files, etc. that use other mechanisms for defining the encoding.
BTW: I haven't looked at your implementation, but what happens
when your BOM check fails? Will the implementation add the
already read bytes back to a buffer?
This rollback action is the only reason for needing a seekable
stream in codecs.guess_stream_encoding().
Another point to consider:
AFAIK, we currently have a moratorium on changes to Python
builtins. How does that match up with the proposed changes?
Using a new codec like Walter suggested would move the
implementation into the stdlib, to which the moratorium
doesn't apply.
--
Marc-Andre Lemburg
eGenix.com
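A rough sketch of what a function like the `codecs.guess_stream_encoding()` mentioned above could look like. No such function exists in the `codecs` module; the signature, the `default` parameter and the returned names are assumptions. It also shows the rollback `seek()` that makes a seekable stream necessary.

```python
import codecs
import io

# UTF-32 before UTF-16: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_stream_encoding(stream, default=None):
    """Hypothetical helper: detect a BOM on a seekable binary stream.

    Returns the detected encoding and leaves the stream just after the
    BOM, or returns `default` and rewinds to where reading started.
    """
    start = stream.tell()
    head = stream.read(4)  # the longest BOM is four bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            stream.seek(start + len(bom))
            return name
    stream.seek(start)  # rollback: this is why the stream must be seekable
    return default


stream = io.BytesIO(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be"))
encoding = guess_stream_encoding(stream, default="latin-1")
print(encoding, stream.read().decode(encoding))
```

Detection based on XML declarations or Python source coding cookies, which the message mentions as possible extensions, is left out of this sketch.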
Re: [Python-Dev] Quick sum up about open() + BOM
On Saturday 09 January 2010 01:47:38, you wrote:
> One concern I have with this implementation encoding="BOM" is that if
> there is no BOM it assumes UTF-8.
If no BOM is found, it falls back to the current heuristic:
os.device_encoding() or the system locale.
> (...) Hence, it might be that someone would expect a UTF-16LE (or any of
> the formats that don't require a BOM, rather than UTF-8), but be willing
> to accept any BOM-discriminated format.
> (...) declare that they will accept
> any BOM-discriminated format, but want to default, in the absence of a
> BOM, to the original national language locale that they historically
> accepted
You mean "if there is a BOM, use it, otherwise fall back to a specific
charset"? How could it be declared? Maybe:
open("file.txt", check_bom=True, encoding="UTF-16-LE")
open("file.txt", check_bom=True, encoding="latin1")
Falling back to UTF-8 would be written:
open("file.txt", check_bom=True, encoding="UTF-8")
As explained before, check_bom=True is only accepted for read-only file
mode.
Well, why not. This is a third choice for my point (1) :-) It's between
Guido's and Antoine's choices, and I like it because we can fall back to
UTF-8 instead of the dummy system locale: Windows users will be happy to
be able to use UTF-8 :-) I prefer falling back to a fixed encoding rather
than depending on the system locale.
--
Victor Stinner
http://www.haypocalc.com/
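The proposed `check_bom=True` semantics ("if there is a BOM, use it, otherwise fall back to a specific charset") can be approximated today with a helper. This is only a sketch: `open_check_bom` is a hypothetical name, `open()` has no `check_bom` parameter, and the helper performs the extra open and read that the in-TextIOWrapper approach is meant to avoid.

```python
import codecs


def open_check_bom(path, fallback):
    """Open `path` for reading text, honouring a BOM if one is present.

    Hypothetical stand-in for open(path, check_bom=True, encoding=fallback).
    """
    with open(path, "rb") as binary:
        head = binary.read(4)  # the longest BOM is four bytes
    # UTF-32 is tested first because its LE BOM starts with the UTF-16-LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        encoding = "utf-32"      # this codec reads and consumes the BOM itself
    elif head.startswith(codecs.BOM_UTF8):
        encoding = "utf-8-sig"   # strips the UTF-8 signature
    elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"      # this codec reads and consumes the BOM itself
    else:
        encoding = fallback
    return open(path, "r", encoding=encoding)
```

A call such as `open_check_bom("file.txt", "latin1")` then reads legacy Latin-1 files unchanged while accepting BOM-marked Unicode files.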
Re: [Python-Dev] Quick sum up about open() + BOM
On Saturday 09 January 2010 02:12:28, MRAB wrote:
> What about listing the possible encodings? It would try each in turn
> until it found one where the BOM matched or had no BOM:
>
> my_file = open(filename, 'r', encoding='UTF-8-sig|UTF-16|UTF-8')
>
> or is that taking it too far?
Yes, you're taking it too far :-) Checking the BOM is reliable, whereas
*guessing* the charset using only the byte stream can only be a
heuristic. Guessing a charset is a complex problem; there are third-party
libraries to do that, like the chardet project.
--
Victor Stinner
http://www.haypocalc.com/
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Saturday 09 January 2010 12:18:33, Walter Dörwald wrote:
> > Good idea, I chose open(filename, encoding="BOM").
>
> On the surface this looks like there's an encoding named "BOM", but
> looking at your patch I found that the check is still done in
> TextIOWrapper. IMHO the best approach would be to implement a *real*
> codec named "BOM" (or "sniff"). This doesn't require *any* changes to
> the IO library. It could even be developed as a standalone project and
> published in the Cheeseshop.
Why not, this is another solution to point (2) (Check for a BOM while
reading or detect it before?). Which encoding would be used if there is
no BOM? UTF-8 sounds like a good choice.
--
Victor Stinner
http://www.haypocalc.com/
Re: [Python-Dev] Quick sum up about open() + BOM
Hi,
On Saturday 09 January 2010 13:45:58, you wrote:
> > Note: I implemented the BOM check in TextIOWrapper; so it's already
> > usable for any file-like object.
>
> Yes, but the implementation is limited to just BOM checking
> and thus only supports UTF-8-SIG, UTF-16 and UTF-32.
Sure, but that's already better than no BOM check :-) It looks like many
people would appreciate UTF-8-SIG detection, since this encoding is
common on Windows.
> BTW: I haven't looked at your implementation, but what happens
> when your BOM check fails? Will the implementation add the
> already read bytes back to a buffer?
My implementation sits between buffer.read() and decoder.decode(data).
If there is a BOM: set the encoding and remove the BOM bytes from the
byte string. Otherwise, use another algorithm to choose the encoding and
leave the byte string unchanged.
It can be seen as a codec: it works like the UTF-16 and UTF-32 codecs ;-)
> AFAIK, we currently have a moratorium on changes to Python
> builtins. How does that match up with the proposed changes?
Oh yes, I forgot the moratorium. Among all the solutions, some don't
change the API. E.g. Antoine proposed to leave the API unchanged:
open(file) => open(file) :-) I don't know if it's compatible with the
moratorium or not.
--
Victor Stinner
http://www.haypocalc.com/
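The step described above (inspect the first bytes, pick the encoding, drop the BOM, then hand everything to the real decoder) can be sketched as a small incremental decoder. This is an illustration under stated assumptions, not the patch itself: `BOMSniffingDecoder` is a hypothetical class, and buffering until four bytes are available is a choice made here.

```python
import codecs

# UTF-32 before UTF-16: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


class BOMSniffingDecoder:
    """Hypothetical decoder that chooses its encoding on the first chunk."""

    def __init__(self, fallback, errors="strict"):
        self.fallback = fallback
        self.errors = errors
        self._pending = b""    # bytes held back until a BOM can be ruled in or out
        self._decoder = None   # the real incremental decoder, once chosen

    def decode(self, data, final=False):
        if self._decoder is None:
            data = self._pending + data
            if len(data) < 4 and not final:
                # Not enough bytes yet to recognise the longest BOM.
                self._pending = data
                return ""
            self._pending = b""
            encoding = self.fallback
            for bom, name in _BOMS:
                if data.startswith(bom):
                    encoding = name
                    data = data[len(bom):]  # remove the BOM bytes
                    break
            self._decoder = codecs.getincrementaldecoder(encoding)(self.errors)
        return self._decoder.decode(data, final)


decoder = BOMSniffingDecoder("latin-1")
chunks = [b"\xff", b"\xfeh\x00i\x00"]
print("".join(decoder.decode(c) for c in chunks) + decoder.decode(b"", final=True))
```

Nothing is ever pushed back onto the underlying stream, so this approach also works for pipes and other non-seekable sources.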
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Walter Dörwald livinglogic.de> writes:
>
> On the surface this looks like there's an encoding named "BOM", but
> looking at your patch I found that the check is still done in
> TextIOWrapper. IMHO the best approach would be to implement a *real*
> codec named "BOM" (or "sniff"). This doesn't require *any* changes to
> the IO library. It could even be developed as a standalone project and
> published in the Cheeseshop.
Sorry but this is missing the point. The point here is to improve the
open() function. I'm sure people who know about encodings are able to
install the chardet library or even whip up their own BOM detection
routine...
Antoine.
[Python-Dev] [RELEASED] Python 2.7 alpha 2
On behalf of the Python development team, I'm gleeful to announce the
second alpha release of Python 2.7.
Python 2.7 is scheduled to be the last major version in the 2.x series.
It includes many features that were first released in Python 3.1. The
faster io module, the new nested with statement syntax, improved float
repr, and the memoryview object have been backported from 3.1. Other
features include an ordered dictionary implementation, unittest
improvements, and support for ttk Tile in Tkinter. For a more extensive
list of changes in 2.7, see http://doc.python.org/dev/whatsnew/2.7.html
or Misc/NEWS in the Python distribution.
To download Python 2.7 visit:
    http://www.python.org/download/releases/2.7/
Please note that this is a development release, intended as a preview of
new features for the community, and is thus not suitable for production
use.
The 2.7 documentation can be found at:
    http://docs.python.org/2.7
Please consider trying Python 2.7 with your code and reporting any bugs
you may notice to:
    http://bugs.python.org
Have fun!
--
Benjamin Peterson
2.7 Release Manager
benjamin at python.org
(on behalf of the entire python-dev team and 2.7's contributors)
Re: [Python-Dev] [RELEASED] Python 2.7 alpha 2
On Sat, Jan 9, 2010 at 12:29 PM, Benjamin Peterson wrote:
> On behalf of the Python development team, I'm gleeful to announce the
> second alpha release of Python 2.7.
Well yay. Django's test suite (1242 tests) runs with just one failure on
the 2.7 alpha 2 level, and that looks to be likely due to the improved
string/float rounding so not really a problem, just a difference. That's
down from 104 failures and 40 errors with 2.7 alpha 1.
Note on the website page http://www.python.org/download/releases/2.7/ the
"Change log for this release" link is still pointing to the alpha 1
changelog.
Thanks,
Karen
Re: [Python-Dev] [RELEASED] Python 2.7 alpha 2
2010/1/9 Karen Tracey :
> On Sat, Jan 9, 2010 at 12:29 PM, Benjamin Peterson wrote:
>>
>> On behalf of the Python development team, I'm gleeful to announce the
>> second alpha release of Python 2.7.
>
> Well yay. Django's test suite (1242 tests) runs with just one failure on
> the 2.7 alpha 2 level, and that looks to be likely due to the improved
> string/float rounding so not really a problem, just a difference. That's
> down from 104 failures and 40 errors with 2.7 alpha 1.
Excellent!
> Note on the website page http://www.python.org/download/releases/2.7/ the
> "Change log for this release" link is still pointing to the alpha 1
> changelog.
Thanks. I'll fix that.
--
Regards,
Benjamin
[Python-Dev] Unladen cPickle speedups in 2.7 & 3.1
How much of the Unladen Swallow cPickle speedups have been incorporated
into 2.7 & 3.1? I'm working on trying to develop patches for 2.4 and 2.6
(the two versions I currently care about at work - we will skip 2.5
entirely). It appears some of their speedups may have already been merged
to trunk, but I'm not sure how much. If a patch to merge this to 2.7 is
already under consideration I won't look at it, but I interpreted Collin
Winter's response to my query on the u-s mailing list that not everything
has been done yet.
Thx,
Skip
Re: [Python-Dev] Unladen cPickle speedups in 2.7 & 3.1
Philip> They've documented their upstream patches here:
Philip> http://code.google.com/p/unladen-swallow/wiki/UpstreamPatches
Thanks. That will help immensely.
Skip
Re: [Python-Dev] Unladen cPickle speedups in 2.7 & 3.1
pobox.com> writes:
>
> If a patch to merge this to 2.7 is already under
> consideration I won't look at it,
Why won't you look at it? :)
Actually, if these patches are to be merged someone should certainly look
at them, and do the (possibly) remaining work.
http://bugs.python.org/issue5683
http://bugs.python.org/issue5671
Thank you
Antoine.
Re: [Python-Dev] Unladen cPickle speedups in 2.7 & 3.1
On Jan 9, 2010, at 12:00 PM, [email protected] wrote:
> How much of the Unladen Swallow cPickle speedups have been incorporated
> into 2.7 & 3.1? I'm working on trying to develop patches for 2.4 and 2.6
> (the two versions I currently care about at work - we will skip 2.5
> entirely). It appears some of their speedups may have already been merged
> to trunk, but I'm not sure how much. If a patch to merge this to 2.7 is
> already under consideration I won't look at it, but I interpreted Collin
> Winter's response to my query on the u-s mailing list that not everything
> has been done yet.
They've documented their upstream patches here:
http://code.google.com/p/unladen-swallow/wiki/UpstreamPatches
--
Philip Jenvey
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Antoine Pitrou wrote:
> Walter Dörwald livinglogic.de> writes:
>> On the surface this looks like there's an encoding named "BOM", but
>> looking at your patch I found that the check is still done in
>> TextIOWrapper. IMHO the best approach would be to implement a *real*
>> codec named "BOM" (or "sniff"). This doesn't require *any* changes to
>> the IO library. It could even be developed as a standalone project and
>> published in the Cheeseshop.
>
> Sorry but this is missing the point. The point here is to improve the
> open() function. I'm sure people who know about encodings are able to
> install the chardet library or even whip up their own BOM detection
> routine...
How does the requirement that it be implemented as a codec miss the
point?
FWIW, I agree with Walter that if it is provided through the encoding=
argument, it should be a codec. If it is built into the open function
(for whatever reason), it must be provided by some other parameter.
I do see the point that it becomes available to end users only when
released as part of Python. However, this *also* means that applications
won't be using it for another three years or so, since they'll have to
support older Python versions as well (unless it is integrated in the
case where no encoding is specified).
Regards,
Martin
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Martin v. Löwis v.loewis.de> writes:
>
> > Sorry but this is missing the point. The point here is to improve the
> > open() function. I'm sure people who know about encodings are able to
> > install the chardet library or even whip up their own BOM detection
> > routine...
>
> How does the requirement that it be implemented as a codec miss the
> point?
If we want it to be the default, it must be able to fall back on the
current locale-based algorithm if no BOM is found. I don't think it would
be easy for a codec to do that.
> FWIW, I agree with Walter that if it is provided through the encoding=
> argument, it should be a codec. If it is built into the open function
> (for whatever reason), it must be provided by some other parameter.
Why not simply encoding=None? The default value should provide the most
useful behaviour possible. Forcing users to choose between two different
autodetection strategies (encoding=None and another one) is a little
insane IMO.
Regards
Antoine.
Re: [Python-Dev] Unladen cPickle speedups in 2.7 & 3.1
> "Antoine" == Antoine Pitrou writes:
Antoine> pobox.com> writes:
>>
>> If a patch to merge this to 2.7 is already under
>> consideration I won't look at it,
Antoine> Why won't you look at it? :)
I meant I wouldn't look at developing one.
Skip
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Sat, Jan 9, 2010 at 21:28, Antoine Pitrou wrote:
> If we want it to be the default, it must be able to fall back on the
> current locale-based algorithm if no BOM is found. I don't think it
> would be easy for a codec to do that.
Right. It seems like encoding=None is the right way to go there.
encoding='BOM' would probably only work if 'BOM' isn't an encoding but a
special tag, which is ugly.
--
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On 09/01/2010 22:14, Lennart Regebro wrote:
> On Sat, Jan 9, 2010 at 21:28, Antoine Pitrou wrote:
>> If we want it to be the default, it must be able to fall back on the
>> current locale-based algorithm if no BOM is found. I don't think it
>> would be easy for a codec to do that.
> Right. It seems like encoding=None is the right way to go there.
> encoding='BOM' would probably only work if 'BOM' isn't an encoding but a
> special tag, which is ugly.
I would rather see it as the default behavior for open without an
encoding specified.
I know Guido has expressed a preference against this so I won't continue
to flog it.
The current behavior however is that we have a 'guessing' algorithm based
on the platform default. Currently if you open a text file in read mode
that has a UTF-8 signature, but the platform default is something other
than UTF-8, then we open the file using what is likely to be the
incorrect encoding. Looking for the signature seems to be better
behaviour in that case.
All the best,
Michael
--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog
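The failure mode described above is easy to reproduce without touching the locale. In this small sketch, cp1252 merely stands in for a non-UTF-8 platform default (an assumption for illustration), and the same bytes are decoded both ways.

```python
import codecs

# A UTF-8 file that starts with the signature, as many Windows editors write it.
raw = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")

# Decoding with a non-UTF-8 platform default (cp1252 as a stand-in)
# turns the signature and the accented letter into mojibake.
wrong = raw.decode("cp1252")
# Honouring the signature gives the intended text.
right = raw.decode("utf-8-sig")

print(repr(wrong))
print(repr(right))
```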
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
>> How does the requirement that it be implemented as a codec miss the
>> point?
>
> If we want it to be the default, it must be able to fallback on the
> current locale-based algorithm if no BOM is found. I don't think it
> would be easy for a codec to do that.
Yes - however, Victor currently apparently *doesn't* want it to be the
default, but wants the user to specify encoding="BOM". If so, it isn't
the default, and it is easy to implement as a codec.
>> FWIW, I agree with Walter that if it is provided through the encoding=
>> argument, it should be a codec. If it is built into the open function
>> (for whatever reason), it must be provided by some other parameter.
>
> Why not simply encoding=None?
I don't mind. Please re-read Walter's message - it only said that *if*
this is activated through encoding="BOM", *then* it must be a codec, and
could be on PyPI. I don't think Walter was talking about the case "it is
not activated through encoding='BOM'" *at all*.
> The default value should provide the most useful
> behaviour possible. Forcing users to choose between two different
> autodetection strategies (encoding=None and another one) is a little
> insane IMO.
That wouldn't disturb me much. There are a lot of things in that area
that are a little insane, starting with Microsoft Windows :-)
Regards,
Martin
