[issue2857] add codec for java modified utf-8

2011-08-12 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: Tom Christiansen wrote: > > Tom Christiansen added the comment: > > Please do not call this "utf-8-java". It is called "cesu-8" per UTR #26 at: > > http://unicode.org/reports/tr26/ > > CESU-8 is *not* a valid Unicode Transform Format and should not b

[issue2857] add codec for java modified utf-8

2011-08-11 Thread Georg Brandl
Georg Brandl added the comment: +1 for calling it by the correct name (the docs can of course state that this is equivalent to "Java Modified UTF-8" or however they like to call it).

[issue2857] add codec for java modified utf-8

2011-08-11 Thread Tom Christiansen
Tom Christiansen added the comment: Please do not call this "utf-8-java". It is called "cesu-8" per UTR #26 at: http://unicode.org/reports/tr26/ CESU-8 is *not* a valid Unicode Transform Format and should not be called UTF-8. It is a real pain in the butt, caused by people who misunderand
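
For readers unfamiliar with the encoding named in the comment above, here is a minimal sketch of how CESU-8 differs from UTF-8. It assumes only what UTR #26 describes: characters above U+FFFF are first split into a UTF-16 surrogate pair, and each surrogate is then written as a 3-byte sequence. The function name `cesu8_encode` is hypothetical and is not part of Python or of any patch in this thread.

```python
def cesu8_encode(text: str) -> bytes:
    """Illustrative CESU-8 encoder (hypothetical helper, not a stdlib codec)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary characters become a surrogate pair,
            # and each surrogate is encoded as 3 bytes.
            cp -= 0x10000
            for unit in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


utf8_bytes = "\U00010400".encode("utf-8")   # 4 bytes in real UTF-8
cesu8_bytes = cesu8_encode("\U00010400")    # 6 bytes in CESU-8
```

Note that this sketch leaves U+0000 as a single zero byte, which is where CESU-8 and the Java variant discussed elsewhere in the thread part ways.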

[issue2857] add codec for java modified utf-8

2011-05-11 Thread STINNER Victor
STINNER Victor added the comment: Benchmark:
a) ./python -m timeit "(b'\xc3\xa9' * 1).decode('utf-8')"
b) ./python -m timeit "(''.join( map(chr, range(0, 128)) )*1000).encode('utf-8')"
c) ./python -m timeit "f=open('Misc/ACKS', encoding='utf-8'); acks=f.read(); f.close()" "acks.encode('utf-8

[issue2857] add codec for java modified utf-8

2011-05-11 Thread STINNER Victor
STINNER Victor added the comment: See also issue #1028.

[issue2857] add codec for java modified utf-8

2011-05-11 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: Thanks for the patch, Victor. Some comments on the patch: * the codec will have to be able to work with lone surrogates (see the Wikipedia page explaining this detail), which the UTF-8 codec in Python 3.x no longer does, so another special case i
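
The point about lone surrogates can be checked with a short sketch. It relies only on the standard `utf-8` codec and the built-in `surrogatepass` error handler; it does not show how the proposed patch handles the case, only why a special case is needed.

```python
lone = "\ud800"  # a high surrogate with no matching low surrogate

# The strict UTF-8 codec in Python 3 refuses to encode it.
try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# The "surrogatepass" handler writes the surrogate as a 3-byte sequence.
encoded = lone.encode("utf-8", "surrogatepass")

# The strict decoder rejects those bytes as well.
try:
    encoded.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# With the same handler the bytes round-trip.
decoded = encoded.decode("utf-8", "surrogatepass")
```

A codec for the Java variant would need behaviour along the lines of `surrogatepass` built in, since such data stores surrogates as individual 3-byte sequences.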

[issue2857] add codec for java modified utf-8

2011-05-11 Thread STINNER Victor
STINNER Victor added the comment: utf_8_java.patch: Implement "utf-8-java" encoding.
* It has no alias
* 'a\0b'.encode('utf-8-java') returns b'a\xc0\x80b'
* b'a\xc0\x80b'.decode('utf-8-java') returns 'a\x00b'
* I added some tests to utf-8 codec (test_invalid, test_null_byte)
* I added many
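
The NUL behaviour listed above can be approximated without the patch. The following is a rough sketch, not the patch's implementation: the helper names `encode_nul_safe` and `decode_nul_safe` are hypothetical, and the sketch covers only the U+0000 rule, not the surrogate-pair handling that Java modified UTF-8 also requires for characters above U+FFFF.

```python
def encode_nul_safe(text: str) -> bytes:
    """Encode as UTF-8, then write U+0000 as the two-byte form C0 80.

    Hypothetical helper: handles only the NUL rule.
    """
    # In UTF-8 a zero byte can only come from U+0000, so a byte-level
    # replace is safe here.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_safe(data: bytes) -> str:
    """Reverse of encode_nul_safe (hypothetical helper)."""
    # C0 never occurs in valid UTF-8, so C0 80 can only be the NUL form.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


encoded = encode_nul_safe("a\0b")
decoded = decode_nul_safe(b"a\xc0\x80b")

# The standard codec rejects the overlong form outright.
try:
    b"a\xc0\x80b".decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False
```

The byte-level replace keeps the sketch short; a real codec would do this inside the encoder and decoder loops so that error handlers and incremental decoding keep working.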

[issue2857] add codec for java modified utf-8

2011-05-10 Thread Adal Chiriliuc
Adal Chiriliuc added the comment: I use the hachoir Python package to parse Java .class files and extract the strings from them, and support for Java modified UTF-8 would have been nice. -- nosy: +adalx

[issue2857] add codec for java modified utf-8

2010-12-10 Thread STINNER Victor
STINNER Victor added the comment: > I wonder if tkinter should use this encoding. Tkinter is used to build graphical interfaces. I don't think that users type NUL bytes with their keyboard. But maybe there is a use case?

[issue2857] add codec for java modified utf-8

2010-12-06 Thread Alexander Belopolsky
Alexander Belopolsky added the comment: > TCL only uses the codec for internal representation. You might want to > interface to TCL at the C level and use the codec there, but is that > really a good reason to include the codec in the Python stdlib? I wonder if tkinter should use this encoding.

[issue2857] add codec for java modified utf-8

2009-05-16 Thread Daniel Diniz
Changes by Daniel Diniz: components: +Unicode; nosy: +ezio.melotti, haypo; priority: -> normal; stage: -> test needed; type: -> feature request; versions: +Python 2.7, Python 3.2 -Python 2.5

[issue2857] add codec for java modified utf-8

2008-05-15 Thread Georg Brandl
Georg Brandl added the comment: Since we also support oddball codecs like UTF-8-SIG, why not this one too? Given the importance of UTF-8, it seems a good idea to support common variations. -- nosy: +georg.brandl

[issue2857] add codec for java modified utf-8

2008-05-15 Thread paul rubin
paul rubin added the comment: I'm not sure what you mean by "ditto for Lucene indexes". I wasn't planning to use C code. I was hoping to write Python code to parse those indexes, then found they use this weird encoding, and Python's codec set is fairly inclusive already, so

[issue2857] add codec for java modified utf-8

2008-05-15 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: TCL only uses the codec for internal representation. You might want to interface to TCL at the C level and use the codec there, but is that really a good reason to include the codec in the Python stdlib? Ditto for parsing Lucene indexes.

[issue2857] add codec for java modified utf-8

2008-05-15 Thread paul rubin
paul rubin added the comment: Also, according to Wikipedia, Tcl uses that encoding too.

[issue2857] add codec for java modified utf-8

2008-05-15 Thread paul rubin
paul rubin added the comment: Some Java applications use it externally. The purpose seems to be to prevent NUL bytes from appearing inside encoded strings, which can confuse C libraries that expect NULs to terminate strings. My immediate application is parsing Lucene indexes
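
The NUL-termination problem described in this comment can be illustrated with `ctypes`, whose character buffers expose the same "stop at the first zero byte" view that a C library would see. This is only a sketch of the motivation. The two-byte form C0 80 is taken from the patch description elsewhere in this thread, and nothing here reflects how Lucene itself reads or writes its data.

```python
import ctypes

plain = "a\0b".encode("utf-8")                  # contains a real zero byte
modified = plain.replace(b"\x00", b"\xc0\x80")  # NUL written as C0 80

# .value reads the buffer the way C string functions do:
# everything up to the first zero byte.
seen_plain = ctypes.create_string_buffer(plain).value
seen_modified = ctypes.create_string_buffer(modified).value
```

With plain UTF-8 the C-side view loses everything after the embedded NUL, while the modified form survives intact because it contains no zero byte.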

[issue2857] add codec for java modified utf-8

2008-05-15 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: What would you use such a codec for? From the references you gave, it is only used internally for Java object serialization, so wouldn't really be of much use in Python. -- nosy: +lemburg title: add coded for java modified utf-