[issue1037] Ill-coded identifier crashes python when coding spec is utf-8
New submission from Hye-Shik Chang:

An illegal identifier makes Python crash on UTF-8 source code/interpreters.

Python 3.0x (py3k:57555M, Aug 27 2007, 21:23:47)
[GCC 3.4.6 [FreeBSD] 20060305] on freebsd6
>>> compile(b'#coding:utf-8\n\xfc', '', 'exec')
zsh: segmentation fault (core dumped)  ./python

The problem is that tokenizer.c:verify_identifier doesn't check the return value of PyUnicode_DecodeUTF8, even though the input may contain invalid UTF-8 sequences.

--
components: Unicode
keywords: py3k
messages: 55335
nosy: hyeshik.chang
priority: high
severity: normal
status: open
title: Ill-coded identifier crashes python when coding spec is utf-8
type: crash
versions: Python 3.0

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1037>
__
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
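As a rough regression check for the report above, the same input can be fed to compile() and the outcome observed. This is only a sketch: the helper name compiles_cleanly is hypothetical, and it assumes that an interpreter with the decoding check in place rejects the bad byte with an ordinary exception rather than crashing.

```python
def compiles_cleanly(source: bytes) -> bool:
    """Return True if `source` compiles, False if it is rejected with an exception."""
    try:
        compile(source, "<test>", "exec")
    except (SyntaxError, ValueError):
        # Assumption: a fixed interpreter reports the invalid UTF-8 byte as a
        # normal exception (UnicodeDecodeError is a subclass of ValueError).
        return False
    return True


bad = b"#coding:utf-8\n\xfc"        # the input from the report: a lone 0xfc byte
good = b"#coding:utf-8\nx = 1\n"    # a well-formed control case

print(compiles_cleanly(bad), compiles_cleanly(good))
```

If the process survives and the first value is False, the invalid sequence was caught before it reached identifier verification.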
[issue1276] LookupError: unknown encoding: X-MAC-JAPANESE
Hye-Shik Chang <[EMAIL PROTECTED]> added the comment:

Added a patch that implements codecs for the CJK Macintosh encodings. I tried to implement them just like the other existing CJK codecs, but that required many inefficient mapping tables because of their odd mappings (like this: u'ABCDE' <-> 'ab' AND u'ABCD' <-> 'ac'!). So I decided to implement a general extension codec wrapper that can easily be modified through dictionaries supplied by Python code. Because all Mac CJK encodings already have codecs that implement their base encodings, I just put their differences in the Python codec code. The extension mechanism may be reused in customized codecs for in-house applications or for legacy encoding support.

The first patch was generated for the 2.6 trunk. I'm working on porting it to 3.0.

Added file: http://bugs.python.org/file10743/maccjkcodecs-1.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1276>
___
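The idea of a dictionary-driven extension on top of a base codec can be illustrated roughly as follows. This is not the code from the patch: the function name encode_with_extensions and its parameters are hypothetical, the mapping is the placeholder pair from the comment above, and the per-character fallback only suits stateless base encodings.

```python
def encode_with_extensions(text, base_encoding, extra):
    """Encode `text`, consulting `extra` (str -> bytes) before the base codec.

    The longest matching key wins, so 'ABCDE' is tried before 'ABCD'.
    """
    max_len = max(map(len, extra), default=0)
    out = bytearray()
    i = 0
    while i < len(text):
        # Try the longest possible extension key first.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in extra:
                out += extra[chunk]
                i += n
                break
        else:
            # No extension applies: fall back to the base codec for one character.
            out += text[i].encode(base_encoding)
            i += 1
    return bytes(out)


# Placeholder mapping mirroring the odd case quoted in the comment.
extra = {"ABCDE": b"ab", "ABCD": b"ac"}
print(encode_with_extensions("ABCDEx", "ascii", extra))
print(encode_with_extensions("ABCDx", "ascii", extra))
```

Keeping the differences in a plain dictionary means a customized codec only has to supply its own mapping; the base encoding's tables stay untouched.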
[issue1276] LookupError: unknown encoding: X-MAC-JAPANESE
Changes by Hye-Shik Chang <[EMAIL PROTECTED]>:

Added file: http://bugs.python.org/file10749/maccjkcodecs-1-py3k.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1276>
___
[issue1276] LookupError: unknown encoding: X-MAC-JAPANESE
Changes by Hye-Shik Chang <[EMAIL PROTECTED]>:

Added file: http://bugs.python.org/file11170/cjkmactemporary.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1276>
___
[issue1276] LookupError: unknown encoding: X-MAC-JAPANESE
Hye-Shik Chang <[EMAIL PROTECTED]> added the comment:

Committed patch "cjkmactemporary.diff" as r65988 in the py3k branch. I'll open another issue for a cjkcodecs implementation of the Mac codecs.

--
resolution:  -> fixed
status: open -> closed

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1276>
___
[issue3685] Crash while compiling Python 3000 in OpenBSD 4.4
Hye-Shik Chang <[EMAIL PROTECTED]> added the comment:

This problem is due to a bug in OpenBSD's libc. It was fixed 3 days ago. (http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/wcschr.c#rev1.4)

We can work around it by replacing wcschr(ws, L'\0') with ws + wcslen(ws).

--
nosy: +hyeshik.chang

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3685>
___
[issue3594] PyTokenizer_FindEncoding() never succeeds
Hye-Shik Chang <[EMAIL PROTECTED]> added the comment:

pitrou, that's because Python source code can't be tokenized correctly when it's encoded in a few odd encodings like iso-2022 or shift-jis, which use \, (, ) and " as the second byte of two-byte character sequences. For example, '\x81\\' is HORIZONTAL BAR in shift-jis, and exec('print "\x81\\"') fails because the closing " is escaped by the second byte of '\x81\\'.

--
nosy: +hyeshik.chang

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3594>
___
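The trail-byte collision described above can be observed directly with Python's shift_jis codec. The sketch below is illustrative only: the helper name trail_byte_is_backslash is hypothetical, and the sample characters (KATAKANA LETTER SO and the kanji U+8868) are chosen as well-known cases rather than taken from the report.

```python
import unicodedata


def trail_byte_is_backslash(ch: str) -> bool:
    """Return True if the shift_jis encoding of `ch` ends with byte 0x5C."""
    raw = ch.encode("shift_jis")
    return len(raw) == 2 and raw.endswith(b"\\")


samples = [
    "\u30bd",  # KATAKANA LETTER SO, encoded as 0x83 0x5C
    "\u8868",  # CJK ideograph, encoded as 0x95 0x5C
    "\u3042",  # HIRAGANA LETTER A, encoded as 0x82 0xA0 (no collision)
]

for ch in samples:
    print(unicodedata.name(ch), ch.encode("shift_jis"), trail_byte_is_backslash(ch))
```

A byte-oriented tokenizer that sees such a character immediately before a closing quote treats the 0x5C trail byte as an escape, which is the failure mode the comment describes.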
[issue2066] Adding new CNS11643 support, a *huge* charset, in cjkcodecs
New submission from Hye-Shik Chang:

This patch adds CNS11643 support to the Python unicode codecs. CNS11643 is a huge character set which is used in EUC-TW and ISO-2022-CN. CJKCodecs has had CNS11643 support for at least 4 years, but I dropped it when integrating into Python because of its huge size. EUC-TW and ISO-2022-CN aren't widely used, although they are still regarded as major encodings.

In my patch, the CNS11643 charset support can be disabled by adding -DNO_CNS11643 to CFLAGS for light platforms. The mapping source code for the charset is 900K, and it adds about 350K to _codecs_tw.so (on POSIX) or python26.dll (on Win32).

What do you think about adding this code?

--
components: Unicode
files: cns11643-r1.diff.gz
messages: 62282
nosy: hyeshik.chang
priority: low
severity: normal
status: open
title: Adding new CNS11643 support, a *huge* charset, in cjkcodecs
versions: Python 2.6, Python 3.0
Added file: http://bugs.python.org/file9408/cns11643-r1.diff.gz

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2066>
__
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Changes by Hye-Shik Chang:

--
title: Adding new CNS11643 support, a *huge* charset, in cjkcodecs -> Adding new CNS11643, a *huge* charset, support in cjkcodecs

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2066>
__
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Hye-Shik Chang added the comment:

I've generated the mapping table from ICU's CNS11643-1992 mapping. I see that CNS11643 is quite rarely used on the internet, but it's the only national standard character set in Taiwan. When I asked Taiwanese Python users, even they didn't think it was necessary to add it to Python.

I'll study how much compression is possible and how efficient it is, then submit a revised patch. Thank you for the comments!

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2066>
__
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Hye-Shik Chang added the comment:

I have generated compressed mapping tables in several ways. I extracted the mapping data into individual files and reorganized them by translating them into Python source code or archiving them into a zip file. The following table shows the result (in kilobytes; also available at http://spreadsheets.google.com/pub?key=pWRBaY2ZM7mRgddF0Itd2IA ):

              none  minimal  MSjk  MSall  current
Text             0      207   312    342      570
Data           904      696   592    562      333
raw-py        3006     2392  2016   1932      996
zip-py         720      496   416    384      304
raw-pyc        952      734   624    590      346
zip-pyc        560      384   336    304      240
Text+zip-pyc   560      591   648    646      810
raw-both      3954     3124  2638   2520     1340
zip-both      1248      864   736    672      512
zip-bare       560      384   336    304      240
tarbz2-bare    496      352   320    304      240

Columns represent which mapping files are separated into external files. In "none", no mapping is left as static const C data, while in the "current" column only the new cns11643 mappings are extracted. The "minimal" set keeps the major character set for each country in static C data and moves the others out. "MSjk" additionally includes some MS codepages of Japan and Korea, and "MSall" includes all MS codepage extensions in static const C data. We may fix the list of character sets that remain as C data, or let users pick the sets with a configure option.

"Text" is the portion that remains in static const C data, which is where all the current mapping tables are. As discussed when CJKCodecs was integrated into Python, it can be shared across processes in a system and is efficient, but it can't easily be compressed or reorganized by users for redistribution.

"Data" is the externally managed mapping tables. The "raw-py" row shows the total volume of the mapping tables as Python source code. "raw-pyc" shows the compiled (pyc) version of the mapping tables. "zip-py" and "zip-pyc" are zip-compressed archives of "raw-py" and "raw-pyc", respectively. Those can be imported using the Python zipimport machinery. "zip-bare" and "tarbz2-bare" show the volume of the archived raw mapping table files, as their names suggest.

We have 560KB of mapping tables in the Python CJKCodecs part. If we choose "zip-pyc" of the "minimal" set, the binary distribution will be just as big as before even if we include the CNS11643 character set, and pythonXY.dll will get smaller by 363KB.

What do you think about the scheme? Any other ideas for compression?

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2066>
__
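The arithmetic behind the "Text+zip-pyc" row and the 363KB figure can be re-derived from the numbers in the table. The snippet below is only a cross-check of the posted figures; the variable names are chosen for illustration.

```python
# Figures in kilobytes, copied from the table in the comment above.
columns = ["none", "minimal", "MSjk", "MSall", "current"]
text = [0, 207, 312, 342, 570]        # mappings kept as static const C data
zip_pyc = [560, 384, 336, 304, 240]   # external mappings as a zipped .pyc archive

# Total shipped size when the C data is combined with the zip-pyc archive.
combined = [t + z for t, z in zip(text, zip_pyc)]
for name, total in zip(columns, combined):
    print(f"{name:>8}: {total} KB")

# Reduction in static C data when moving from "current" to "minimal".
saving = text[columns.index("current")] - text[columns.index("minimal")]
print("C data saved:", saving, "KB")
```

The sums reproduce the "Text+zip-pyc" row, and the difference between the "current" and "minimal" Text entries matches the 363KB reduction quoted for pythonXY.dll.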
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Hye-Shik Chang added the comment:

I couldn't find an appropriate method to implement an in situ compressed mapping table. AFAIK, Python has the smallest mapping table footprint for each charset among the major open source transcoding programs. I have thought about compression many times, but every neat method required a severe performance sacrifice.

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2066>
__
[issue1276] LookupError: unknown encoding: X-MAC-JAPANESE
Hye-Shik Chang added the comment:

I'll take this.

--
assignee: lemburg -> hyeshik.chang
nosy: +hyeshik.chang

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1276>
__
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Hye-Shik Chang added the comment:

When I asked Taiwanese developers how often they use these character sets, it appeared that they are almost useless in the usual computing environment in Taiwan. This will only serve historical compatibility and literal standards compliance. I'm quite neutral about adding this to Python without any request from users in Taiwan (I'm from South Korea :), but I can finish committing it with pleasure if you are still fond of the codec.

--

___
Python tracker
<http://bugs.python.org/issue2066>
___
[issue5640] Wrong print() result when unicode error handler is not 'strict'
Hye-Shik Chang added the comment:

Right. Here I upload a patch to fix the reported problem in cjkcodecs. Please test whether the patch corrects the behavior.

--
keywords: +patch
Added file: http://bugs.python.org/file13572/cjkcodecs-fix-statefulenc.diff

___
Python tracker
<http://bugs.python.org/issue5640>
___
[issue5640] Wrong print() result when unicode error handler is not 'strict'
Hye-Shik Chang added the comment:

Sorry. I just found that the fix breaks a few other unit tests. I'll check.

--

___
Python tracker
<http://bugs.python.org/issue5640>
___