Bugs item #1249749, was opened at 2005-08-01 20:23
Message generated for change (Comment added) made by lemburg
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1249749&group_id=5470

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

Category: Documentation
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: liturgist (liturgist)
Assigned to: Nobody/Anonymous (nobody)
Summary: Encodings and aliases do not match runtime

Initial Comment:
The 2.4.1 documentation has a list of standard encodings in section 4.9.2. However, this list does not seem to match what is returned by the runtime. Below is code to dump out the encodings and aliases. Please tell me if anything is incorrect.

In some cases, there are many more valid aliases than are listed in the documentation. See 'cp037' as an example.

I see that the identifiers are intended to be case insensitive. I would prefer to see the documentation provide the identifiers as they appear in encodings.aliases.aliases. The only alias containing any upper-case letters appears to be 'csHPRoman8', an alias of 'hp_roman8'.

$ cat encodingaliases.py
#!/usr/bin/env python
import sys
import encodings

def main():
    # Map each encoding name to the list of aliases that resolve to it.
    enchash = {}
    for enc in encodings.aliases.aliases.values():
        enchash[enc] = []
    for encalias in encodings.aliases.aliases.keys():
        enchash[encodings.aliases.aliases[encalias]].append(encalias)

    # Print the encodings in sorted order, each with its aliases.
    elist = enchash.keys()
    elist.sort()
    for enc in elist:
        print enc, enchash[enc]

if __name__ == '__main__':
    main()
    sys.exit(0)

13:12 pwatson [ ruth.knightsbridge.com:/home/pwatson/src/python ] 366
$ ./encodingaliases.py
ascii ['iso_ir_6', 'ansi_x3_4_1968', 'ibm367', 'iso646_us', 'us', 'cp367', '646', 'us_ascii', 'csascii', 'ansi_x3.4_1986', 'iso_646.irv_1991', 'ansi_x3.4_1968']
base64_codec ['base_64', 'base64']
big5 ['csbig5', 'big5_tw']
big5hkscs ['hkscs', 'big5_hkscs']
bz2_codec ['bz2']
cp037 ['ebcdic_cp_wt', 'ebcdic_cp_us', 'ebcdic_cp_nl', '037', 'ibm039', 'ibm037', 'csibm037', 'ebcdic_cp_ca']
cp1026 ['csibm1026', 'ibm1026', '1026']
cp1140 ['1140', 'ibm1140']
cp1250 ['1250', 'windows_1250']
cp1251 ['1251', 'windows_1251']
cp1252 ['windows_1252', '1252']
cp1253 ['1253', 'windows_1253']
cp1254 ['1254', 'windows_1254']
cp1255 ['1255', 'windows_1255']
cp1256 ['1256', 'windows_1256']
cp1257 ['1257', 'windows_1257']
cp1258 ['1258', 'windows_1258']
cp424 ['ebcdic_cp_he', 'ibm424', '424', 'csibm424']
cp437 ['ibm437', '437', 'cspc8codepage437']
cp500 ['csibm500', 'ibm500', '500', 'ebcdic_cp_ch', 'ebcdic_cp_be']
cp775 ['cspc775baltic', '775', 'ibm775']
cp850 ['ibm850', 'cspc850multilingual', '850']
cp852 ['ibm852', '852', 'cspcp852']
cp855 ['csibm855', 'ibm855', '855']
cp857 ['csibm857', 'ibm857', '857']
cp860 ['csibm860', 'ibm860', '860']
cp861 ['csibm861', 'cp_is', 'ibm861', '861']
cp862 ['cspc862latinhebrew', 'ibm862', '862']
cp863 ['csibm863', 'ibm863', '863']
cp864 ['csibm864', 'ibm864', '864']
cp865 ['csibm865', 'ibm865', '865']
cp866 ['csibm866', 'ibm866', '866']
cp869 ['csibm869', 'ibm869', '869', 'cp_gr']
cp932 ['mskanji', '932', 'ms932', 'ms_kanji']
cp949 ['uhc', 'ms949', '949']
cp950 ['ms950', '950']
euc_jis_2004 ['eucjis2004', 'jisx0213', 'euc_jis2004']
euc_jisx0213 ['eucjisx0213']
euc_jp ['eucjp', 'ujis', 'u_jis']
euc_kr ['ksc5601', 'korean', 'euckr', 'ksx1001', 'ks_c_5601', 'ks_c_5601_1987', 'ks_x_1001']
gb18030 ['gb18030_2000']
gb2312 ['chinese', 'euc_cn', 'csiso58gb231280', 'iso_ir_58', 'euccn', 'eucgb2312_cn', 'gb2312_1980', 'gb2312_80']
gbk ['cp936', 'ms936', '936']
hex_codec ['hex']
hp_roman8 ['csHPRoman8', 'r8', 'roman8']
hz ['hzgb', 'hz_gb_2312', 'hz_gb']
iso2022_jp ['iso2022jp', 'iso_2022_jp', 'csiso2022jp']
iso2022_jp_1 ['iso_2022_jp_1', 'iso2022jp_1']
iso2022_jp_2 ['iso_2022_jp_2', 'iso2022jp_2']
iso2022_jp_2004 ['iso_2022_jp_2004', 'iso2022jp_2004']
iso2022_jp_3 ['iso_2022_jp_3', 'iso2022jp_3']
iso2022_jp_ext ['iso2022jp_ext', 'iso_2022_jp_ext']
iso2022_kr ['iso_2022_kr', 'iso2022kr', 'csiso2022kr']
iso8859_10 ['csisolatin6', 'l6', 'iso_8859_10_1992', 'iso_ir_157', 'iso_8859_10', 'latin6']
iso8859_11 ['iso_8859_11', 'thai', 'iso_8859_11_2001']
iso8859_13 ['iso_8859_13']
iso8859_14 ['iso_celtic', 'iso_ir_199', 'l8', 'iso_8859_14_1998', 'iso_8859_14', 'latin8']
iso8859_15 ['iso_8859_15']
iso8859_16 ['iso_8859_16_2001', 'l10', 'iso_ir_226', 'latin10', 'iso_8859_16']
iso8859_2 ['l2', 'csisolatin2', 'iso_ir_101', 'iso_8859_2', 'iso_8859_2_1987', 'latin2']
iso8859_3 ['iso_8859_3_1988', 'l3', 'iso_ir_109', 'csisolatin3', 'iso_8859_3', 'latin3']
iso8859_4 ['csisolatin4', 'l4', 'iso_ir_110', 'iso_8859_4', 'iso_8859_4_1988', 'latin4']
iso8859_5 ['iso_8859_5_1988', 'iso_8859_5', 'cyrillic', 'csisolatincyrillic', 'iso_ir_144']
iso8859_6 ['iso_8859_6_1987', 'iso_ir_127', 'csisolatinarabic', 'asmo_708', 'iso_8859_6', 'ecma_114', 'arabic']
iso8859_7 ['ecma_118', 'greek8', 'iso_8859_7', 'iso_ir_126', 'elot_928', 'iso_8859_7_1987', 'csisolatingreek', 'greek']
iso8859_8 ['iso_8859_8_1988', 'iso_ir_138', 'iso_8859_8', 'csisolatinhebrew', 'hebrew']
iso8859_9 ['l5', 'iso_8859_9_1989', 'iso_8859_9', 'csisolatin5', 'latin5', 'iso_ir_148']
johab ['cp1361', 'ms1361']
koi8_r ['cskoi8r']
latin_1 ['iso8859', 'csisolatin1', 'latin', 'l1', 'iso_ir_100', 'ibm819', 'cp819', 'iso_8859_1', 'latin1', 'iso_8859_1_1987', '8859']
mac_cyrillic ['maccyrillic']
mac_greek ['macgreek']
mac_iceland ['maciceland']
mac_latin2 ['maccentraleurope', 'maclatin2']
mac_roman ['macroman']
mac_turkish ['macturkish']
mbcs ['dbcs']
ptcp154 ['cp154', 'cyrillic-asian', 'csptcp154', 'pt154']
quopri_codec ['quopri', 'quoted_printable', 'quotedprintable']
rot_13 ['rot13']
shift_jis ['s_jis', 'sjis', 'shiftjis', 'csshiftjis']
shift_jis_2004 ['shiftjis2004', 's_jis_2004', 'sjis_2004']
shift_jisx0213 ['shiftjisx0213', 'sjisx0213', 's_jisx0213']
tactis ['tis260']
tis_620 ['tis620', 'tis_620_2529_1', 'tis_620_2529_0', 'iso_ir_166', 'tis_620_0']
utf_16 ['utf16', 'u16']
utf_16_be ['utf_16be', 'unicodebigunmarked']
utf_16_le ['utf_16le', 'unicodelittleunmarked']
utf_7 ['u7', 'utf7']
utf_8 ['u8', 'utf', 'utf8_ucs4', 'utf8_ucs2', 'utf8']
uu_codec ['uu']
zlib_codec ['zlib', 'zip']
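
The script in the report is written for Python 2.4 (print statement, in-place sort of dict.keys()). As a rough sketch only, the same grouping could be written for Python 3 as below. It also lists any aliases that are not entirely lower case, which is the check behind the 'csHPRoman8' observation. The helper names group_aliases and mixed_case_aliases are hypothetical and not part of the standard library, and the contents of encodings.aliases.aliases differ between Python versions, so the output will not match the listing above exactly.

```python
import encodings.aliases
from collections import defaultdict


def group_aliases(alias_map):
    """Return {codec_name: sorted aliases} for an alias -> codec mapping.

    Hypothetical helper: it inverts the mapping in the same way the
    original script does, with codec names in sorted order.
    """
    grouped = defaultdict(list)
    for alias, codec in alias_map.items():
        grouped[codec].append(alias)
    return {codec: sorted(names) for codec, names in sorted(grouped.items())}


def mixed_case_aliases(alias_map):
    """Return the aliases that are not written entirely in lower case."""
    return sorted(alias for alias in alias_map if alias != alias.lower())


if __name__ == "__main__":
    table = encodings.aliases.aliases
    for codec, names in group_aliases(table).items():
        print(codec, names)
    print("aliases with upper-case letters:", mixed_case_aliases(table))
```

Passing the mapping in as an argument, rather than reading encodings.aliases.aliases inside the helpers, keeps them easy to try on a small hand-written dictionary.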

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2005-08-06 14:49

Message:
Logged In: YES
user_id=38388

Martin, I don't see any problem with putting the complete list of aliases into the documentation.

liturgist, don't worry about hard-coding things into the script. The extra information Martin gave in the table is not likely to become part of the standard lib, because there's not a lot you can do with it programmatically.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2005-08-06 14:41

Message:
Logged In: YES
user_id=21627

I would not like to see the documentation contain a complete list of all aliases. The documentation points out that these are "a few common aliases", i.e. I selected aliases that people are likely to encounter and are encouraged to use.

I don't think it is useful to produce the table from the code. If you want to know everything in aliases, just look at aliases directly.

----------------------------------------------------------------------

Comment By: liturgist (liturgist)
Date: 2005-08-05 19:53

Message:
Logged In: YES
user_id=197677

I would very much like to produce the doc table from code. However, I have a few questions.

It seems that encodings.aliases.aliases is a list of all encodings, not necessarily of those supported on every machine: e.g. mbcs on UNIX, or embedded systems that might exclude some large character sets to save space. Is this correct? If so, will it remain that way?

To find out whether an encoding is supported on the current machine, the code should handle the exception raised when codecs.lookup() fails. Right?

To generate the table, I need to produce the "Languages" field. This information does not seem to be available from the Python runtime.
I would much rather see this information, including a localized version of the string, come from the Python runtime than hardcode it into the script. Is that a possibility? Would it be a better approach?

The non-language-oriented encodings such as base_64 and rot_13 do not seem to have anything that distinguishes them from the human-language encodings. How can these be separated out without hardcoding?

Likewise, the non-language encodings have an "Operand type" field which would need to be generated. My feeling is, again, that this should come from the Python runtime and not be hardcoded into the doc generation script.

Any suggestions?

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-08-04 16:47

Message:
Logged In: YES
user_id=38388

Doc patches are welcome - perhaps you could enhance your script to have the doc table generated from the available codecs and aliases?

Thanks.
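
The check liturgist describes in the 2005-08-05 comment, handling the exception raised when codecs.lookup() fails, could look roughly like the sketch below. It is written for Python 3 and is an illustration, not code from the thread. The helper names is_supported and split_by_support are hypothetical. codecs.lookup() raises LookupError when no codec can be found, so that is the exception caught here. The sketch assumes no other exception type needs handling for the names in the alias table.

```python
import codecs
import encodings.aliases


def is_supported(name):
    """Return True if codecs.lookup() finds a codec for this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        # Raised when no registered search function knows the encoding.
        return False
    return True


def split_by_support(alias_map):
    """Split the codec names in an alias -> codec mapping into
    (supported, missing) lists for the running interpreter.
    """
    listed = sorted(set(alias_map.values()))
    supported = [name for name in listed if is_supported(name)]
    missing = [name for name in listed if not is_supported(name)]
    return supported, missing


if __name__ == "__main__":
    supported, missing = split_by_support(encodings.aliases.aliases)
    print("supported:", len(supported))
    print("listed but not available here:", missing)
```

This only answers the availability question. It does not supply the "Languages" or "Operand type" columns, which the thread notes are not exposed by the runtime.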