On Oct 24, 4:14 am, Ethan Furman <et...@stoneleaf.us> wrote:
> John Machin wrote:
> > On Oct 23, 3:03 pm, Ethan Furman <et...@stoneleaf.us> wrote:
>
> >>John Machin wrote:
>
> >>>On Oct 23, 7:28 am, Ethan Furman <et...@stoneleaf.us> wrote:
>
> >>>>Greetings, all!
>
> >>>>I would like to add unicode support to my dbf project. The dbf header
> >>>>has a one-byte field to hold the encoding of the file. For example,
> >>>>\x03 is code-page 437 MS-DOS.
>
> >>>>My google-fu is apparently not up to the task of locating a complete
> >>>>resource that has a list of the 256 possible values and their
> >>>>corresponding code pages.
>
> >>>What makes you imagine that all 256 possible values are mapped to code
> >>>pages?
>
> >>I'm just wanting to make sure I have whatever is available, and
> >>preferably standard. :D
>
> >>>>So far I have found this, plus
> >>>>variations: http://support.microsoft.com/kb/129631
>
> >>>>Does anyone know of anything more complete?
>
> >>>That is for VFP3. Try the VFP9 equivalent.
>
> >>>dBase 5,5,6,7 use others which are not defined in publicly available
> >>>dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary
> >>>source: ESRI support site.
>
> >>Well, a couple hours later and still not more than I started with.
> >>Thanks for trying, though!
>
> > Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search
> > keywords and you couldn't come up with anything??
>
> Perhaps "nothing new" would have been a better description. I'd already
> seen the clicketyclick site (good info there)

Do you think so? My take is that it leaves out most of the codepage
numbers, and these two lines are wrong:
65h Nordic MS-DOS code page 865
66h Russian MS-DOS code page 866

> and all I found at ESRI
> were folks trying to figure it out, plus one link to a list that was no
> different from the vfp3 list (or was it that the list did not give the
> hex values? Either way, of no use to me.)

Try this:
http://webhelp.esri.com/arcpad/8.0/referenceguide/

> I looked at dbase.com, but came up empty-handed there (not surprising,
> since they are a commercial company).

MS and ESRI have docs ... does that mean that they are non-commercial
companies?

> I searched some more on Microsoft's site in the VFP9 section, and was
> able to find the code page section this time. Sadly, it only added
> about seven codes.
>
> At any rate, here is what I have come up with so far. Any corrections
> and/or additions greatly appreciated.
>
> code_pages = {
>     '\x01' : ('ascii', 'U.S. MS-DOS'),

All of the sources say codepage 437, so why ascii instead of cp437?

>     '\x02' : ('cp850', 'International MS-DOS'),
>     '\x03' : ('cp1252', 'Windows ANSI'),
>     '\x04' : ('mac_roman', 'Standard Macintosh'),
>     '\x64' : ('cp852', 'Eastern European MS-DOS'),
>     '\x65' : ('cp866', 'Russian MS-DOS'),
>     '\x66' : ('cp865', 'Nordic MS-DOS'),
>     '\x67' : ('cp861', 'Icelandic MS-DOS'),
>     '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'), # iffy

Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
not alone. I suggest that you omit Kamenicky until someone actually
wants it.

>     '\x69' : ('cp852', 'Mazovia (Polish) MS-DOS'), # iffy

Look 5 lines back. cp852 is 'Eastern European MS-DOS'. Mazovia
predates and is not the same as cp852. In any case, I suggest that you
omit Mazovia until someone wants it.
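A quick way to catch names like cp895 is to ask Python's codec registry directly. The following is only a minimal sketch: `missing_codecs` is a hypothetical helper name (not part of any dbf library), and the candidate list is just a handful of names taken from the table under discussion.

```python
import codecs


def missing_codecs(names):
    """Return the encoding names that Python's codec registry cannot resolve."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)  # raises LookupError for unknown encodings
        except LookupError:
            missing.append(name)
    return missing


# A few of the codec names from the proposed table (illustrative subset).
candidates = ['ascii', 'cp850', 'cp852', 'cp866', 'cp865', 'cp895']
print(missing_codecs(candidates))
```

Note that this only tells you whether a name exists, not whether it is the right one: cp852 resolves fine, yet it is still the wrong codec for Mazovia.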
Interesting reading: http://www.jastra.com.pl/klub/ogonki.htm

>     '\x6a' : ('cp737', 'Greek MS-DOS (437G)'),
>     '\x6b' : ('cp857', 'Turkish MS-DOS'),
>     '\x78' : ('big5', 'Traditional Chinese (Hong Kong SAR, Taiwan)\

big5 is *not* the same as cp950. The products that create DBF files
were designed for Windows. So when your source says that LDID 0xXX
maps to Windows codepage YYY, I would suggest that all you should do
is translate that without thinking to Python encoding cpYYY.

>                Windows'), # wag

What does "wag" mean?

>     '\x79' : ('iso2022_kr', 'Korean Windows'), # wag

Try cp949.

>     '\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore)\
>                Windows'), # wag

Very wrong. iso2022_jp_2 is supposed to include basic Japanese, basic
(1980) Chinese (GB2312) and a basic Korean kit. However to quote from
"CJKV Information Processing" by Ken Lunde, "... from a practical
point of view, ISO-2022-JP-2 ..... [is] equivalent to ISO-2022-JP-1
encoding." i.e. no Chinese support at all. Try cp936.

>     '\x7b' : ('iso2022_jp', 'Japanese Windows'), # wag

Try cp932.

>     '\x7c' : ('cp874', 'Thai Windows'), # wag
>     '\x7d' : ('cp1255', 'Hebrew Windows'),
>     '\x7e' : ('cp1256', 'Arabic Windows'),
>     '\xc8' : ('cp1250', 'Eastern European Windows'),
>     '\xc9' : ('cp1251', 'Russian Windows'),
>     '\xca' : ('cp1254', 'Turkish Windows'),
>     '\xcb' : ('cp1253', 'Greek Windows'),
>     '\x96' : ('mac_cyrillic', 'Russian Macintosh'),
>     '\x97' : ('mac_latin2', 'Macintosh EE'),
>     '\x98' : ('mac_greek', 'Greek Macintosh') }

HTH,
John
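Pulling the suggestions in this reply together, the table might end up looking something like the sketch below. It is not a complete or authoritative LDID list: it keeps only the entries discussed here, maps each Windows codepage YYY straight to cpYYY, uses cp437 for 0x01, and leaves out Kamenicky and Mazovia. The integer keys and the `codec_for_ldid` helper are assumptions made for illustration.

```python
import codecs

# LDID byte -> (Python codec name, description). Only the entries discussed
# in this thread; Kamenicky (0x68) and Mazovia (0x69) are deliberately omitted.
code_pages = {
    0x01: ('cp437', 'U.S. MS-DOS'),
    0x02: ('cp850', 'International MS-DOS'),
    0x03: ('cp1252', 'Windows ANSI'),
    0x04: ('mac_roman', 'Standard Macintosh'),
    0x64: ('cp852', 'Eastern European MS-DOS'),
    0x65: ('cp866', 'Russian MS-DOS'),
    0x66: ('cp865', 'Nordic MS-DOS'),
    0x67: ('cp861', 'Icelandic MS-DOS'),
    0x6a: ('cp737', 'Greek MS-DOS (437G)'),
    0x6b: ('cp857', 'Turkish MS-DOS'),
    0x78: ('cp950', 'Traditional Chinese (Hong Kong SAR, Taiwan) Windows'),
    0x79: ('cp949', 'Korean Windows'),
    0x7a: ('cp936', 'Chinese Simplified (PRC, Singapore) Windows'),
    0x7b: ('cp932', 'Japanese Windows'),
    0x7c: ('cp874', 'Thai Windows'),
    0x7d: ('cp1255', 'Hebrew Windows'),
    0x7e: ('cp1256', 'Arabic Windows'),
    0xc8: ('cp1250', 'Eastern European Windows'),
    0xc9: ('cp1251', 'Russian Windows'),
    0xca: ('cp1254', 'Turkish Windows'),
    0xcb: ('cp1253', 'Greek Windows'),
    0x96: ('mac_cyrillic', 'Russian Macintosh'),
    0x97: ('mac_latin2', 'Macintosh EE'),
    0x98: ('mac_greek', 'Greek Macintosh'),
}


def codec_for_ldid(ldid):
    """Return the codec name for a language driver ID byte, or None if unmapped."""
    entry = code_pages.get(ldid)
    return entry[0] if entry else None


# Sanity check: every codec name in the table must be known to Python,
# otherwise codecs.lookup raises LookupError here.
for _ldid, (name, _description) in code_pages.items():
    codecs.lookup(name)

print(codec_for_ldid(0x7b))  # the Japanese Windows entry
print(codec_for_ldid(0x68))  # Kamenicky was left out, so there is no mapping
```

Returning None for an unmapped ID leaves it to the caller to decide whether to fall back to some default or to refuse to decode.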