Re: unicode and dbf files

Ethan Furman Mon, 26 Oct 2009 09:37:57 -0700

John Machin wrote:

On Oct 24, 4:14 am, Ethan Furman <et...@stoneleaf.us> wrote:

John Machin wrote:

On Oct 23, 3:03 pm, Ethan Furman <et...@stoneleaf.us> wrote:

John Machin wrote:

On Oct 23, 7:28 am, Ethan Furman <et...@stoneleaf.us> wrote:

Greetings, all!

I would like to add unicode support to my dbf project.  The dbf header
has a one-byte field to hold the encoding of the file.  For example,
\x01 is code page 437 (U.S. MS-DOS).
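
As a rough illustration of where that one-byte field lives, here is a minimal sketch in Python 3. It assumes the common dBase/FoxPro header layout, in which the language driver ID sits at byte offset 29 of the 32-byte header; the function names are hypothetical and not taken from the dbf project.

```python
# Assumption: the language driver ID (code page mark) is the byte at
# offset 29 of the fixed 32-byte DBF header.
LDID_OFFSET = 29
HEADER_SIZE = 32


def language_driver_id(header):
    """Return the language driver ID from raw DBF header bytes."""
    if len(header) < HEADER_SIZE:
        raise ValueError("DBF header must be at least %d bytes" % HEADER_SIZE)
    # Indexing a bytes object in Python 3 yields an int.
    return header[LDID_OFFSET]


def read_language_driver_id(path):
    """Read the first 32 bytes of a file and return its language driver ID."""
    with open(path, "rb") as dbf_file:
        return language_driver_id(dbf_file.read(HEADER_SIZE))
```

Returning the ID as an integer keeps it easy to use as a dictionary key when looking up the matching codec.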

My google-fu is apparently not up to the task of locating a complete
resource that has a list of the 256 possible values and their
corresponding code pages.

What makes you imagine that all 256 possible values are mapped to code
pages?

I'm just wanting to make sure I have whatever is available, and
preferably standard.  :D

So far I have found this, plus variations: http://support.microsoft.com/kb/129631

Does anyone know of anything more complete?

That is for VFP3. Try the VFP9 equivalent.

dBase 5,5,6,7 use others which are not defined in publicly available
dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary
source: ESRI support site.

Well, a couple hours later and still not more than I started with.
Thanks for trying, though!

Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search
keywords and you couldn't come up with anything??


Perhaps "nothing new" would have been a better description.  I'd already
seen the clicketyclick site (good info there)



Do you think so? My take is that it leaves out most of the codepage
numbers, and these two lines are wrong:
65h     Nordic MS-DOS   code page 865
66h     Russian MS-DOS  code page 866

That was the site I used to get my whole project going, so ignoring the unicode aspect, it has been very helpful to me.

and all I found at ESRI
were folks trying to figure it out, plus one link to a list that was no
different from the vfp3 list (or was it that the list did not give the
hex values?  Either way, of no use to me.)



Try this:
http://webhelp.esri.com/arcpad/8.0/referenceguide/

Wow. Question, though: all those codepages mapping to 437 and 850 -- are they really all the same?

I looked at dbase.com, but came up empty-handed there (not surprising,
since they are a commercial company).



MS and ESRI have docs ... does that mean that they are non-commercial
companies?

I don't know enough about ESRI to make an informed comment, so I'll just say I'm grateful they have them! MS is a complete mystery... perhaps they are finally seeing the light? Hard to believe, though, from a company that has consistently changed their file formats with every release.

I searched some more on Microsoft's site in the VFP9 section, and was
able to find the code page section this time.  Sadly, it only added
about seven codes.

At any rate, here is what I have come up with so far.  Any corrections
and/or additions greatly appreciated.

code_pages = {
    '\x01' : ('ascii', 'U.S. MS-DOS'),



All of the sources say codepage 437, so why ascii instead of cp437?


Hard to say, really.  Adjusted.

    '\x02' : ('cp850', 'International MS-DOS'),
    '\x03' : ('cp1252', 'Windows ANSI'),
    '\x04' : ('mac_roman', 'Standard Macintosh'),
    '\x64' : ('cp852', 'Eastern European MS-DOS'),
    '\x65' : ('cp866', 'Russian MS-DOS'),
    '\x66' : ('cp865', 'Nordic MS-DOS'),
    '\x67' : ('cp861', 'Icelandic MS-DOS'),
    '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'),     # iffy



Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
not alone. I suggest that you omit Kamenicky until someone actually
wants it.

Yeah, I noticed that. Tentative plan was to implement it myself (more for practice than anything else), and also to be able to raise a more specific error ("Kamenicky not currently supported" or some such).
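
One way the "more specific error" idea might look, as a minimal Python 3 sketch. The exception class and helper name are hypothetical; the only library call relied on is codecs.lookup(), which raises LookupError for an encoding Python does not know (such as cp895).

```python
import codecs


class UnsupportedCodePage(LookupError):
    """Hypothetical error: the file names a code page Python cannot decode."""


def require_codec(name, description):
    """Return the CodecInfo for `name`, or raise a descriptive error."""
    try:
        return codecs.lookup(name)
    except LookupError:
        # Replace the generic "unknown encoding" message with one that
        # names the language driver the file asked for.
        raise UnsupportedCodePage(
            "%s (%s) not currently supported" % (description, name)
        ) from None
```

Subclassing LookupError means callers that already catch the generic lookup failure keep working.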

    '\x69' : ('cp852', 'Mazovia (Polish) MS-DOS'),      # iffy



Look 5 lines back. cp852 is 'Eastern European MS-DOS'. Mazovia
predates and is not the same as cp852. In any case, I suggest that you
omit Mazovia until someone wants it. Interesting reading:

http://www.jastra.com.pl/klub/ogonki.htm


Very interesting reading.

    '\x6a' : ('cp737', 'Greek MS-DOS (437G)'),
    '\x6b' : ('cp857', 'Turkish MS-DOS'),
    '\x78' : ('big5', 'Traditional Chinese (Hong Kong SAR, Taiwan)\



big5 is *not* the same as cp950. The products that create DBF files
were designed for Windows. So when your source says that LDID 0xXX
maps to Windows codepage YYY, I would suggest that all you should do
is translate that without thinking to python encoding cpYYY.


Ack.  Not sure how I missed 'Windows' at the end of that description.
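
The "translate without thinking" rule can be sketched in a few lines of Python 3. The function name is hypothetical; it simply builds "cpYYY" from the Windows code page number and lets codecs.lookup() confirm that Python ships a codec under that name.

```python
import codecs


def codec_for_windows_codepage(codepage):
    """Map a Windows code page number (e.g. 950) to a Python codec name."""
    name = "cp%d" % codepage
    # Raises LookupError if Python has no codec registered under this name.
    codecs.lookup(name)
    return name
```

For example, code page 950 becomes "cp950" rather than a hand-picked near-equivalent such as big5.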

               Windows'),       # wag


What does "wag" mean?


wag == 'wild ass guess'

    '\x79' : ('iso2022_kr', 'Korean Windows'),          # wag


Try cp949.


Done.

    '\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore)\
               Windows'),       # wag



Very wrong. iso2022_jp_2 is supposed to include basic Japanese, basic
(1980) Chinese (GB2312) and a basic Korean kit. However to quote from
"CJKV Information Processing" by Ken Lunde, "... from a practical
point of view, ISO-2022-JP-2 ..... [is] equivalent to ISO-2022-JP-1
encoding." i.e. no Chinese support at all. Try cp936.


Done.

    '\x7b' : ('iso2022_jp', 'Japanese Windows'),        # wag



Try cp936.


You mean 932?

    '\x7c' : ('cp874', 'Thai Windows'),                 # wag
    '\x7d' : ('cp1255', 'Hebrew Windows'),
    '\x7e' : ('cp1256', 'Arabic Windows'),
    '\xc8' : ('cp1250', 'Eastern European Windows'),
    '\xc9' : ('cp1251', 'Russian Windows'),
    '\xca' : ('cp1254', 'Turkish Windows'),
    '\xcb' : ('cp1253', 'Greek Windows'),
    '\x96' : ('mac_cyrillic', 'Russian Macintosh'),
    '\x97' : ('mac_latin2', 'Macintosh EE'),
    '\x98' : ('mac_greek', 'Greek Macintosh') }



HTH,
John

Very helpful indeed. Many thanks for reviewing and correcting. Learning to deal with unicode is proving more difficult for me than learning Python was to begin with! ;D
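
For reference, here is a sketch of how the table might look with the corrections from this thread folded in, written for Python 3 with integer keys. Several choices are assumptions rather than anything settled above: Kamenicky (\x68) and Mazovia (\x69) are simply left out, \x7b uses cp932 (the Windows Japanese code page), and the names CODE_PAGES and decode_field are hypothetical.

```python
# Language driver ID -> (Python codec name, description).
CODE_PAGES = {
    0x01: ("cp437", "U.S. MS-DOS"),
    0x02: ("cp850", "International MS-DOS"),
    0x03: ("cp1252", "Windows ANSI"),
    0x04: ("mac_roman", "Standard Macintosh"),
    0x64: ("cp852", "Eastern European MS-DOS"),
    0x65: ("cp866", "Russian MS-DOS"),
    0x66: ("cp865", "Nordic MS-DOS"),
    0x67: ("cp861", "Icelandic MS-DOS"),
    # 0x68 Kamenicky and 0x69 Mazovia omitted: no matching Python codec.
    0x6A: ("cp737", "Greek MS-DOS (437G)"),
    0x6B: ("cp857", "Turkish MS-DOS"),
    0x78: ("cp950", "Traditional Chinese (Hong Kong SAR, Taiwan) Windows"),
    0x79: ("cp949", "Korean Windows"),
    0x7A: ("cp936", "Chinese Simplified (PRC, Singapore) Windows"),
    0x7B: ("cp932", "Japanese Windows"),  # assumption, see lead-in
    0x7C: ("cp874", "Thai Windows"),
    0x7D: ("cp1255", "Hebrew Windows"),
    0x7E: ("cp1256", "Arabic Windows"),
    0x96: ("mac_cyrillic", "Russian Macintosh"),
    0x97: ("mac_latin2", "Macintosh EE"),
    0x98: ("mac_greek", "Greek Macintosh"),
    0xC8: ("cp1250", "Eastern European Windows"),
    0xC9: ("cp1251", "Russian Windows"),
    0xCA: ("cp1254", "Turkish Windows"),
    0xCB: ("cp1253", "Greek Windows"),
}


def decode_field(raw, ldid):
    """Decode raw field bytes using the codec for the given driver ID."""
    try:
        codec, _description = CODE_PAGES[ldid]
    except KeyError:
        raise ValueError(
            "unsupported language driver ID: 0x%02x" % ldid
        ) from None
    return raw.decode(codec)
```

The same bytes can decode differently depending on the driver ID, which is why the header byte has to be consulted before any character field is read.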


~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list
