Re: [GENERAL] Second byte of multibyte characters causing trouble

Karen Ellrick Thu, 20 Sep 2001 22:37:03 -0700
> The usual way to deal with this is to convert the J text from S-JIS (which
> will almost always cause problems) to either EUC-JP or UTF8 encoding before
> inserting it into the DB or otherwise messing with it. You can then convert
> it back to SJIS before sending it to the client.

After reading this, I started thinking in terms of character sets and dug a
little more, and lo and behold, I discovered that our installation of
PostgreSQL was configured with "--enable-multibyte=EUC_JP".  No wonder I'm
having problems!  Okay, I'm convinced.
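
A small sketch of why raw Shift-JIS bytes tend to cause this kind of trouble,
using Python's built-in `shift_jis` and `euc_jp` codecs. The sample characters
and the helper name `ascii_range_bytes` are chosen here for illustration and
are not from the thread. Some Shift-JIS characters have a second byte in the
ASCII range (0x5C, the backslash, in these samples), which byte-oriented code
can mistake for an escape character. EUC-JP keeps every byte of a multibyte
character at 0x80 or above.

```python
# Sample characters whose Shift-JIS encoding ends in 0x5C (ASCII backslash).
SAMPLES = ["表", "ソ", "能"]


def ascii_range_bytes(ch, encoding):
    """Return the bytes of one encoded character that fall below 0x80."""
    raw = ch.encode(encoding)
    return bytes(b for b in raw if b < 0x80)


for ch in SAMPLES:
    sjis = ch.encode("shift_jis")
    # SJIS -> EUC-JP conversion, going through Unicode text in between.
    euc = sjis.decode("shift_jis").encode("euc_jp")
    print(ch, "SJIS", sjis.hex(), "EUC-JP", euc.hex(),
          "ASCII-range bytes in SJIS:", ascii_range_bytes(ch, "shift_jis"))
```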

Now first I have to convert my existing data, which, although sitting in a
database that expects EUC, is actually SJIS-encoded text.  I found the
following series of bash commands in a Japanese mailing list archive - does
it look like this will work for me?  (It looks scary to just drop the whole
database and hope that the .out file knows how to rebuild it with all the
indexes, sequences, users, etc. in place - should I be nervous?)
$ pg_dump -D dbname > db.out
$ dropdb dbname
$ createdb -E EUC_JP dbname
$ export PGCLIENTENCODING=SJIS
$ psql dbname < db.out
$ export PGCLIENTENCODING=EUC_JP
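
One way to reduce the nervousness before running dropdb might be a dry run on
the dump file itself. The following is a minimal sketch, not part of the
quoted recipe: it assumes the dump really contains Shift-JIS bytes, and the
function name `convert_dump` and the file names are hypothetical. It decodes
the file strictly as Shift-JIS and writes an EUC-JP copy, so any byte sequence
that is not valid Shift-JIS raises an error instead of being silently carried
along. Text typed in Windows tools may need Python's `cp932` codec rather than
`shift_jis` if it uses vendor-extension characters.

```python
import os
import tempfile


def convert_dump(src_path, dst_path, src_enc="shift_jis", dst_enc="euc_jp"):
    """Re-encode a dump file; raises UnicodeError on bytes invalid in src_enc."""
    with open(src_path, "rb") as f:
        raw = f.read()
    text = raw.decode(src_enc)  # strict decode: fails loudly on bad bytes
    with open(dst_path, "wb") as f:
        f.write(text.encode(dst_enc))
    return len(raw)


# Demo on a throwaway file standing in for db.out (contents are made up).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "db.out")
dst = os.path.join(workdir, "db.euc.out")
sample = "INSERT INTO t VALUES ('表示');\n"
with open(src, "wb") as f:
    f.write(sample.encode("shift_jis"))

convert_dump(src, dst)
with open(dst, "rb") as f:
    converted = f.read()
print(converted.decode("euc_jp") == sample)
```

If the strict decode succeeds on the real dump, that at least suggests the
data is consistently Shift-JIS; keeping the original db.out untouched until
the restored database has been checked is still the safer course.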

Regarding the user interface end, when I read the suggested solution of
using jcode to convert everything in and out of the database, I thought,
"That's tedious!  Why not just use EUC on the web pages, and the whole
system will be in sync?"  But that seems to be almost as tedious.  The
Windows-based editor I normally use to input the Japanese text portions of
my code (I do most of the work in vi on my Linux box, but I can't input the
Japanese that way) reads and writes in Shift-JIS unless I use pre- and
post-processing filters, and it seems that other Windows programs also favor
Shift-JIS.  I did a totally unofficial, very-small-data-sample survey of
Japanese web sites, and it seems that in general, sites that deal with
ordinary consumers (and likely are written on Microsoft machines) use
Shift-JIS (even ones that I figure must use databases, like search engines
and e-commerce), Linux-related sites use JIS, and PostgreSQL-related sites
use EUC.  I'm sure there's a grand story to explain how it got to be this
messy, but for right now, I guess we have to live with all these different
systems - apparently there is not one system that works nicely for all
things, or else the others would gradually become obsolete, right?

Before I add jcode function calls for every piece of data I get in or out of
the database, or convert all my web page text to EUC-JP (I haven't decided
yet which approach is more work, or more of a problem to maintain), are
there any other thoughts on this?  For example, does someone know of one of
the following: (a) a way to get the text-only console of a RedHat 6.1J box
to actually display Japanese characters (if so, I not only wouldn't have to
deal with the Windows box for editing, I could even read the output of
queries in psql!), or (b) a text editor for Windows that can be configured
to default to EUC, rather than having to remember to always select a filter
to convert to and from Shift-JIS?  Or on the flip side of the discussion,
can anyone imagine pitfalls associated with having a web site that is half
EUC (the PHP and Perl files that deal with the database) and half Shift-JIS
(the static HTML pages that are written by other people in who-knows-what
Windows-based tools)?
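
To illustrate the kind of pitfall a half-EUC, half-Shift-JIS site could run
into, here is a hedged sketch using Python's built-in codecs (the helper name
`decodes_as` and the sample character are made up for this example). For this
sample, Shift-JIS bytes handed to code expecting EUC-JP are rejected, while
EUC-JP bytes handed to code expecting Shift-JIS decode without any error but
come out as different text, which is the harder case to notice.

```python
def decodes_as(data, encoding):
    """Return True if data is valid in the given encoding (strict decode)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


word = "表"
sjis_bytes = word.encode("shift_jis")
euc_bytes = word.encode("euc_jp")

# Shift-JIS bytes read by code that expects EUC-JP.
sjis_read_as_euc_ok = decodes_as(sjis_bytes, "euc_jp")

# EUC-JP bytes read by code that expects Shift-JIS.
misread = euc_bytes.decode("shift_jis")

print("SJIS accepted as EUC-JP:", sjis_read_as_euc_ok)
print("EUC-JP read as SJIS matches original:", misread == word)
```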

Thanks,
--------------------------------
Karen Ellrick
S & C Technology, Inc.
1-21-35 Kusatsu-shinmachi
Hiroshima  733-0834  Japan
(from U.S. 011-81, from Japan 0) 82-293-2838
--------------------------------

