Re: I'm looking for a method of converting a string's character encoding

Sunjoong Lee Sat, 28 Apr 2012 15:43:03 -0700

Thanks hien-Thi, Daniel and Eli.

Eli gave a good example; I'll add another. In countries that use multibyte
character encodings, such as China, Japan and Korea (the CJK countries),
converting between codesets is a common problem. In Korea, one web page may be
written in EUC-KR and another in UTF-8. In Japan, the choices are Shift-JIS,
EUC-JP, ISO-2022-JP and UTF-8. In China, GBK, GB18030, Big5, Big5-HKSCS and
UTF-8. In other words, on the net Koreans use 2 different codesets, Japanese 4
and Chinese 5.


A single program is unlikely to compare a Chinese web page with a Korean one,
but... suppose you want to write a program that monitors web pages: it would
need a codeset converter. And not just for CJK: Greeks use 3 codesets,
Vietnamese 2, Arabs 3, and so on. Russians seem to use as many codesets as
the Chinese.
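
To make the monitoring case concrete, here is a minimal sketch of the kind of
conversion I mean. It is written in Python rather than Guile, purely as an
illustration; the helper name `transcode` and the sample strings are made up
for this example, and a real monitor would also have to find out each page's
codeset (from HTTP headers or a meta tag), which is not shown.

```python
# Hypothetical helper: the name and the sample data are illustrative only.
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes from the source codeset, then re-encode in the target."""
    return data.decode(source).encode(target)


# Pretend these bytes came from two pages that use different codesets.
korean_page = "한글".encode("euc_kr")
japanese_page = "日本語".encode("shift_jis")

# Bring both pages into one common codeset so they can be handled uniformly.
for raw, codeset in [(korean_page, "euc_kr"), (japanese_page, "shift_jis")]:
    print(transcode(raw, codeset).decode("utf-8"))
```

The point is only that the program needs a function from (bytes, codeset) to
bytes in another codeset, independent of any port.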

2012/4/29 Eli Zaretskii <[email protected]>

> > Date: Sat, 28 Apr 2012 20:29:22 +0200
> > From: Daniel Krueger <[email protected]>
> > Cc: [email protected], Sunjoong Lee <[email protected]>
> >
> > i think there shouldn't be any transcoding of guile's strings, as
> > strings are internal representation of characters, no matter how they
> > are encoded. So the only time when encoding matters is when it passes
> > its `internal boundaries', i mean if you write the string to a port or
> > read from a port or pass it as a string to a foreign library. For the
> > ports all transcoding is available, and as said, the real
> > representation of guile strings internally is as utf8, which can't be
> > changed. The only additional thing i forgot about are bytevectors, if
> > you convert a string to an explicit representation, but afaik there
> > you also can give the encoding to use.
> >
> > Am I wrong?
>
> You are mostly right, but only "mostly".  Experience teaches that
> sometimes you need to change encoding even inside "the boundaries".
> One notable example is when the original encoding was determined
> incorrectly, and the application wants to "re-decode" the string, when
> its external origin is no longer available.  Another example is an
> application that wants to convert an encoded string into base-64 (or
> similar) form -- you'll need to encode the string internally first.
>
> These kinds of rare, but still important, use cases are the reason why
> Emacs Lisp has primitives to do encoding and decoding of in-memory
> strings; as much as Emacs maintainers want to get rid of the related
> need to support "unibyte strings", they are not going to go away any
> time soon.
>
> IOW, Guile needs a way to represent a string encoded in something
> other than UTF-8, and convert between UTF-8 and other encodings.
>
