Re: What Unicode means to us

Adam Richardson Tue, 10 Aug 2004 06:56:50 -0700

On Monday, 9 August 2004 at 4:14 AM +1000, Dan Sugalski wrote:

>Since this has been a sore spot lately, and one
>we need to deal with, we might as well formally
>define what that is.
>
>We must be able to:
>
>*) Load in string data from an IO source,
>regardless of its encoding, and treat it as
>Unicode string data
>*) Write string data to an IO source in any Unicode encoding
>*) Collate strings per the Unicode standard
>*) Convert non-Unicode string data to Unicode
>properly (that is, obeying the Unicode conversion
>rules)
>*) Treat combining characters the same regardless
>of whether they're composed or decomposed
>
>We don't care about on-screen rendering or date/time/money formatting.
>
>So, basically, we need to be able to read in data
>regardless of whether it's UTF-8, UTF-16, or
>UTF-32 encoded, and when we have it we should be
>able to properly match "o" against "o" and not
>"ö" (that's o with an umlaut over it) regardless
>of whether the "ö" is composed (that is, one
>codepoint) or decomposed (that is, two code
>points), and then write it out to some IO handle
>in proper UTF-8/16/32 format. When comparing two
>Unicode strings we must be able to do so
>properly, per the Unicode collation standard.
>(With potential local overrides if we ever put
>those in) We must also be able to case-mangle
>(that is, upcase, downcase, or titlecase) the
>string.
>
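
A minimal sketch of the matching and case-mangling requirement above, written in Python purely for illustration (this is not Parrot's API, and the helper name same_text is hypothetical). It assumes that normalizing both strings to NFC before comparing is an acceptable way to treat composed and decomposed forms alike:

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (an assumed choice)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00f6"     # o with diaeresis as one code point
decomposed = "o\u0308"  # "o" followed by a combining diaeresis, two code points

# The two spellings of o-umlaut differ code point by code point...
print(composed == decomposed)
# ...but match once both are normalized, while plain "o" still does not.
print(same_text(composed, decomposed))
print(same_text("o", composed))

# Case mangling gives the same result for both spellings after normalization.
print(same_text(composed.upper(), decomposed.upper()))
```

Normalizing to NFD instead would work equally well for the comparison; the only requirement is that both sides use the same form.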
>Additionally if we have source text which is
>Latin-n, EBCDIC, ASCII, or whatever we must be
>able to convert it with no loss to Unicode.
>(Which I believe is now doable with Unicode 4.0)
>Losslessly converting Unicode to
>ASCII/EBCDIC/whatever is *not* required, which is
>fine as it's theoretically (and often
>practically) impossible.
>
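
The asymmetry described above (legacy charset to Unicode is lossless, Unicode to ASCII is not) can be sketched as follows. This is again Python for illustration only, not the proposed Parrot encoding/charset API; the helper fits_in is a hypothetical name, and Latin-1 stands in for "Latin-n":

```python
def fits_in(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

raw_latin1 = b"na\xefve"            # the word "naive" with i-diaeresis, in Latin-1
text = raw_latin1.decode("latin-1")  # legacy bytes -> Unicode string

# Converting to Unicode and back to the source charset loses nothing.
print(text.encode("latin-1") == raw_latin1)

# The same string can be written out in any of the Unicode encodings.
for enc in ("utf-8", "utf-16", "utf-32"):
    print(enc, text.encode(enc).decode(enc) == text)

# Going the other way is not always possible: ASCII has no i-diaeresis.
print(fits_in(text, "ascii"))
```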
>I think that's it. Spelling it out's made the
>encoding and charset API clear. I'll type that in
>and get it off next.


Hi Dan,

I've been lurking on this list for a while. I use Unicode a lot since I
specialise in building multi-lingual apps. We're currently mid-shift to
using Perl as our main dev language after using Lasso for many years.

Your Unicode specs sound spot on, and it'll be great to see Unicode "done
right".

I'm not sure how you plan to integrate the database level (or whether it
affects what you are doing at all), but presumably you know all about the
new character set and collation support in MySQL 4.1. Things have changed
quite a bit there from 4.0, and I've seen a few issues with various
middleware handling those changes.

There's a ton of info on it at
<http://dev.mysql.com/doc/mysql/en/Charset.html>

- Adam


~~~~

Adam Richardson

CEO, Primogen Software
http://www.primogensoftware.com

Security Geek, FiveGeeks
http://www.fivegeeks.com

Primogen Software is a privately owned software development
company specialising in the development of high security online
applications and high security database storage strategies.

We combine databases like MySQL and Oracle with Perl,
Mason, and Lasso database middleware to produce intelligent,
adaptive database-driven business intranet and internet applications.

We also provide a range of data security services including penetration
testing, application source code audits and network security audits
with full compliance with the remote auditing and testing requirements
of ISO 17799 (BS7799) and ISO 17799-2000 for information security
testing.

Primogen Software is a division of Waenick Pty Ltd
