On Monday, 9 August 2004 at 4:14 AM +1000, Dan Sugalski wrote:

> Since this has been a sore spot lately, and one we need to deal with.
> Might as well formally define what that is.
>
> We must be able to:
>
> *) Load in string data from an IO source, regardless of its encoding,
>    and treat it as Unicode string data
> *) write string data to an IO source in any Unicode encoding
> *) Collate strings per the Unicode standard
> *) Convert non-Unicode string data to Unicode properly (that is,
>    obeying the Unicode conversion rules)
> *) Treat combining characters the same regardless of whether they're
>    composed or decomposed
>
> We don't care about on-screen rendering or date/time/money formatting.
>
> So, basically, we need to be able to read in data regardless of
> whether it's UTF-8, UTF-16, or UTF-32 encoded, and when we have it we
> should be able to properly match "o" against "o" and not "ö" (that's o
> with an umlaut over it) regardless of whether the "ö" is composed
> (that is, one codepoint) or decomposed (that is, two code points), and
> then write it out to some IO handle in proper UTF-8/16/32 format. When
> comparing two Unicode strings we must be able to do so properly, per
> the Unicode collation standard. (With potential local overrides if we
> ever put those in) We must also be able to case-mangle (that is,
> upcase, downcase, or titlecase) the string.
>
> Additionally if we have source text which is Latin-n, EBCDIC, ASCII,
> or whatever we must be able to convert it with no loss to Unicode.
> (Which I believe is now doable with Unicode 4.0) Losslessly converting
> Unicode to ASCII/EBCDIC/whatever is *not* required, which is fine as
> it's theoretically (and often practically) impossible.
>
> I think that's it. Spelling it out's made the encoding and charset API
> clear. I'll type that in and get it off next.
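
The composed/decomposed matching requirement quoted above can be sketched in a few lines. This is only an illustration in Python, chosen because its standard library exposes Unicode normalization directly; it is not a proposal for how the API under discussion should look. The names `read_text` and `canonical_equal` are made up for this example, and normalising to NFC on input is an assumption (NFD would work equally well for comparison). Collation per the Unicode standard is not covered here, since the Python standard library does not provide it.

```python
import unicodedata

COMPOSED = "\u00f6"      # "ö" as a single code point
DECOMPOSED = "o\u0308"   # "o" followed by COMBINING DIAERESIS


def read_text(raw: bytes, encoding: str) -> str:
    """Decode bytes from an IO source and normalise the result to NFC.

    Hypothetical helper: the choice of NFC is an assumption of this sketch.
    """
    return unicodedata.normalize("NFC", raw.decode(encoding))


def canonical_equal(a: str, b: str) -> bool:
    """Compare two strings up to canonical equivalence (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Read the decomposed form back from each Unicode encoding and compare.
for enc in ("utf-8", "utf-16", "utf-32"):
    text = read_text(DECOMPOSED.encode(enc), enc)
    print(enc, canonical_equal(text, COMPOSED), canonical_equal(text, "o"))
```

With this approach the "ö" compares equal to itself whether it arrived as one code point or two, and a plain "o" never matches it. The same `read_text` helper also handles legacy input such as Latin-1, since every Latin-1 byte maps to exactly one Unicode code point.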

Hi Dan,

I've been lurking on this list for a while. I use Unicode a lot since I specialise in building multi-lingual apps. We're currently mid-shift to using Perl as our main dev language after using Lasso for many years.

Your Unicode specs sound spot on, and it'll be great to see Unicode "done right".

I'm not sure how you plan to integrate the database level (or whether it affects what you are doing at all), but presumably you know all about the new encoding and collation sets in MySQL 4.1. Things have changed quite a bit there from 4.0, and I've seen a few issues with various middleware handling those changes.

There's a ton of info on it at <http://dev.mysql.com/doc/mysql/en/Charset.html>

- Adam

~~~~
Adam Richardson
CEO, Primogen Software
http://www.primogensoftware.com
Security Geek, FiveGeeks
http://www.fivegeeks.com

Primogen Software is a privately owned software development company specialising in the development of high security online applications and high security database storage strategies. We combine databases like MySQL and Oracle with Perl, Mason, and Lasso database middleware to produce intelligent, adaptive database driven business intranet and internet applications. We also provide a range of data security services including penetration testing, application source code audits and network security audits with full compliance with the remote auditing and testing requirements of ISO 17799 (BS7799) and ISO 17799-2000 for information security testing.

Primogen Software is a division of Waenick Pty Ltd