On Fri, 30 Jan 2009, Szakáts Viktor wrote:

Hi,

> Stripping accents from a string is often needed when dealing with external
> systems accepting only ASCII alphabet, and is useful to prepare strings for
> comparison (in case of names f.e.).
> original: "aááäbcdeéfghiííjklmnoóóöőőpqrstuúúüűűvwxyz"
> stripped: "aaaabcdeefghiiijklmnoooooopqrstuuuuuuvwxyz"
> This functionality is missing from Harbour, yet it would be best integrated
> to the codepage subsystem. But before I do anything, I'd like to discuss it
> here.
> I was thinking adding two new strings to the HB_CODEPAGE structure,
> which would hold the necessary ASCII character equivalent of the current
> accented lowercase and uppercase char strings.
> Any comments?

This should be done in a different way. The current codepage implementation
has a few very serious limitations, e.g. it cannot translate characters
between languages, it is impossible to define a CP which does not have
corresponding upper case letters, it cannot translate multibyte characters,
etc.
If you add the above strings to the CP then you will have yet another thing
which works only locally for the given codepage, without any relation to
other CPs.
It should be implemented differently.
We will need a global Unicode fallback table which will work for any CP,
looking for the corresponding character replacement in the destination CP,
so it will be enough to make a translation between any used CP and an ASCII
CP (we will have to introduce such a Unicode table, where all characters
which are not in the range 32 <= x < 127 are mapped to 0). Such a fallback
table will also work with multibyte translations or, if necessary, will
replace a single byte with a multibyte phonetic sequence. E.g. personally
I'm using such a feature to translate texts in Cyrillic to Latin characters.
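
A minimal sketch of that idea, written in Python rather than Harbour's C
purely for illustration. The FALLBACK table, its few entries and the
translate() helper are hypothetical and are not Harbour API; Python's
built-in codecs stand in for the per-CP Unicode tables, and unmapped
characters become "?" here instead of 0.

```python
# Hypothetical global fallback table: Unicode character -> replacement.
# It is shared by all codepages; entries may be longer than one character
# (phonetic sequences). Only a handful of entries are shown.
FALLBACK = {
    "ą": "a", "ć": "c", "ę": "e", "ł": "l", "ń": "n",
    "ó": "o", "ś": "s", "ź": "z", "ż": "z",
    "ж": "zh", "у": "u", "к": "k",
}


def translate(data, src_cp, dst_cp, unknown="?"):
    """Translate bytes from src_cp to dst_cp through Unicode.

    A character that exists in the destination CP is copied as-is;
    otherwise its replacement is taken from the global fallback table.
    """
    out = bytearray()
    for ch in data.decode(src_cp):
        try:
            out += ch.encode(dst_cp)
        except UnicodeEncodeError:
            repl = FALLBACK.get(ch, unknown)
            out += repl.encode(dst_cp, errors="replace")
    return bytes(out)
```

With these entries, translate("żółw".encode("cp852"), "cp852", "ascii")
gives b"zolw", the same text translated from "iso8859_2" to "cp852" keeps
its accents because the destination CP has them, and Cyrillic "жук" in
"cp866" becomes b"zhuk" in ASCII.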

Meanwhile, if you need such functionality, you can simply introduce
new CPs which have only ASCII characters for the given language, with
some suffix like NONE or ASCII, e.g.:
   "PLNONE"
      [...]
      "AABCCDEEFGHIJKLLMNNOOPQRSSTUVWXYZZZ",
      "aabccdeefghijkllmnnoopqrsstuvwxyzzz",
      [...]

and then to strip accented characters you can make translations between
   "PL*" -> "PLNONE"
You can do the same for any other language.
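
The effect of such a translation can be sketched as a position-by-position
mapping between the letter strings of the two CPs. This is an illustration
in Python, not Harbour code: the PLNONE strings are the ones shown above,
while the accented PL_UPPER / PL_LOWER strings are an assumption (the usual
Polish alphabet order), and strip_polish_accents() is a hypothetical helper.

```python
# Letter strings of the accented CP (assumed) and of the "PLNONE" CP (from
# the example above). The letter at position i in one CP corresponds to the
# letter at position i in the other.
PL_UPPER = "AĄBCĆDEĘFGHIJKLŁMNŃOÓPQRSŚTUVWXYZŹŻ"
PLNONE_UPPER = "AABCCDEEFGHIJKLLMNNOOPQRSSTUVWXYZZZ"
PL_LOWER = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"
PLNONE_LOWER = "aabccdeefghijkllmnnoopqrsstuvwxyzzz"

# Both CPs must define the same number of letters for this to work.
assert len(PL_UPPER) == len(PLNONE_UPPER)
assert len(PL_LOWER) == len(PLNONE_LOWER)

PL_TO_PLNONE = str.maketrans(PL_UPPER + PL_LOWER,
                             PLNONE_UPPER + PLNONE_LOWER)


def strip_polish_accents(text):
    """Map every Polish letter to its PLNONE counterpart."""
    return text.translate(PL_TO_PLNONE)
```

For example, strip_polish_accents("Zażółć gęślą jaźń") returns
"Zazolc gesla jazn". Characters outside the two letter strings are left
untouched.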
Sooner or later I will have to start serious modifications in our CP code,
and I cannot promise that I'll be able to keep CP extensions like the one
you proposed working.
The other solution is creating a map from the Unicode table to Latin letters
and then using this map for translations. This can be done as a separate
function and will also work for all languages.
It could be a table indexed by the Unicode U16 value, holding ASCII
characters or (if you want to introduce multibyte translations for languages
which do not have corresponding unaccented single characters in the Latin
alphabet) strings.
It would be a version of the fallback table I want to introduce, limited to
ASCII conversions.
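
A rough sketch of such a standalone table, again in Python only for
illustration. TO_LATIN, its sample entries and to_latin() are hypothetical
names; a real table would need entries for every supported letter, and
unmapped slots are shown as "?" rather than 0.

```python
U16_SIZE = 0x10000

# Table indexed by the 16-bit Unicode value. Each slot holds a string so
# that one character may expand to several Latin letters; None marks
# "no mapping".
TO_LATIN = [None] * U16_SIZE

# Printable ASCII (32 <= x < 127) maps to itself.
for code in range(32, 127):
    TO_LATIN[code] = chr(code)

# A few sample entries (assumed, far from complete).
SAMPLE_ENTRIES = {
    0x00E1: "a", 0x00E4: "a", 0x00E9: "e", 0x00ED: "i",
    0x00F3: "o", 0x00F6: "o", 0x0151: "o",
    0x00FA: "u", 0x00FC: "u", 0x0171: "u",
    0x00DF: "ss",               # one character -> two letters
    0x0416: "Zh", 0x0436: "zh"  # Cyrillic, phonetic sequence
}
for code, repl in SAMPLE_ENTRIES.items():
    TO_LATIN[code] = repl


def to_latin(text, unknown="?"):
    """Replace every character using the table; unmapped ones become `unknown`."""
    out = []
    for ch in text:
        code = ord(ch)
        repl = TO_LATIN[code] if code < U16_SIZE else None
        out.append(unknown if repl is None else repl)
    return "".join(out)
```

Because the table is keyed by Unicode value and not by CP, the same
function serves every language: to_latin("árvíztűrő") returns "arvizturo"
and to_latin("Straße") returns "Strasse".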

best regards,
Przemek
_______________________________________________
Harbour mailing list
Harbour@harbour-project.org
http://lists.harbour-project.org/mailman/listinfo/harbour
