On Fri, 30 Jan 2009, Szakáts Viktor wrote:

Hi,
> Stripping accents from a string is often needed when dealing with external
> systems accepting only ASCII alphabet, and is useful to prepare strings for
> comparison (in case of names f.e.).
>    original: "aááäbcdeéfghiííjklmnoóóöőőpqrstuúúüűűvwxyz"
>    stripped: "aaaabcdeefghiiijklmnoooooopqrstuuuuuuvwxyz"
> This functionality is missing from Harbour, yet it would be best integrated
> to the codepage subsystem. But before I do anything, I'd like to discuss it
> here.
> I was thinking adding two new strings to the HB_CODEPAGE structure,
> which would hold the necessary ASCII character equivalent of the current
> accented lowercase and uppercase char strings.
> Any comments?

This should be done in a different way. The current codepage implementation
has a few very serious limitations, e.g. it cannot translate characters
between languages, it is impossible to define a CP which does not have
corresponding upper case letters, it cannot translate multibyte characters,
etc. If you add the above strings to the CP then you will have yet another
thing which works only locally for a given codepage, without any relation to
other CPs.

We will need a global unicode fallback table which works for any CP by
looking for the corresponding character replacement in the destination CP,
so it will be enough to make a translation between any used CP and an ASCII
CP (we will have to introduce such a unicode table, where all characters
which are not in the range 32 <= x < 127 are mapped to 0). Such a fallback
table will also work with multibyte translations or, if necessary, will
replace a single byte with a multibyte phonetic sequence. E.g. personally
I'm using such a feature to translate texts in Cyrillic to Latin characters.

Meanwhile, if you need such functionality then you can simply introduce new
CPs which have only ASCII characters for a given language, with some suffix
like NONE or ASCII, e.g.:

   "PLNONE"
   [...]
   "AABCCDEEFGHIJKLLMNNOOPQRSSTUVWXYZZZ",
   "aabccdeefghijkllmnnoopqrsstuvwxyzzz",
   [...]

and then to strip accented characters you can translate "PL*" -> "PLNONE".
You can do the same for any other language.

Sooner or later I will have to start serious modifications in our CP code
and I cannot promise that I'll be able to keep CP extensions like the one
you proposed working.

The other solution is to create a map from the unicode table to Latin
letters and then use this map for translations. This can be done as a
separate function and will also work for all languages. It could be a table
indexed by unicode U16 value, holding ASCII characters or (if you want to
introduce multibyte translations for languages which do not have
corresponding unaccented single characters in the Latin alphabet) strings.
It would be a version of the fallback table I want to introduce, limited to
ASCII conversions.

best regards,
Przemek
_______________________________________________
Harbour mailing list
Harbour@harbour-project.org
http://lists.harbour-project.org/mailman/listinfo/harbour