On Wed, Feb 24, 2021 at 12:22 PM José Mejuto via lazarus < lazarus@lists.lazarus-ide.org> wrote:
> In my code there is non 100% unicode compatibility when using the
> "CaseInsensitive" mode as as it uses lowercase mask and lowercase string
> to perform the test which is wrong by definition but I was unable to
> find a method to test codepoints case insensitive without pulling in big
> unicode tables.
>
> I was thinking in import the NTFS (the filesystem) case comparison
> tables which are 128 KB "only".
>

That is not necessary. LazUTF8 has functions like UTF8CompareText(),
UTF8CompareTextP() and the latest UTF8CompareLatinTextFast().
UTF8CompareLatinTextFast() supports full Unicode but is optimized for
mostly Latin text.
We should add a PChar version, UTF8CompareLatinTextFastP(), and use it in
your mask code.

> Comprehensive unit tests are a way to prevent breaking things.
>
> And also define if a compatibility break is a bug in the new code or in
> the old code. In example my mask supports (there is a define to disable)
> "[z-a]" converting it to "[a-z]" which is a compatibility break.

Your code does not compile when RANGES_AUTOREVERSE is not defined: cMask
is not found.
The reverse logic can be enabled by default. As I understand it, it does
not break anybody's masks. Earlier "[z-a]" was an error, now it does
something sensible.

> Also there is the support (also can be disabled) for the mask "[?]"
> which is the counterpart for "*" but with one char position.
>

Where did you get this "[?]" syntax? There must be reference
documentation somewhere but I have not seen it.
What is the difference between "?" and "[?]"?


On Wed, Feb 24, 2021 at 1:28 PM José Mejuto via lazarus <lazarus@lists.lazarus-ide.org> wrote:

> > Sometimes I wish we would migrate to using UnicodeString by default.
> > It would make life a bit easier.
> > (And yes I know you would have to deal with composed characters
> > (grapheme defined by more than 1 16-bit word)).
>
> That's a can of worms! UTF8 forces you to write "correct code" (at least
> try it) for any character >127, with UnicodeString you get the false
> apparence that everything magically works until everything cracks when a
> string with surrogate pairs come in play :-) and ALL you text handling
> must be rewritten, and most of them completly rewritten.
>

Exactly. UnicodeString uses UTF-16, which is also a variable-length
encoding. The same rules should be applied, but often they are not. There
is plenty of sloppy UTF-16 code out there.
Writing proper UTF-8 code is not difficult once you wrap your mind around
the concept. There is a learning curve, true. I also scratched my head
for some time when studying it.

Juha
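
For anyone following the thread, two of the points above can be checked with a small sketch: UTF-16 needs two 16-bit units for code points outside the BMP (so it is variable length, just like UTF-8), and lowercasing both sides is not the same as a caseless comparison. This is Python rather than Pascal purely so that it runs as-is. It does not use LazUTF8 and makes no claim about how UTF8CompareText() or UTF8CompareLatinTextFast() work internally. The helper names are made up for the illustration.

```python
# Illustration only: helper names are hypothetical, not part of any library.

def utf8_units(s: str) -> int:
    """Number of 8-bit code units needed to store s as UTF-8."""
    return len(s.encode("utf-8"))


def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def equal_by_lowercase(a: str, b: str) -> bool:
    """The naive approach: lowercase both sides, then compare."""
    return a.lower() == b.lower()


def equal_by_casefold(a: str, b: str) -> bool:
    """Compare using Unicode case folding (Python's str.casefold)."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # One code point each, but a different number of code units.
    for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X}: utf8={utf8_units(ch)} utf16={utf16_units(ch)}")

    # Final sigma and sigma are both lowercase already, so lower() keeps
    # them distinct, while case folding maps both to the same letter.
    print(equal_by_lowercase("\u03c2", "\u03c3"))   # False
    print(equal_by_casefold("\u03c2", "\u03c3"))    # True

    # Sharp s folds to "ss": the lowercase comparison misses the match.
    print(equal_by_lowercase("Stra\u00dfe", "STRASSE"))  # False
    print(equal_by_casefold("Stra\u00dfe", "STRASSE"))   # True
```

Note that full case folding can change the length of a string ("ß" becomes "ss"), which matters for a mask matcher where "?" must consume exactly one character. A code-point-by-code-point simple folding may fit that use better.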
-- 
_______________________________________________
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus