Re: [fpc-pascal] Case insensitive comparison of strings with non-ascii characters
@Luiz Americo

Your code

  WideCompareText(UTF8Decode(Key), UTF8Decode(Str))

will work, but if speed matters, then it's rather bad.

I've tried to make a faster function for UTF-8:

  uses
    unicodeinfo, LCLProc;

  function UTF8CompareText(s1, s2: UTF8String): Integer;
  var
    u1, u2: Ucs4Char;
    u1l, u2l: longint;
    BytePos1, Len1, SLen1: integer;
    BytePos2, Len2, SLen2: integer;
  begin
    Result := 0;
    BytePos1 := 1;
    BytePos2 := 1;
    SLen1 := System.Length(s1);
    SLen2 := System.Length(s2);

    if SLen1 <> SLen2 then //Assuming lower/uppercase representations have the same byte length
    begin
      if SLen1 > SLen2 then Result := 1 else Result := -1;
      exit;
    end;

    repeat
      u1 := UTF8CharacterToUnicode(@s1[BytePos1], Len1);
      inc(BytePos1, Len1);
      u2 := UTF8CharacterToUnicode(@s2[BytePos2], Len2);
      inc(BytePos2, Len2);
      if u1 <> u2 then
      begin
        {$IFDEF useunicodinfo}
        u1l := unicodeinfo.utf8proc_get_property(u1)^.lowercase_mapping;
        if u1l <> -1 then u1 := u1l;
        u2l := unicodeinfo.utf8proc_get_property(u2)^.lowercase_mapping;
        if u2l <> -1 then u2 := u2l;
        {$ELSE}
        u1 := UCS4Char(WideUpperCase(WideChar(u1))[1]);
        u2 := UCS4Char(WideUpperCase(WideChar(u2))[1]);
        {$ENDIF}
        if u1 <> u2 then
        begin
          Result := u1 - u2;
          exit;
        end;
      end;
    until (BytePos1 > SLen1) or (BytePos2 > SLen2)
  end;

Some numbers for my system (Linux), where WideCompareText is the function you use now, WideUpperCase is the above function, and unicodeinfo is the above function with useunicodinfo defined. See here http://wiki.lazarus.freepascal.org/Theodp

Comparing identical Strings of 322 Chars 1 times
  WideCompareText:  785ms
  unicodeinfo:       75ms
  WideUpperCase:     74ms

Comparing Strings of 322 Chars 1 times where the 3rd char differs
  WideCompareText:  268ms
  unicodeinfo:        3ms
  WideUpperCase:      8ms

Comparing identical Text of 322 Chars 1 times where one Text is all uppercase
  WideCompareText:  810ms
  unicodeinfo:      121ms
  WideUpperCase:   1076ms

Regards
Theo
___
fpc-pascal maillist - fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Case insensitive comparison of strings with non-ascii characters
On 25 Jul 2009, at 17:46, theo wrote:

> if SLen1 <> SLen2 then //Assuming lower/uppercase representations have the same byte length

That is a wrong assumption. E.g., the lowercase version of I (uppercase i, a single byte) in Turkish is ı (an "i" without a dot, definitely not a single-byte character).

Jonas
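To make the byte-length point concrete, here is a minimal sketch that is not from the thread. It prints the UTF-8 byte lengths of the two Turkish case pairs. The helper FromBytes is hypothetical and only exists so that the strings are built from explicit UTF-8 bytes, independent of the source file encoding.

```pascal
program TurkishCaseByteLength;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of any library): builds a string
// from raw bytes so no codepage conversion can interfere.
function FromBytes(const B: array of Byte): AnsiString;
var
  i: Integer;
begin
  Result := '';
  SetLength(Result, Length(B));
  for i := 0 to High(B) do
    Result[i + 1] := AnsiChar(B[i]);
end;

begin
  // Uppercase I (U+0049) and its Turkish lowercase, dotless i (U+0131)
  WriteLn('U+0049: ', Length(FromBytes([$49])), ' byte(s)');
  WriteLn('U+0131: ', Length(FromBytes([$C4, $B1])), ' byte(s)');
  // Lowercase i (U+0069) and its Turkish uppercase, dotted I (U+0130)
  WriteLn('U+0069: ', Length(FromBytes([$69])), ' byte(s)');
  WriteLn('U+0130: ', Length(FromBytes([$C4, $B0])), ' byte(s)');
end.
```

In both pairs one member takes one byte and the other takes two, so an early exit based on System.Length alone would report such strings as different before any case mapping is applied.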
Re: [fpc-pascal] Case insensitive comparison of strings with non-ascii characters
theo wrote:

> @Luiz Americo
> Your code
> WideCompareText(UTF8Decode(Key), UTF8Decode(Str))
> will work, but if speed matters, then it's rather bad.

Hi, I'm aware that the performance is bad, although I had not tested it like you did, but at this point I'd like to stick with a solution that fpc provides natively, since it's being used in an fpc component (TSqlite3Dataset).

In the last revision I switched to the ansi version of the functions to save the conversion of the Key at each comparison. See http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/packages/fcl-db/src/sqlite/customsqliteds.pas?view=log#rev13431

Anyway, it is clear that functions to handle UTF8, and unicode in general, are missing in fpc...

> I've tried to make a faster function for UTF-8: ...

Maybe your function can be used as a base for future development. Add a new function to the widestringmanager?

Luiz
Re: [fpc-pascal] Case insensitive comparison of strings with non-ascii characters
>> if SLen1 <> SLen2 then //Assuming lower/uppercase representations
>> have the same byte length
>
> That is a wrong assumption. E.g., the lowercase version of I
> (uppercase i, a single byte) in Turkish is ı (an "i" without a dot,
> definitely not a single-byte character).

OK thanks. That's why I added the comment, because I was not sure ;-)

So then one should compare UTF8Lengths or probably forget that shortcut, because calculating UTF8Lengths is not cheap.

Do Turkish systems behave differently for WideLowerCase('I')? Will they return $0131 instead of $0069?

Regards
Theo
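As a rough illustration of why comparing UTF-8 lengths is not cheap, the sketch below counts code points by scanning every byte. Utf8CodePointCount is a hypothetical name written for this example, not the LCL routine, and it assumes its input is valid UTF-8.

```pascal
program Utf8CountDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes S is valid UTF-8.
// Every byte is inspected, so the cost grows linearly with the length.
function Utf8CodePointCount(const S: AnsiString): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

const
  // 'DIŞ' and its Turkish lowercase 'dış' as explicit UTF-8 bytes
  UpperBytes: array[0..3] of Byte = ($44, $49, $C5, $9E);
  LowerBytes: array[0..4] of Byte = ($64, $C4, $B1, $C5, $9F);

var
  Upper, Lower: AnsiString;
begin
  SetString(Upper, PAnsiChar(@UpperBytes[0]), Length(UpperBytes));
  SetString(Lower, PAnsiChar(@LowerBytes[0]), Length(LowerBytes));

  // Byte lengths differ (4 and 5), code point counts agree (3 and 3).
  WriteLn('bytes      : ', Length(Upper), ' vs ', Length(Lower));
  WriteLn('code points: ', Utf8CodePointCount(Upper), ' vs ',
          Utf8CodePointCount(Lower));
end.
```

Since both strings have to be scanned in full just to decide whether the shortcut applies, dropping the shortcut and letting the main loop find the first difference is likely the simpler choice.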
Re: [fpc-pascal] Case insensitive comparison of strings with non-ascii characters
On 25 Jul 2009, at 19:03, theo wrote:

> Do Turkish systems behave differently for WideLowerCase('I')? Will they return $0131 instead of $0069?

They should, since the uppercase version of i is İ there (i.e., a capital I with a dot on top). See e.g. http://www.i18nguy.com/unicode/turkish-i18n.html

Jonas
Re[2]: [fpc-pascal] Case insensitive comparison of strings with non-ascii characters
Hello FPC-Pascal,

Saturday, July 25, 2009, 5:46:39 PM, you wrote:

t> @Luiz Americo
t> Your code
t> WideCompareText(UTF8Decode(Key), UTF8Decode(Str))
t> will work, but if speed matters, then it's rather bad.

That's not right: the assumption that

  lowercasemapping(a)=lowercasemapping(b)

is the same as

  IsSameText(a,b)

is wrong at the Unicode level.

-- 
Best regards,
JoshyFun
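One possible illustration of this point, chosen for this sketch and not mentioned in the thread, is the Greek sigma. The capital Σ (U+03A3) lowercases to σ (U+03C3), while the final form ς (U+03C2) is already lowercase and maps to itself, although Unicode simple case folding treats all three as equal. The functions SimpleLower and SimpleFold are hypothetical stand-ins covering only these three code points.

```pascal
program LowerVsFold;
{$mode objfpc}{$H+}

const
  CapitalSigma = $03A3;
  SmallSigma   = $03C3;
  FinalSigma   = $03C2;

// Hypothetical stand-in for a simple lowercase mapping, restricted
// to the code points used here: only the capital letter changes.
function SimpleLower(cp: Cardinal): Cardinal;
begin
  if cp = CapitalSigma then
    Result := SmallSigma
  else
    Result := cp;
end;

// Hypothetical stand-in for simple case folding, restricted to the
// same code points: both the capital and the final form fold to σ.
function SimpleFold(cp: Cardinal): Cardinal;
begin
  case cp of
    CapitalSigma, FinalSigma: Result := SmallSigma;
  else
    Result := cp;
  end;
end;

begin
  // Lowercasing keeps the two letters apart; folding makes them equal.
  WriteLn('equal after lowercase mapping: ',
          SimpleLower(CapitalSigma) = SimpleLower(FinalSigma));
  WriteLn('equal after case folding     : ',
          SimpleFold(CapitalSigma) = SimpleFold(FinalSigma));
end.
```

A comparison built on per-character lowercase mapping therefore reports Σ and ς as different, whereas a folding-based comparison reports them as the same text. Mappings that expand one character into several add a further complication that this sketch leaves out.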
Re[2]: [fpc-pascal] Case insensitive comparison of strings with non-ascii characters
Hello FPC-Pascal,

Thursday, July 23, 2009, 2:02:38 PM, you wrote:

LAPC> Hi, i'm aware that the performance is bad although had not tested like
LAPC> you did, but at this point i'd like to stick with a solution that fpc
LAPC> provides natively since it's being used in a fpc component
LAPC> (TSqlite3Dataset).

Writing Unicode functions in UTF-8 is almost nonsense. Most Unicode operations are not like what we are used to in the ANSI world; in Unicode there is also a language context. For example, in Spanish 'á' uppercases to 'Á', but in other languages they are different letters.

There are some functions named "general case" which do a reasonable job for the most widely used languages and only introduce errors in the less widespread ones. I have some implementations for the general case, not heavily tested, such as sametext, upper, lower and a few more. The code is not optimized, but if somebody wants to use them please ask :)

SameText is especially CPU-intensive, as each string must be transformed several times before the comparison if some complex characters are present.

-- 
Best regards,
JoshyFun
Re: [fpc-pascal] cocoa usage
> I am trying to get into cocoa...
> using lazarus from svn TRUNK I get some files in lcl/units/i386-darwin/cocoa

Remove these compiled unit files.

> can someone clarify what to use? Shall I copy the lazarus-cc stuff into the
> lcl stuff above?

You don't need to copy from ccr to lcl.

Get the cocoa units from lazarus-ccr; you should find cocoa.lpk in the lazarus-ccr/bindings/pascocoa/ dir. Install the package. After the package is installed you can rebuild cocoa-LCL.

Don't expect too much from the widgetset.

thanks,
dmitry
Re: [fpc-pascal] cocoa usage
Installing the cocoa package should set up the search path for you, so you don't need to modify fpc.cfg or add compiler options to a Lazarus project.

If you don't use the cocoa package, then you need to add the unit search paths manually, in whichever way you find easier (modifying fpc.cfg or the Lazarus project compiler options).

thanks,
dmitry