date:20090725

Re: [fpc-pascal] Case insensitive comparison of strings with non-ascii characters

2009-07-25 Thread theo

@Luiz Americo

Your code
WideCompareText(UTF8Decode(Key), UTF8Decode(Str))
will work, but if speed matters, then it's rather bad.

I've tried to make a faster function for UTF-8:

uses unicodeinfo, LCLProc;

function UTF8CompareText(s1, s2: UTF8String): Integer;
var
  u1, u2: Ucs4Char;
  u1l, u2l: longint;
  BytePos1, Len1, SLen1: integer;
  BytePos2, Len2, SLen2: integer;
begin
  Result := 0;
  BytePos1 := 1;
  BytePos2 := 1;
  SLen1 := System.Length(s1);
  SLen2 := System.Length(s2);

  if SLen1 <> SLen2 then  //Assuming lower/uppercase representations have the same byte length
  begin
    if SLen1 > SLen2 then Result := 1 else Result := -1;
    exit;
  end;
  if SLen1 = 0 then exit;  // both strings are empty: nothing to decode

  repeat
    // Decode one code point from each string and advance by its byte length
    u1 := UTF8CharacterToUnicode(@s1[BytePos1], Len1);
    inc(BytePos1, Len1);
    u2 := UTF8CharacterToUnicode(@s2[BytePos2], Len2);
    inc(BytePos2, Len2);
    if u1 <> u2 then
    begin
      // Map case only when the raw code points differ
      {$IFDEF useunicodinfo}
      u1l := unicodeinfo.utf8proc_get_property(u1)^.lowercase_mapping;
      if u1l <> -1 then u1 := u1l;
      u2l := unicodeinfo.utf8proc_get_property(u2)^.lowercase_mapping;
      if u2l <> -1 then u2 := u2l;
      {$ELSE}
      u1 := UCS4Char(WideUpperCase(WideChar(u1))[1]);
      u2 := UCS4Char(WideUpperCase(WideChar(u2))[1]);
      {$ENDIF}
      if u1 <> u2 then
      begin
        Result := u1 - u2;
        exit;
      end;
    end;
  until (BytePos1 > SLen1) or (BytePos2 > SLen2);
end;


Some numbers for my system (Linux), where WideCompareText is the function
you use now, WideUpperCase is the above function compiled without
useunicodinfo, and unicodeinfo is the above function with useunicodinfo
defined. See also
http://wiki.lazarus.freepascal.org/Theodp


Comparing identical Strings of 322 Chars 1 times
WideCompareText: 785ms
unicodeinfo: 75ms
WideUpperCase: 74ms

Comparing Strings of 322 Chars 1 times where the 3rd char differs
WideCompareText: 268ms
unicodeinfo: 3ms
WideUpperCase: 8ms

Comparing identical Text of 322 Chars 1 times where one Text is all
uppercase
WideCompareText: 810ms
unicodeinfo: 121ms
WideUpperCase: 1076ms

Regards Theo

___
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-pascal

Re: [fpc-pascal] Case insensitive comparison of strings with non-ascii characters

2009-07-25 Thread Jonas Maebe



On 25 Jul 2009, at 17:46, theo wrote:


> if SLen1 <> SLen2 then  //Assuming lower/uppercase representations
> have the same byte length


That is a wrong assumption. E.g., the lowercase version of I  
(uppercase i, a single byte) in Turkish is ı (an "i" without a dot,  
definitely not a single-byte character).
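
The byte counts behind this objection can be checked with a small sketch. It assumes UTF-8 and writes the encodings as explicit bytes so the result does not depend on the source file encoding; the program and procedure names are made up for illustration.

```pascal
program TurkishIByteLen;
{$mode objfpc}{$H+}

// Prints the number of bytes in a UTF-8 encoded string.
procedure ShowLen(const Name: AnsiString; const s: AnsiString);
begin
  WriteLn(Name, ': ', Length(s));
end;

begin
  ShowLen('U+0049 capital I', 'I');               // plain ASCII
  ShowLen('U+0131 dotless small i', #$C4#$B1);    // UTF-8 bytes C4 B1
  ShowLen('U+0069 small i', 'i');                 // plain ASCII
  ShowLen('U+0130 dotted capital I', #$C4#$B0);   // UTF-8 bytes C4 B0
end.
```

Both Turkish case pairs (I/ı and İ/i) combine a one-byte and a two-byte encoding, so equal byte length cannot be used as a precondition for a case-insensitive match.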



Jonas

Re: [fpc-pascal] Case insensitive comparison of strings with non-ascii characters

2009-07-25 Thread Luiz Americo Pereira Camara


theo wrote:

> @Luiz Americo
>
> Your code
> WideCompareText(UTF8Decode(Key), UTF8Decode(Str))
> will work, but if speed matters, then it's rather bad.
  


Hi, I'm aware that the performance is bad, although I had not tested it like
you did. At this point, though, I'd like to stick with a solution that fpc
provides natively, since it's being used in an fpc component
(TSqlite3Dataset).


In the last revision I switched to the ANSI versions of the functions to avoid
converting the Key at each comparison. See
http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/packages/fcl-db/src/sqlite/customsqliteds.pas?view=log#rev13431


Anyway, it is clear that functions to handle UTF-8, and Unicode in general,
are missing in fpc...

> I've tried to make a faster function for UTF-8:

... maybe your function can be used as a base for future development. Add
a new function to the widestringmanager?


Luiz


Re: [fpc-pascal] Case insensitive comparison of strings with non-ascii characters

2009-07-25 Thread theo


>>  if SLen1 <> SLen2 then  //Assuming lower/uppercase representations
>> have the same byte length
>
> That is a wrong assumption. E.g., the lowercase version of I
> (uppercase i, a single byte) in Turkish is ı (an "i" without a dot,
> definitely not a single-byte character).

OK, thanks. That's why I added the comment, because I was not sure ;-)
So then one should compare UTF8Lengths, or probably forget that shortcut
altogether, because calculating UTF8Lengths is not cheap.
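
To illustrate why such a length is not cheap: counting code points in UTF-8 means touching every byte. A minimal sketch follows, assuming well-formed UTF-8; CodePointCount is a hypothetical helper written for this example, not the LCL's UTF8Length.

```pascal
program CountCodePoints;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points by skipping UTF-8
// continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8.
function CodePointCount(const s: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

begin
  WriteLn(CodePointCount('I'));        // U+0049: 1 byte, 1 code point
  WriteLn(CodePointCount(#$C4#$B1));   // U+0131: 2 bytes, 1 code point
end.
```

Even equal code point counts would not be a fully safe shortcut if full case mappings are considered, since for example ß uppercases to SS.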

Do Turkish systems behave differently for WideLowerCase('I')?
Will they return $0131 instead of $0069?

Regards Theo


Re: [fpc-pascal] Case insensitive comparison of strings with non-ascii characters

2009-07-25 Thread Jonas Maebe



On 25 Jul 2009, at 19:03, theo wrote:


> Do Turkish systems behave differently for WideLowerCase('I')?
> Will they return $0131 instead of $0069?


They should, since the uppercase version of i is İ there (i.e., a  
capital I with a dot on top). See e.g. http://www.i18nguy.com/unicode/turkish-i18n.html



Jonas

Re[2]: [fpc-pascal] Case insensitive comparison of strings with non-ascii characters

2009-07-25 Thread JoshyFun

Hello FPC-Pascal,

Saturday, July 25, 2009, 5:46:39 PM, you wrote:

t> @Luiz Americo

t> Your code
t> WideCompareText(UTF8Decode(Key), UTF8Decode(Str))
t> will work, but if speed matters, then it's rather bad.

That's not right. The assumption that:

lowercasemapping(a)=lowercasemapping(b)

is equivalent to:

IsSameText(a,b)

does not hold for Unicode in general.
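
One example of the kind of mismatch that may be meant here is the Greek sigma: the simple lowercase mappings of Σ (U+03A3) and ς (U+03C2) differ, while Unicode case folding maps both to σ (U+03C3). The sketch below hard-codes just these three code points; SimpleLower and SimpleFold are hypothetical names and not a general mapping.

```pascal
program SigmaFold;
{$mode objfpc}{$H+}

// Simple lowercase mapping, hard-coded for the sigma family only.
function SimpleLower(cp: Cardinal): Cardinal;
begin
  case cp of
    $03A3: Result := $03C3;  // capital sigma -> small sigma
  else
    Result := cp;            // small sigma and final sigma are already lowercase
  end;
end;

// Simple case folding, hard-coded for the sigma family only.
function SimpleFold(cp: Cardinal): Cardinal;
begin
  case cp of
    $03A3, $03C2: Result := $03C3;  // capital and final sigma both fold to small sigma
  else
    Result := cp;
  end;
end;

begin
  // Comparing capital sigma with final sigma:
  WriteLn(SimpleLower($03A3) = SimpleLower($03C2));  // FALSE
  WriteLn(SimpleFold($03A3) = SimpleFold($03C2));    // TRUE
end.
```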

-- 
Best regards,
 JoshyFun


Re[2]: [fpc-pascal] Case insensitive comparison of strings with non-ascii characters

2009-07-25 Thread JoshyFun

Hello FPC-Pascal,

Thursday, July 23, 2009, 2:02:38 PM, you wrote:

LAPC> Hi, i'm aware that the performance is bad although had not tested like
LAPC> you did, but at this point i'd like to stick with a solution that fpc
LAPC> provides natively since it's being used in a fpc component 
LAPC> (TSqlite3Dataset).

Writing Unicode functions for UTF-8 is almost nonsense: most Unicode
operations do not work the way we are used to in the ANSI world. In Unicode
there is also a language context. For example, in Spanish 'á' uppercases to
'Á', but in other languages they are different letters.

There are some functions known as "general case" which do a
reasonable job for the most widely used languages and only introduce errors in
less widespread ones.

I have some implementations for the general case, not heavily tested,
such as sametext, upper, lower and a few more.

The code is not optimized, but if somebody wants to use them, please ask
:)

SameText is especially CPU-intensive, as each string must
be transformed several times before the comparison if some complex
characters are present.

-- 
Best regards,
 JoshyFun


Re: [fpc-pascal] cocoa usage

2009-07-25 Thread dmitry boyarintsev

> I am trying to get into cocoa...
> using lazarus from svn TRUNK I get some files in lcl/units/i386-darwin/cocoa
Remove these compiled unit files.

> can someone clarify what to use? Shall I copy the lazarus-ccr stuff into the
> lcl stuff above?
You don't need to copy from ccr to lcl.

Download the cocoa units from lazarus-ccr.

You should find cocoa.lpk in the lazarus-ccr/bindings/pascocoa/ dir.
Install the package.
After the package is installed you can rebuild cocoa-LCL.

Don't expect too much from the widgetset.

thanks,
dmitry

Re: [fpc-pascal] cocoa usage

2009-07-25 Thread dmitry boyarintsev

Installing the cocoa package should set up the search path for you,
so you don't need to modify fpc.cfg or add compiler options to a Lazarus project.

If you don't use the cocoa package, then you need to add the unit search
paths manually, in whichever way you find easier (modifying fpc.cfg or the
Lazarus project compiler options).
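
For the manual route, a unit search path is given to the compiler with the -Fu option. A minimal fpc.cfg sketch follows; the path is a placeholder and has to be adjusted to wherever the lazarus-ccr checkout and its cocoa units actually live.

```
# Placeholder path: point this at the directory containing the cocoa units
-Fu/path/to/lazarus-ccr/bindings/pascocoa
```

The same directory can instead be added to a Lazarus project's compiler options as an additional unit path.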

thanks,
dmitry
