Re: [fpc-pascal] RTL and Unicode Strings

Michael Van Canneyt Wed, 11 May 2016 02:50:22 -0700


On Wed, 11 May 2016, Jonas Maebe wrote:

Graeme Geldenhuys wrote on Wed, 11 May 2016:
In my application I enable unicodestring mode. So I'm reading data from
a Firebird database. The data is stored as UTF-8 in a VarChar field. The
DB connection is set up as UTF-8.  Now let's assume my FreeBSD box is set
up with a default encoding of Latin-1.

So I read the UTF-8 data from the database, somewhere inside the SqlDB
code it gets assigned to a TField's String property. ie: UTF-8 ->
Latin-1 conversion.
This depends on how sqlDB is implemented, and I have absolutely no clue
about that (other than what LacaK wrote).
As mentioned at
http://wiki.freepascal.org/FPC_Unicode_support#Dynamic_code_page ,
conversions on assignment only happen when the *declared* code page of the
target string is different from that of the source string (other than the
special case for RawByteString). So if sqlDB only uses plain String with
{$h+} and/or AnsiString, then no conversions will happen anywhere in the
scenario you describe since it will just assign ansistrings with declared
code page CP_ACP to each other.


This is the case.
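
To illustrate the rule above, here is a minimal sketch (not sqlDB code; the
program and variable names are made up for the example). It assigns one
plain AnsiString to another, both with declared code page CP_ACP, and
checks that the bytes arrive untouched:

```pascal
program NoConversionDemo;

{$mode objfpc}{$H+}

var
  Source, Target: AnsiString;  // both have declared code page CP_ACP
begin
  // Build the UTF-8 bytes for U+00E9 (C3 A9) by hand, so that the
  // encoding of this source file plays no role.
  SetLength(Source, 2);
  Source[1] := #$C3;
  Source[2] := #$A9;

  // Same declared code page on both sides: no conversion takes place.
  Target := Source;

  WriteLn('Length(Target) = ', Length(Target));
  WriteLn('Bytes          = ', Ord(Target[1]), ' ', Ord(Target[2]));
  // The dynamic code page reported here depends on the system setup.
  WriteLn('StringCodePage = ', StringCodePage(Target));
end.
```

The length stays 2 and the bytes stay 195 and 169, whatever the default
system code page is. Only the code page label attached to the string
depends on the environment.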

Then I read the field value into my application. ie: Latin-1 -> UTF-16
If sqlDB correctly sets the dynamic codepage of the strings it creates via
SetCodePage(x,CP_UTF8,false), then when you assign those strings with
declared codepage = CP_ACP and dynamic code page CP_UTF8 to your
unicodestrings, they will be converted from UTF-8 to UTF-16 at that point.


It does not do this.

If it does not set the dynamic code page of the strings it creates to the
appropriate encoding, then you will indeed get data corruption at this
point, because the UTF-8 encoded data will be interpreted as Latin-1 and
then be "converted" to UTF-16.


That is what happens.
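
The difference between the two cases can be sketched in a standalone
program. This is an illustration only, not what sqlDB does internally:
MakeUtf8Bytes and CP_LATIN1 are names invented for the example, and the
Latin-1 default is simulated by labelling the string with code page 28591
(ISO-8859-1), assuming the platform's conversion backend knows that code
page.

```pascal
program DynamicCodePageDemo;

{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring;  // widestring manager that performs real conversions on Unix
{$endif}

const
  CP_LATIN1 = 28591;  // ISO-8859-1; constant local to this sketch

function MakeUtf8Bytes: AnsiString;
begin
  // C3 A9 is the UTF-8 encoding of U+00E9, built byte by byte.
  SetLength(Result, 2);
  Result[1] := #$C3;
  Result[2] := #$A9;
end;

var
  Mislabelled, Labelled: AnsiString;
  U: UnicodeString;
begin
  // Case 1: UTF-8 bytes, but the string claims to be Latin-1.
  Mislabelled := MakeUtf8Bytes;
  SetCodePage(RawByteString(Mislabelled), CP_LATIN1, False);
  U := Mislabelled;
  WriteLn('mislabelled: ', Length(U), ' UTF-16 code unit(s)');

  // Case 2: same bytes, labelled as UTF-8 without converting them.
  Labelled := MakeUtf8Bytes;
  SetCodePage(RawByteString(Labelled), CP_UTF8, False);
  U := Labelled;
  WriteLn('labelled:    ', Length(U), ' UTF-16 code unit(s)');
end.
```

If the conversions behave as described above, the mislabelled string
should come out as two code units (each byte treated as its own Latin-1
character) and the labelled one as a single code unit, U+00E9.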

Currently, the ONLY provision that is made is that, if SQLDB detects somehow
that the server uses UTF8, it will use an ansistring and allocate 4 bytes in
the buffers for each character.

But it currently does not set the code page of the allocated string to UTF8.

For dealing with such code, which is not yet codepage-aware, by default the
situation is no worse or no better than it was in previous FPC versions:
exactly the same would happen there. However, in FPC 3.x you can generally
fix it by changing the default code page for ansistrings using
SetMultiByteConversionCodePage() to what you know/want to be the encoding of
ansistrings, like Lazarus does.


If Lazarus already sets SetMultiByteConversionCodePage, then it will wreak
havoc to set it to something else.

This matter must be decided at the TDataset level: it should have a property
to determine the character set of string fields (and possibly different for
each field, since this can differ in the database on a field basis).
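
For reference, the SetMultiByteConversionCodePage() workaround mentioned
above looks roughly like this in a standalone console program. It is a
sketch under the assumption that nothing else in the process (such as the
Lazarus LCL) has already chosen a default code page, and that the call is
made before any ansistrings are created:

```pascal
program DefaultCodePageDemo;

{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring;  // widestring manager that performs real conversions on Unix
{$endif}

var
  S: AnsiString;
  U: UnicodeString;
begin
  // Declare, once and early, that CP_ACP ansistrings hold UTF-8.
  SetMultiByteConversionCodePage(CP_UTF8);

  // UTF-8 bytes for U+00E9, built by hand.
  SetLength(S, 2);
  S[1] := #$C3;
  S[2] := #$A9;

  WriteLn('DefaultSystemCodePage = ', DefaultSystemCodePage);
  WriteLn('StringCodePage(S)     = ', StringCodePage(S));

  // The assignment now treats the bytes of S as UTF-8.
  U := S;
  WriteLn('Length(U)             = ', Length(U));
end.
```

This is a process-wide setting, which is why changing it underneath a
framework that has already set it is risky.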

All of this is moreover completely independent of {$modeswitch
unicodestrings}, since that is just a shortcut to make String an alias for
UnicodeString in the current compilation module (and Char for WideChar, and
PChar for PWideChar).


Honestly, I don't understand this preoccupation with
{$modeswitch unicodestrings}.

It just means that

Var
 a : string;

is read by the compiler as

Var
 a : unicodestring;

No more, no less.
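
A small complete program makes the aliasing visible (the program name is
made up for the example). With the modeswitch enabled, both Char and the
elements of a String variable are two bytes wide:

```pascal
program ModeSwitchDemo;

{$mode objfpc}{$H+}
{$modeswitch unicodestrings}

var
  a: string;  // read by the compiler as UnicodeString in this module
begin
  a := 'abc';
  WriteLn('SizeOf(Char) = ', SizeOf(Char));  // Char is WideChar here
  WriteLn('SizeOf(a[1]) = ', SizeOf(a[1]));  // elements are WideChar too
end.
```

Removing the {$modeswitch unicodestrings} line turns both sizes back into
1, and nothing else about the program changes.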

Michael.
_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
