On Wed, 11 May 2016, Jonas Maebe wrote:


Graeme Geldenhuys wrote on Wed, 11 May 2016:

In my application I enable unicodestring mode. So I'm reading data from
a Firebird database. The data is stored as UTF-8 in a VarChar field. The
DB connection is set up as UTF-8.  Now let's assume my FreeBSD box is set
up with a default encoding of Latin-1.

So I read the UTF-8 data from the database, and somewhere inside the SqlDB
code it gets assigned to a TField's String property, i.e. a UTF-8 ->
Latin-1 conversion.

This depends on how sqlDB is implemented, and I have absolutely no clue about that (other than what LacaK wrote).

As mentioned at http://wiki.freepascal.org/FPC_Unicode_support#Dynamic_code_page , conversions on assignment only happen when the *declared* code page of the target string is different from that of the source string (other than the special case for RawByteString). So if sqlDB only uses plain String with {$h+} and/or AnsiString, then no conversions will happen anywhere in the scenario you describe since it will just assign ansistrings with declared code page CP_ACP to each other.
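A minimal sketch of that rule, assuming FPC 3.x. The program and variable names are made up for illustration and this is not sqlDB code. Both variables have declared code page CP_ACP, so the assignment converts nothing, even though the bytes are tagged as UTF-8:

```pascal
program declaredcp;
{$mode objfpc}{$H+}
var
  a, b: AnsiString;   // both have declared code page CP_ACP
begin
  // Build the UTF-8 bytes for U+00E9 by hand (hypothetical sample data).
  SetLength(a, 2);
  a[1] := #$C3;
  a[2] := #$A9;
  // Tag the bytes as UTF-8 without converting them.
  SetCodePage(RawByteString(a), CP_UTF8, False);
  // Same declared code page on both sides: no conversion on assignment.
  b := a;
  WriteLn('dynamic code page of b: ', StringCodePage(b));
  WriteLn('length of b in bytes:   ', Length(b));
end.
```

The dynamic code page simply travels along with the string data; only the declared code pages of source and target decide whether a conversion is attempted.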

This is the case.


Then I read the field value into my application, i.e. Latin-1 -> UTF-16.

If sqlDB correctly sets the dynamic codepage of the strings it creates via SetCodePage(x,CP_UTF8,false), then when you assign those strings with declared codepage = CP_ACP and dynamic code page CP_UTF8 to your unicodestrings, they will be converted from UTF-8 to UTF-16 at that point.

It does not do this.


If it does not set the dynamic code page of the strings it creates to the appropriate encoding, then you will indeed get data corruption at this point, because the UTF-8 encoded data will be interpreted as Latin-1 and then be "converted" to UTF-16.
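To make the two cases concrete, here is a small sketch, assuming FPC 3.x with cwstring on Unix. The constant LATIN1_CP and the variable names are invented for this example; it only imitates what a database layer might do and is not taken from sqlDB. The same two UTF-8 bytes are tagged once correctly and once as Latin-1 before being assigned to a unicodestring:

```pascal
program tagdemo;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring;   // installs a widestring manager that can convert code pages
{$endif}
const
  LATIN1_CP = 28591;   // ISO-8859-1; name made up for this sketch
var
  good, bad: AnsiString;
  u: UnicodeString;
begin
  // The UTF-8 encoding of U+00E9, as it might arrive from the server.
  SetLength(good, 2);
  good[1] := #$C3;
  good[2] := #$A9;
  SetLength(bad, 2);
  bad[1] := #$C3;
  bad[2] := #$A9;
  // Tag only, do not convert (third parameter is False).
  SetCodePage(RawByteString(good), CP_UTF8, False);
  SetCodePage(RawByteString(bad), LATIN1_CP, False);
  // Declared code pages differ (CP_ACP vs UTF-16), so a conversion runs,
  // and it is driven by the dynamic code page of the source.
  u := good;
  WriteLn('tagged as UTF-8:   ', Length(u), ' UTF-16 code unit(s)');
  u := bad;
  WriteLn('tagged as Latin-1: ', Length(u), ' UTF-16 code unit(s)');
end.
```

With the wrong tag, each byte of the UTF-8 sequence is treated as a separate Latin-1 character, which is the corruption described above.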

That is what happens.

Currently, the ONLY provision that is made is that, if SQLDB somehow detects
that the server uses UTF8, it will use an ansistring and allocate 4 bytes in
the buffers for each character.

But it currently does not set the code page of the allocated string to UTF8.

For dealing with such code, which is not yet codepage-aware, the default situation is neither worse nor better than it was in previous FPC versions: exactly the same would happen there. However, in FPC 3.x you can generally fix it by using SetMultiByteConversionCodePage() to change the default code page for ansistrings to what you know/want the encoding of ansistrings to be, like Lazarus does.
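A sketch of that workaround, assuming FPC 3.x and that every plain ansistring in the program really holds UTF-8 (the variable names are hypothetical). Note that the setting is global to the program, not per string:

```pascal
program defaultcp;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring;   // widestring manager for code page conversions on Unix
{$endif}
var
  s: AnsiString;      // declared code page CP_ACP
  u: UnicodeString;
begin
  // From here on, CP_ACP is taken to mean UTF-8 for conversions.
  SetMultiByteConversionCodePage(CP_UTF8);
  // UTF-8 bytes for U+00E9, stored in an untagged plain ansistring.
  SetLength(s, 2);
  s[1] := #$C3;
  s[2] := #$A9;
  // The bytes are now decoded as UTF-8 when converted to UTF-16.
  u := s;
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);
  WriteLn('UTF-16 code units:     ', Length(u));
end.
```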

If Lazarus already calls SetMultiByteConversionCodePage, then setting it to
something else will wreak havoc.

This matter must be decided at the TDataset level: it should have a property
to determine the character set of string fields (and possibly different for
each field, since this can differ in the database on a field basis).


All of this is moreover completely independent of {$modeswitch unicodestrings}, since that is just a shortcut to make String an alias for UnicodeString in the current compilation module (and Char for WideChar, and PChar for PWideChar).

Honestly, I don't understand this preoccupation with {$modeswitch unicodestrings}.

It just means that

Var
 a : string;

is read by the compiler as

Var
 a : unicodestring;

No more, no less.
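A small complete program to see this, as a sketch (program and variable names are made up). It only inspects sizes, so it involves no code page conversion at all:

```pascal
program usdemo;
{$mode objfpc}
{$modeswitch unicodestrings}
var
  a: string;   // read by the compiler as UnicodeString in this module
begin
  a := 'abc';
  // Char is an alias for WideChar here, so elements are 2 bytes wide.
  WriteLn('SizeOf(Char): ', SizeOf(Char));
  WriteLn('SizeOf(a[1]): ', SizeOf(a[1]));
  WriteLn('Length(a):    ', Length(a));
end.
```

Removing the modeswitch line turns a back into a single-byte string type in this unit only; other units, sqlDB included, are unaffected either way.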

Michael.
_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
