Re: [fpc-pascal] FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Jonas Maebe Wed, 20 Apr 2016 02:26:49 -0700


Michael Schnell wrote on Tue, 19 Apr 2016:

> On 04/19/2016 08:22 AM, Jonas Maebe wrote:
> > When any {$codepage xxx} directive is specified, string constants in the source are represented in a way that makes lossless conversion to any other code page possible. This conversion to the target code page is performed at compile time where possible (when the target code page cannot change at run time), and otherwise at run time.
> Of course I do understand that.
> But anyway, AFAIK, UTF8 already is a way of lossless coding, so I don't see a forcing necessity to convert that to UTF16 at compile time. And as far as I understand, if the user does not take some means, the executable will work with 8 bit coding and very likely with UTF8, so holding the constants as UTF16 increases as well memory as CPU resource usage.


The reasons are:

a) The FPC compiler binary itself prior to 3.x did not contain any UTF-8 encoding support. All it could do was convert the source file code page to UTF-16.

b) FPC's widestring manager does not contain any interface to directly convert from one 8 bit encoding to another, only from 16 bit to 8 bit and vice versa (which also made it useless to convert to UTF-8 at compile time, since there was no way to convert it to another code page at run time except by making a round trip via UTF-16 anyway). The reason is that these helpers were already necessary to convert between widestring/unicodestring and other types when assigning such variables to each other.

c) Changing b) would require a lot of testing, because not all code page conversion libraries/OS interfaces support converting from any arbitrary character set to any other arbitrary character set. While it is likely that they all support converting from arbitrary code pages to UTF-8 and back (like they do for UTF-16), this would still have to be tested, and additionally such an interface would undoubtedly also start being used for other code pages by people unaware of this limitation. Adding an interface limited to converting from/to UTF-8 would be another option to address that, though.

In the end it would be a lot of work, result in a lot of extra code that may not work everywhere (or in specialised routines), and it would be for a use case you can already address yourself if you absolutely want to be completely UTF-8-centric: you can declare your string variables as UTF8String, since then the conversion to UTF-8 for constant strings will happen at compile time. If your DefaultSystemCodePage is CP_UTF8, no extra conversion will happen when assigning/converting these variables to regular ansistrings. Furthermore, converting from UTF-8 to other code pages is probably slower than from UTF-16, since UTF-8 is a more complex encoding for most characters.

The fact that other things are less convenient if you use UTF8String is the price you will have to pay for such code specialisation (which probably won't make any noticeable difference in 99.999% of the cases anyway).
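
A minimal sketch of the UTF8String approach described above, not a definitive recipe. It assumes FPC 3.x, a source file that is actually saved as UTF-8, and that forcing the default code page to UTF-8 via SetMultiByteConversionCodePage is acceptable for the program; the program name, the variable names and the sample string are purely illustrative.

```pascal
program Utf8ConstDemo;

{$mode objfpc}{$H+}
{$codepage utf8}   // assumption: this source file is saved as UTF-8

uses
  {$ifdef unix}cwstring,{$endif}  // installs a widestring manager on Unix
  SysUtils;

var
  U: UTF8String;   // declared code page is always CP_UTF8
  A: AnsiString;   // follows DefaultSystemCodePage

begin
  // Assumption for this sketch: make the default system code page UTF-8,
  // so that regular ansistrings are treated as UTF-8 as well.
  SetMultiByteConversionCodePage(CP_UTF8);

  U := 'Grüße';    // the constant can be converted to UTF-8 at compile time
  A := U;          // source and target code page are both UTF-8

  WriteLn('code page of U:   ', StringCodePage(U));
  WriteLn('code page of A:   ', StringCodePage(A));
  WriteLn('byte length of U: ', Length(U));
end.
```

Printing StringCodePage for both variables is a simple way to check which code page the data actually ends up in on a given platform.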



Jonas
_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
