Michael Schnell wrote on Tue, 19 Apr 2016:

On 04/19/2016 08:22 AM, Jonas Maebe wrote:
When any {$codepage xxx} directive is specified, string constants in the source are represented in a way that makes lossless conversion to any other code page possible. This conversion to the target code page is performed at compile time where possible (when the target code page cannot change at run time), and otherwise at run time.

Of course I do understand that.

But anyway, AFAIK, UTF-8 is already a lossless encoding, so I don't see a compelling need to convert it to UTF-16 at compile time. And as far as I understand, unless the user takes special measures, the executable will work with an 8-bit encoding, very likely UTF-8, so holding the constants as UTF-16 increases both memory and CPU usage.

The reasons are:

a) The FPC compiler binary itself prior to 3.x did not contain any UTF-8 encoding support. All it could do was convert the source file code page to UTF-16.

b) FPC's widestring manager does not contain any interface to directly convert from one 8-bit encoding to another, only from 16-bit to 8-bit and vice versa. This also made it useless to convert to UTF-8 at compile time, since there was no way to convert the result to another code page at run time except by making a round trip via UTF-16 anyway. The reason is that these helpers were already necessary to convert between widestring/unicodestring and other string types when assigning such variables to each other.

c) Changing b) would require a lot of testing, because not all code page conversion libraries/OS interfaces support converting from any arbitrary character set to any other arbitrary character set. While it is likely that they all support converting from arbitrary code pages to UTF-8 and back (like they do for UTF-16), this would still have to be tested. Additionally, such an interface would undoubtedly also start being used for other code pages by people unaware of this limitation. Adding an interface limited to converting from/to UTF-8 would be another option to address that, though.
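To illustrate the round trip described in b), here is a minimal sketch rather than the RTL's actual implementation. The type names TLatin1String and TCp1252String are hypothetical aliases introduced only for this example, and it assumes FPC 3.0 or later with a widestring manager that knows both code pages (on Unix that normally means using the cwstring unit).

```pascal
program RoundTripDemo;
{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring;  // installs a widestring manager with code page support on Unix
{$endif}

type
  // Hypothetical aliases, for illustration only
  TLatin1String = type AnsiString(28591);  // ISO-8859-1
  TCp1252String = type AnsiString(1252);   // Windows-1252

var
  Src: TLatin1String;
  Pivot: UnicodeString;
  Dst: TCp1252String;
begin
  // Store one raw byte ($E9, e-acute in ISO-8859-1) without any conversion
  SetLength(Src, 1);
  Src[1] := #$E9;

  // There is no direct 8-bit -> 8-bit helper, so go via UTF-16:
  Pivot := Src;  // 8-bit (ISO-8859-1) -> UTF-16
  Dst := Pivot;  // UTF-16 -> 8-bit (Windows-1252)

  WriteLn('source code page: ', StringCodePage(Src));
  WriteLn('target code page: ', StringCodePage(Dst));
  WriteLn('UTF-16 code units in pivot: ', Length(Pivot));
end.
```

The explicit UnicodeString variable only makes the intermediate step visible; the point is that both conversion helpers have UTF-16 on one side.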

In the end it would be a lot of work, result in a lot of extra code that may not work everywhere (or in specialised routines), and it would be for a use case you can already address yourself if you absolutely want to be completely UTF-8-centric: you can declare your string variables as UTF8String, since then the conversion to UTF-8 for constant strings will happen at compile time. If your DefaultSystemCodePage is CP_UTF8, no extra conversion will happen when assigning/converting these variables to regular ansistrings. Furthermore, converting from UTF-8 to other code pages is probably slower than from UTF-16, since UTF-8 is a more complex encoding for most characters.
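As a minimal sketch of the UTF8String approach (assuming FPC 3.0 or later and a source file that is really saved as UTF-8; the program and variable names are made up for the example):

```pascal
program Utf8ConstDemo;
{$mode objfpc}{$H+}
{$codepage utf8}  // assumption: this source file is saved as UTF-8

var
  U: UTF8String;  // declared code page is always CP_UTF8
  A: AnsiString;  // declared code page follows DefaultSystemCodePage
begin
  // The target code page is fixed by the declared type, so the
  // constant can be converted to UTF-8 at compile time.
  U := 'Grüße';

  // No conversion is needed here if DefaultSystemCodePage = CP_UTF8;
  // otherwise a run-time conversion takes place.
  A := U;

  WriteLn('code page of U: ', StringCodePage(U));
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);
  WriteLn('byte length of A: ', Length(A));
end.
```

If DefaultSystemCodePage is not CP_UTF8 on your platform, calling SetMultiByteConversionCodePage(CP_UTF8) at program start is one way to change it, at the cost of affecting all regular ansistring conversions in the program.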

The fact that other things are less convenient if you use UTF8String is the price you will have to pay for such code specialisation (which probably won't make any noticeable difference in 99.999% of the cases anyway).


Jonas
_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
