On 16/09/18 13:31, Martok wrote:
> Let's say the user directs a program to "treat this file as $codepage".
> Therefore, I need to read it as this codepage and fill internal data structures
> with strings in that codepage, while keeping other operations in the system
> codepage (so I can't just change DefaultSystemCodepage). Does that mean that
> there is no way to do this with native strings?

I can only guess at the ideas behind Embarcadero's introduction of the codepage-aware ansistrings, but I think the main purpose was to make it easier to convert existing code, written for a particular system codepage, into a program that works with unicodestring.

Hence, the codepage-aware string functionality supports setting the desired code page at the input and output level, and everything in between is expected to be performed using either unicodestring or DefaultSystemCodePage. FPC slightly extended this so that the encoding of system file names (and the code page returned by related routines) can be specified separately. That way you can easily set DefaultSystemCodePage to CP_UTF8 (or something else) regardless of what codepage the system APIs used by the RTL expect.
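As a rough illustration of that separation, here is a minimal sketch that assumes FPC 3.0 or later, where the System unit exposes DefaultSystemCodePage, DefaultFileSystemCodePage and DefaultRTLFileSystemCodePage as separate variables. The program name is made up for the example.

```pascal
program ShowCodePages;
{$mode objfpc}{$H+}

begin
  // Use UTF-8 for plain ansistrings inside the program. This changes
  // DefaultSystemCodePage only.
  SetMultiByteConversionCodePage(CP_UTF8);

  // The file system related code pages are tracked separately.
  WriteLn('DefaultSystemCodePage:        ', DefaultSystemCodePage);
  WriteLn('DefaultFileSystemCodePage:    ', DefaultFileSystemCodePage);
  WriteLn('DefaultRTLFileSystemCodePage: ', DefaultRTLFileSystemCodePage);
end.
```

The values printed for the two file system code pages depend on the platform and on how the program was started.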

In general, a program will seldom have built-in support for analysing and manipulating strings in every possible codepage in existence. The general paradigm is to convert a string to a single encoding that is used internally, perform analysis/processing using this encoding, and then convert it back. In fact, that is what many runtime library routines also do, because most OS library functions only support a very select number of codepages (or the OS library functions do it themselves internally).
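A minimal sketch of that convert/process/convert-back paradigm could look as follows. It assumes the input is Windows-1252 and that unicodestring is the internal encoding; the sample bytes and the variable names are made up for the example. On Unix a widestring manager such as cwstring is needed for the conversion to work.

```pascal
program ConvertProcessConvert;
{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring;  // installs a widestring manager so code page conversions work
{$endif}

var
  Raw: RawByteString;
  Internal: UnicodeString;
  Output: UTF8String;
begin
  // Hypothetical input: the bytes 'c', 'a', 'f', $E9 as they would arrive
  // from a file that the user declared to be Windows-1252.
  SetLength(Raw, 4);
  Raw[1] := 'c';
  Raw[2] := 'a';
  Raw[3] := 'f';
  Raw[4] := #$E9;
  // Label the bytes as code page 1252 without converting them.
  SetCodePage(Raw, 1252, False);

  // Step 1: convert to the single internal encoding (UTF-16 here).
  Internal := UnicodeString(Raw);

  // Step 2: process in the internal encoding (here: just count code units).
  WriteLn('UTF-16 code units: ', Length(Internal));

  // Step 3: convert to the desired output encoding (UTF-8 here).
  Output := UTF8Encode(Internal);
  WriteLn('UTF-8 bytes: ', Length(Output));
end.
```

The byte count grows in the last step because the character stored as $E9 in Windows-1252 needs two bytes in UTF-8.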

If you don't care about the codepage and won't perform any processing that depends on the code page, then codepage-aware strings are probably the wrong data structure. Arrays may be more appropriate.
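For instance, a sketch that keeps the file contents as plain bytes might look like this. It assumes the Classes and SysUtils units are available; LoadBytes is a hypothetical helper written for this example, not an RTL routine.

```pascal
program LoadAsBytes;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: reads a whole file into a byte array.
// No code page is attached, so no conversion can ever happen.
function LoadBytes(const FileName: string): TBytes;
var
  Stream: TFileStream;
begin
  Result := nil;
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, Stream.Size);
    if Length(Result) > 0 then
      Stream.ReadBuffer(Result[0], Length(Result));
  finally
    Stream.Free;
  end;
end;

var
  Data: TBytes;
begin
  if ParamCount < 1 then
  begin
    WriteLn('usage: LoadAsBytes <file>');
    Halt(1);
  end;
  Data := LoadBytes(ParamStr(1));
  WriteLn('read ', Length(Data), ' bytes');
end.
```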

Alternatively, you can set the codepage of your text file to DefaultSystemCodePage and read a regular ansistring from it. You can still force the code page of the string you read to something else afterwards, using SetCodePage() with its Convert parameter set to False, if you wish to use the equivalent of an explicit typecast at the string codepage level. But indeed, this is not a workflow that the codepage-aware strings support without extra work on your part, and as explained above, I don't think it was ever intended to be either.
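A minimal sketch of that workaround, assuming FPC 3.0 or later: the file name, the sample bytes and the user-chosen code page (1252) are made up for the example, and the program writes its own small sample file so that it can be run as-is.

```pascal
program RelabelAfterRead;
{$mode objfpc}{$H+}

const
  SampleName = 'sample.txt';  // hypothetical file name
  UserCodePage = 1252;        // hypothetical code page chosen by the user
  // 'c', 'a', 'f', $E9, line feed
  SampleData: array[0..4] of Byte = ($63, $61, $66, $E9, $0A);

var
  B: file;
  F: Text;
  Line: AnsiString;
begin
  // Create a small sample file containing raw bytes.
  Assign(B, SampleName);
  Rewrite(B, 1);
  BlockWrite(B, SampleData, SizeOf(SampleData));
  Close(B);

  // Read it as if it were in DefaultSystemCodePage, so that the RTL
  // does not convert the bytes on input.
  Assign(F, SampleName);
  Reset(F);
  SetTextCodePage(F, DefaultSystemCodePage);
  ReadLn(F, Line);
  Close(F);

  // Relabel the string: Convert = False changes only the code page tag,
  // the bytes themselves stay untouched.
  SetCodePage(RawByteString(Line), UserCodePage, False);

  WriteLn('code page: ', StringCodePage(Line), ', bytes: ', Length(Line));
end.
```

After the relabelling, any later assignment to a string with a different code page converts from the relabelled code page, so this step has to happen before the string is passed on.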

>> TL;DR: "AnsiString"/"String" is a type that has the code page that was
>> determined at startup, not one that turns itself into whatever code page
>> gets thrown at it
> Actually, there is a String type that is just that (at least according to the
> wiki): RawByteString. Supposedly, it just accepts any dynamic codepage without
> conversion. But it doesn't work for either of the cases here?

RawByteString is largely undocumented by Embarcadero. I tried my best to make its behaviour as compatible as possible with Delphi, but there are still bugs in it (and holes in my knowledge about how exactly it is supposed to behave in all possible situations).


Jonas
_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
