On 2019-09-06 07:24, LacaK wrote:
From the user's POV we have this situation:
- on one side there is an input text file encoded in UTF-16 (either LE or BE)
- on the other side there is FPC, where RTL procedures like AssignFile,
SetTextCodePage, Reset, Read(Ln), Write(Ln) are available.

My original intention was simply to call the existing procedure
SetTextCodePage with the parameter CP_UTF16, which in my opinion would
simply signal to the RTL that the input/output text file is/should be
encoded using UTF-16.

Yes, I believe that extending SetTextCodePage to support UTF-16 makes sense, with certain caveats: for example, when creating a new file it would have to be called before Rewrite, otherwise the BOM would not be added to the beginning of the file. The other question is what needs to happen within the text file record. As mentioned in my other post, I'd prefer adding a new field specifying the code unit size rather than having to check for specific codepage values in all the code branches which would need to be created for handling the difference.

Moreover, the case of opening a file is somewhat trickier, because the file may have the encoding specified within the file itself. Would we add code for reading the first bytes every time Reset is called for a text file not associated with another device (console) and set the fields in the text file record (possibly overriding an explicit setting from SetTextCodePage)? Personally, I'd do so, but others may have a different opinion.
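
To make the idea of reading the first bytes on Reset concrete, here is a minimal sketch of BOM detection. It is written in Python purely as an illustration and is not RTL code; the function name `sniff_bom`, the returned tuple and the UTF-8 default are assumptions made for this example. The code unit size it returns corresponds to the kind of value the proposed new text file record field could hold.

```python
import codecs

# (BOM bytes, encoding name, code unit size in bytes).
# UTF-32 is checked first because the UTF-32 LE BOM (FF FE 00 00)
# starts with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le", 4),
    (codecs.BOM_UTF32_BE, "utf-32-be", 4),
    (codecs.BOM_UTF8, "utf-8", 1),
    (codecs.BOM_UTF16_LE, "utf-16-le", 2),
    (codecs.BOM_UTF16_BE, "utf-16-be", 2),
]


def sniff_bom(head, default=("utf-8", 1)):
    """Return (encoding, code_unit_size, bom_length) for the first bytes of a file.

    Hypothetical helper: if no BOM is found, the caller-supplied default
    (standing in for an explicit SetTextCodePage setting) is returned.
    """
    for bom, name, size in _BOMS:
        if head.startswith(bom):
            return name, size, len(bom)
    return default[0], default[1], 0
```

Note one ambiguity in this approach: a UTF-16 LE file whose first character after the BOM is U+0000 would be detected as UTF-32 LE, which is one reason why overriding an explicit setting may be debatable.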


Then any subsequent call to ReadLn with any destination variable
(ansistring, unicodestring, integer, etc.) will simply do something
like:
- read a byte sequence from the file and interpret it as UTF-16, so
that we have a UnicodeString as input

Just a comment - if we are adding this support anyway, we should IMHO allow UTF-32 as well.


- this UnicodeString will then be converted to the type of the requested
destination variable (FPC has implicit conversions between
UnicodeString and AnsiString, so this would be no problem)

Yes.
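
The two read steps described above (decode the bytes into a Unicode string, then convert to the destination type) can be sketched as follows. This is Python used only as an illustration, not RTL code; `read_line_as`, its parameters and the cp1252 codepage standing in for an AnsiString codepage are assumptions of this example.

```python
def read_line_as(raw, encoding, target, ansi_codepage="cp1252"):
    """Decode one line of raw file bytes and convert it to the target type.

    raw           -- the bytes of one line, including its line ending
    encoding      -- the file encoding, e.g. "utf-16-le" or "utf-32-be"
    target        -- str (like UnicodeString), bytes (like AnsiString),
                     or a callable such as int or float
    ansi_codepage -- hypothetical stand-in for the AnsiString codepage
    """
    # Step 1: bytes -> Unicode string (the intermediate UnicodeString).
    text = raw.decode(encoding).rstrip("\r\n")
    # Step 2: Unicode string -> requested destination type.
    if target is str:
        return text
    if target is bytes:
        return text.encode(ansi_codepage, errors="replace")
    return target(text)
```

For example, `read_line_as("42\r\n".encode("utf-16-le"), "utf-16-le", int)` decodes the four code units and then parses the resulting text as an integer.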


(for Write(Ln) the same will happen, only in reverse order: source
variable -> UnicodeString -> write to file)
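
The write direction, including the point that the BOM can only be emitted when the file is created, can be sketched like this. Again this is Python for illustration only, not RTL code; `rewrite` and `write_line` are hypothetical stand-ins for Rewrite and WriteLn, and the CRLF line ending is an arbitrary choice for the example.

```python
import codecs
import io

# BOMs for the encodings discussed here; other encodings get no BOM.
_BOM_FOR = {
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}


def rewrite(stream, encoding):
    """Hypothetical analogue of Rewrite: emit the BOM once, at file creation.

    The encoding must already be known at this point, which mirrors the
    caveat that the codepage has to be set before Rewrite.
    """
    stream.write(_BOM_FOR.get(encoding, b""))


def write_line(stream, value, encoding):
    """Source variable -> Unicode string -> encoded bytes written to the file."""
    text = str(value) + "\r\n"
    stream.write(text.encode(encoding))


buf = io.BytesIO()
rewrite(buf, "utf-16-le")
write_line(buf, 42, "utf-16-le")
```

After these calls `buf` holds the UTF-16 LE BOM followed by the code units for "42" and the line ending.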

If SetTextCodePage(CP_UTF16) is not appropriate, then we must IMO
introduce a new procedure which lets the user signal
"I have a UTF-16 encoded text file" or "I want all writes to
my text file to be encoded in UTF-16".
(but personally I do not see a reason to introduce a new procedure, as
SetTextCodePage fits perfectly for me)

See above - a new procedure may not be needed, but I'd prefer a new text file record field in the background for better efficiency and maintainability.


So firstly we need a design/proposal which is/will be accepted.
(deeper knowledge of RTL internals is probably needed here, which is
why other core developers should also step in)

Right. See my input above for my current thoughts. In the end, we should preferably extend the FPC Unicode handling page in the Wiki; in the meantime, a new page may be used for documenting the specification. Before doing that, I'd still want to hear the opinion from Jonas, Marco and Michael - I'll ask them.

Tomas