Re: [fpc-pascal] Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Tomas Hajny Fri, 06 Sep 2019 06:23:40 -0700

On 2019-09-06 07:24, LacaK wrote:

From user POV we have this situation:
- on one side there is input text file encoded UTF-16 (either LE or BE)
- on other side there is FPC, where RTL procedures like AssignFile,
SetTextCodePage, Reset, Read(Ln), Write(Ln) are available.


My original intention was simply to call the existing procedure
SetTextCodePage with the parameter CP_UTF16, which in my opinion would
simply signal to the RTL that the input/output text file is/should be
encoded using UTF-16.

Yes, I believe that extending SetTextCodePage to support UTF-16 makes sense (with certain caveats, such as that it should be called before Rewrite when creating new files, or otherwise the BOM will not be added to the beginning of the file). The other question is what needs to happen within the text file record - as mentioned in my other post, I'd prefer adding a new field specifying the codepoint size rather than having to check for specific codepage values in all code branches which would need to be created for handling the difference.
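
To illustrate the BOM caveat, here is a minimal sketch of what has to be done by hand as long as the RTL text file routines do not write UTF-16 themselves. It writes the BOM followed by the raw UTF-16 code units through a TFileStream. The procedure name WriteUtf16LE and the file name are made up for this example, and the code assumes a little-endian target.

```pascal
program WriteUtf16Demo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper, not part of the RTL.
// Writes S to AFileName as UTF-16LE, preceded by the BOM (bytes FF FE).
// Assumption: little-endian target, so the code units of a UnicodeString
// are already stored in UTF-16LE byte order.
procedure WriteUtf16LE(const AFileName: string; const S: UnicodeString);
const
  BOM: Word = $FEFF; // stored as FF FE on a little-endian target
var
  FS: TFileStream;
begin
  FS := TFileStream.Create(AFileName, fmCreate);
  try
    // The BOM has to go first, before any text is written.
    FS.WriteBuffer(BOM, SizeOf(BOM));
    if Length(S) > 0 then
      FS.WriteBuffer(S[1], Length(S) * SizeOf(WideChar));
  finally
    FS.Free;
  end;
end;

begin
  WriteUtf16LE('demo-utf16.txt', 'First line'#13#10'Second line'#13#10);
end.
```

This also shows why the order matters for a SetTextCodePage-based solution: once the first characters have been written, it is too late to put a BOM in front of them.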

Moreover, the case of opening a file is somewhat trickier, because the file may have the encoding specified within the file itself. Would we add code for reading the first bytes every time Reset is called for a text file not associated with another device (console) and set the fields in the text file record (possibly overriding an explicit setting from SetTextCodePage)? Personally, I'd do so, but others may have a different opinion.
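
A rough sketch of such a check on the first bytes, written as ordinary user code rather than as RTL code. The type TBomKind and the function DetectBom are hypothetical names used only for this illustration; nothing like them exists in the RTL as far as this sketch is concerned.

```pascal
program DetectBomDemo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

type
  // Hypothetical result type, not part of the RTL.
  TBomKind = (bkNone, bkUtf8, bkUtf16LE, bkUtf16BE, bkUtf32LE, bkUtf32BE);

// Inspects up to the first four bytes of the file and reports which BOM,
// if any, they form. UTF-32LE is tested before UTF-16LE because its BOM
// (FF FE 00 00) starts with the UTF-16LE BOM (FF FE).
function DetectBom(const AFileName: string): TBomKind;
var
  FS: TFileStream;
  B: array[0..3] of Byte;
  N: LongInt;
begin
  Result := bkNone;
  FillChar(B, SizeOf(B), 0);
  FS := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    N := FS.Read(B, SizeOf(B)); // may return fewer than 4 bytes
  finally
    FS.Free;
  end;
  if (N >= 4) and (B[0] = $FF) and (B[1] = $FE) and (B[2] = 0) and (B[3] = 0) then
    Result := bkUtf32LE
  else if (N >= 4) and (B[0] = 0) and (B[1] = 0) and (B[2] = $FE) and (B[3] = $FF) then
    Result := bkUtf32BE
  else if (N >= 3) and (B[0] = $EF) and (B[1] = $BB) and (B[2] = $BF) then
    Result := bkUtf8
  else if (N >= 2) and (B[0] = $FF) and (B[1] = $FE) then
    Result := bkUtf16LE
  else if (N >= 2) and (B[0] = $FE) and (B[1] = $FF) then
    Result := bkUtf16BE;
end;

begin
  if ParamCount >= 1 then
    WriteLn(DetectBom(ParamStr(1)));
end.
```

Note the inherent ambiguity: a UTF-16LE file whose first character after the BOM is U+0000 starts with the same four bytes as a UTF-32LE BOM, which is one reason why an explicit setting might still need to take precedence in some cases.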

Then any subsequent call to ReadLn with any destination variable
(ansistring, unicodestring, integer, etc.) will simply do something
like:
- read a byte sequence from the file and interpret it as UTF-16, so
that the input is available as a UnicodeString

Just a comment - if already adding this support, we should IMHO allow UTF-32 as well.

- this UnicodeString will then be converted to the requested
destination variable (as FPC has implicit conversions between
UnicodeString and AnsiString, this would be no problem)


Yes.
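
For illustration, the two steps described above can already be done by hand in user code today. The following is only a sketch under stated assumptions: LoadUtf16File is a hypothetical helper, the target is assumed to be little-endian, a file without a BOM is assumed to be UTF-16LE, and the whole file is loaded into memory at once.

```pascal
program ReadUtf16Demo;
{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}cwstring,{$ENDIF} // widestring manager for conversions on Unix
  Classes, SysUtils;

// Hypothetical helper: loads a whole UTF-16 file into a UnicodeString.
// A leading BOM selects the byte order and is not returned.
function LoadUtf16File(const AFileName: string): UnicodeString;
var
  FS: TFileStream;
  Units: array of Word;
  Count, First, I: SizeInt;
  BigEndian: Boolean;
begin
  Result := '';
  FS := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    Count := SizeInt(FS.Size div 2); // an odd trailing byte is ignored
    SetLength(Units, Count);
    if Count > 0 then
      FS.ReadBuffer(Units[0], Count * 2);
  finally
    FS.Free;
  end;
  // The BE BOM (bytes FE FF) read as a little-endian word gives $FFFE.
  BigEndian := (Count > 0) and (Units[0] = $FFFE);
  First := 0;
  if (Count > 0) and ((Units[0] = $FEFF) or (Units[0] = $FFFE)) then
    First := 1; // skip the BOM
  SetLength(Result, Count - First);
  for I := First to Count - 1 do
    if BigEndian then
      Result[I - First + 1] := WideChar(SwapEndian(Units[I]))
    else
      Result[I - First + 1] := WideChar(Units[I]);
end;

var
  Content, Line, LF: UnicodeString;
  A: AnsiString;
  P: SizeInt;
begin
  if ParamCount < 1 then
    Halt(1);
  LF := #10;
  Content := LoadUtf16File(ParamStr(1));
  while Content <> '' do
  begin
    // Step 1: cut the next line out of the UTF-16 data.
    P := Pos(LF, Content);
    if P = 0 then
      P := Length(Content) + 1;
    Line := Copy(Content, 1, P - 1);
    Delete(Content, 1, P);
    if (Line <> '') and (Line[Length(Line)] = #13) then
      SetLength(Line, Length(Line) - 1); // strip CR of a CR+LF pair
    // Step 2: conversion to the destination type (lossy if the target
    // code page cannot represent a character).
    A := AnsiString(Line);
    WriteLn(A);
  end;
end.
```

The repeated Delete makes this quadratic for large files; it is kept that way only to keep the sketch short.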

(for Write(Ln) the same will happen, only in reverse order: source
variable -> UnicodeString -> write to file)

If SetTextCodePage(CP_UTF16) is not appropriate, then we must IMO
introduce a new procedure which will give the user the possibility to
signal that "I have a UTF-16 encoded text file" or "I want all writes
to my text file to be encoded as UTF-16".
(but personally I do not see a reason to introduce a new procedure, as
SetTextCodePage fits perfectly for me)

See above - a new procedure may not be needed, but I'd prefer a new text file record field in the background for better efficiency and maintainability.
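
A purely illustrative sketch of the idea of such a field. The record TTextEncodingInfo, its field CodeUnitSize, the constants and both routines are hypothetical; they are not the real text file record and not a proposal for the final layout. The point is only that the code page is mapped to a code unit size once, and later code branches on that size instead of comparing against individual code page values.

```pascal
program CodeUnitSizeSketch;
{$mode objfpc}{$H+}

const
  // Code page identifiers, declared locally for this sketch.
  SketchCpUtf16LE = 1200;
  SketchCpUtf16BE = 1201;
  SketchCpUtf32LE = 12000;
  SketchCpUtf32BE = 12001;
  SketchCpUtf8    = 65001;

type
  // Hypothetical stand-in for the relevant part of a text file record.
  TTextEncodingInfo = record
    CodePage: Word;
    CodeUnitSize: Byte; // 1, 2 or 4 bytes per code unit
  end;

// The only place that needs to know the individual code page values.
procedure SetEncoding(out Info: TTextEncodingInfo; ACodePage: Word);
begin
  Info.CodePage := ACodePage;
  case ACodePage of
    SketchCpUtf16LE, SketchCpUtf16BE: Info.CodeUnitSize := 2;
    SketchCpUtf32LE, SketchCpUtf32BE: Info.CodeUnitSize := 4;
  else
    Info.CodeUnitSize := 1;
  end;
end;

// Example of a later code branch: it depends on the size only.
function CodeUnitsInBuffer(const Info: TTextEncodingInfo;
  ByteCount: SizeInt): SizeInt;
begin
  Result := ByteCount div Info.CodeUnitSize;
end;

var
  Info: TTextEncodingInfo;
begin
  SetEncoding(Info, SketchCpUtf16LE);
  WriteLn(CodeUnitsInBuffer(Info, 256)); // 256 bytes hold 128 UTF-16 code units
  SetEncoding(Info, SketchCpUtf8);
  WriteLn(CodeUnitsInBuffer(Info, 256)); // 256 bytes hold 256 UTF-8 code units
end.
```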

So firstly we need a design/proposal which is/will be accepted.
(deeper knowledge of RTL internals is probably needed here, which is
the reason why other core developers should also step in)

Right. See my input above for my current thoughts. In the end, we should preferably extend the FPC Unicode handling page in the Wiki; in the meantime, a new page may be used for documenting the specification. Before doing that, I'd still want to hear the opinion from Jonas, Marco and Michael - I'll ask them.


Tomas
_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
