Hi *,
As promised, I discussed the idea of adding support for UTF-16 encoded
text files (and preferably UTF-32 as well while at it) to the RTL with
other core team members. Overall, I didn't come across anybody opposing
the idea; the only (logical) requirement is taking care of the
performance implications of this change, i.e. avoiding a considerable
performance decrease in the processing of 8-bit encoded files (this is
actually one of the reasons for my suggestion to add codepoint size
information to the text file record and use that instead of checking
individual values of the codepage variable to determine the codepoint
size every time the file is accessed - see below).
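To illustrate what I mean (purely a sketch - the helper below and the
idea of caching its result in the text file record are hypothetical,
nothing like this exists in the RTL today):

  function CodePointSizeOf(CP: TSystemCodePage): Byte;
  begin
    case CP of
      CP_UTF16, CP_UTF16BE: CodePointSizeOf := 2;
      { UTF-32 codepage constants would have to be added as well }
    else
      CodePointSizeOf := 1;
    end;
  end;

The result would be computed once (in SetTextCodePage or Reset) and
stored in the record, so the per-character I/O code only needs a simple
"case codepoint size of" instead of repeating codepage comparisons.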
Yes, I believe that extending SetTextCodePage to support UTF-16 makes
sense (with certain caveats, e.g. that it needs to be called before
Rewrite when creating new files, otherwise the BOM will not be added to
the beginning of the file - see the sketch below). The other question
is what needs to happen within the text file record - as mentioned in
my other post, I'd prefer adding a new field specifying the codepoint
size rather than having to check for specific codepage values in all
the code branches that would need to be created to handle the
difference.
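For illustration, the intended call order would be roughly this (just a
sketch - the UTF-16 output itself is of course the proposed behaviour,
not what the current RTL does):

  program utf16out;
  var
    T: Text;
  begin
    Assign(T, 'out.txt');
    SetTextCodePage(T, CP_UTF16);  { has to come before Rewrite... }
    Rewrite(T);                    { ...so that the BOM can be written here }
    Writeln(T, 'Hello, world');
    Close(T);
  end.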
Moreover, the case of opening an existing file is somewhat trickier,
because the file may have its encoding specified within the file itself
(as a BOM). Would we add code that reads the first bytes every time
Reset is called for a text file not associated with another device
(e.g. the console) and sets the fields in the text file record
(possibly overriding an explicit setting from SetTextCodePage)?
Personally, I'd do so, but others may have a different opinion.
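Just to make the idea concrete, the check itself is simple; a
standalone sketch of what Reset (or a helper called from it) would have
to do could look like this (illustrative only, not part of the RTL - a
real implementation would of course work on the already opened buffer):

  function DetectBomCodePage(const FileName: string): TSystemCodePage;
  var
    F: file of Byte;
    Buf: array[0..2] of Byte;
    N: Integer;
  begin
    DetectBomCodePage := CP_NONE;
    Assign(F, FileName);
    Reset(F);
    N := 0;
    while (N <= High(Buf)) and not Eof(F) do
    begin
      Read(F, Buf[N]);
      Inc(N);
    end;
    Close(F);
    if (N >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF) then
      DetectBomCodePage := CP_UTF8
    else if (N >= 2) and (Buf[0] = $FF) and (Buf[1] = $FE) then
      DetectBomCodePage := CP_UTF16   { LE; UTF-32 LE adds two zero bytes }
    else if (N >= 2) and (Buf[0] = $FE) and (Buf[1] = $FF) then
      DetectBomCodePage := CP_UTF16BE;
  end;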
.
.
After the discussion with some people from the core team, I suggest
the following:
1) A new attribute for the codepoint size will be added to the text
file record, and all text file I/O needs to be checked and possibly
extended to use this attribute instead of the current implicit
expectation that the codepoint size is always 1 byte.
2) Support for UTF-16BE/LE and UTF-32BE/LE will be added to
SetTextCodePage; the new codepoint size attribute will be updated as
appropriate.
3) A new function 'DetectUtfBom (var T: text): boolean' will be added.
This function may be called after the call to 'Reset (T: text)' to
check for the existence of a BOM at the beginning of the text file. If
one is found (Result=true), SetTextCodePage is invoked automatically
from DetectUtfBom with the codepage value corresponding to the detected
BOM and encoding variant. If no BOM is found (Result=false), nothing
changes.
4) A new procedure 'SetUtfBom (var T: text; CodePage: word; BOM:
boolean)' will be added. This procedure may be called after the call to
Rewrite and allows writing a BOM to the respective text file.
SetTextCodePage will be called from SetUtfBom with the respective
value. (A short usage sketch of points 3 and 4 follows below.)
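To show how this would look in user code (again only a sketch -
DetectUtfBom and SetUtfBom are the routines proposed in points 3 and 4
above and don't exist yet):

  program proposal_sketch;
  var
    TIn, TOut: Text;
    S: AnsiString;
  begin
    Assign(TIn, 'input.txt');
    Reset(TIn);
    if not DetectUtfBom(TIn) then
      SetTextCodePage(TIn, DefaultSystemCodePage);  { no BOM found }
    Readln(TIn, S);
    Close(TIn);

    Assign(TOut, 'output.txt');
    Rewrite(TOut);
    SetUtfBom(TOut, CP_UTF16, True);  { writes the BOM, calls SetTextCodePage }
    Writeln(TOut, S);
    Close(TOut);
  end.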
Comments, anybody?
Thank you, Tomas.
My comment on the function names: use the pattern SetText.../GetText...
So, for inspiration:
- GetTextBOM or ReadBOM(var T: Text; SetCodePage: boolean = True): Word
(the SetCodePage parameter specifies whether SetTextCodePage will be
called automatically if desired). The return value will be CP_NONE (no
BOM) or CP_UTF8, CP_UTF16, ...
- SetTextBOM or WriteBOM(var T: Text; CodePage: Word)
(writes the BOM corresponding to the given CodePage and calls
SetTextCodePage). A boolean BOM parameter is IMO not needed, since
calling this function already signals that the user wants to write a
BOM; otherwise the user would call SetTextCodePage() only. A sketch of
the declarations follows below.
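Roughly the declarations I have in mind (just a sketch, names and
defaults are open, nothing of this exists yet):

  function ReadBOM(var T: Text; SetCodePage: Boolean = True): Word;
  { returns CP_NONE if no BOM is found, otherwise CP_UTF8, CP_UTF16, ... }

  procedure WriteBOM(var T: Text; CodePage: Word);
  { writes the BOM for CodePage and also calls SetTextCodePage }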
-Laco.