[PATCH] D93031: Enable fexec-charset option

Abhina Sree via Phabricator via cfe-commits Mon, 21 Dec 2020 08:25:50 -0800

abhina.sreeskantharajan marked 11 inline comments as done.
abhina.sreeskantharajan added inline comments.



================
Comment at: clang/include/clang/Driver/Options.td:3583-3584
 
+def fexec_charset : Separate<["-"], "fexec-charset">, 
MetaVarName<"<codepage>">,
+  HelpText<"Set the execution <codepage> for string and character literals">;
 def target_cpu : Separate<["-"], "target-cpu">,
----------------
tahonermann wrote:
> abhina.sreeskantharajan wrote:
> > tahonermann wrote:
> > > How about substituting "character set", "character encoding", or 
> > > "charset" for "codepage"?
> > > 
> > > This doesn't state what names are recognized.  The ones provided by the 
> > > system iconv() implementation (as is the case for gcc)?  Or all names and 
> > > aliases specified by the [[ 
> > > https://www.iana.org/assignments/character-sets/character-sets.xhtml | 
> > > IANA character set registry ]]?
> > > 
> > > The set of recognized names can be a superset of the names that are 
> > > actually supported.
> > I've updated the description from codepage to charset.
> > 
> > It's hard to specify what charsets are supported because iconv library 
> > differs between targets, so the list will not be the same on every platform.
> Being dependent on the host iconv library seems fine by me; that is the case 
> for gcc today.  I suggest making that explicit here:
>   def fexec_charset : Separate<["-"], "fexec-charset">, 
> MetaVarName<"<charset>">,
>     HelpText<"Set the execution <charset> for string and character literals.  
> Supported character encodings include XXX and those supported by the host 
> iconv library.">;
I've updated the HelpText with your suggested description.


================
Comment at: clang/include/clang/Lex/LiteralSupport.h:192
+                    ConversionState translationState = TranslateToExecCharset);
+  ConversionState TranslationState;
 
----------------
tahonermann wrote:
> Does the conversion state need to be persisted as a data member?  The literal 
> is consumed in the constructor.
Thanks, I've removed this.


================
Comment at: clang/include/clang/Lex/LiteralSupport.h:244
   bool Pascal;
+  ConversionState TranslationState;
+
----------------
tahonermann wrote:
> Same concern here with respect to persisting the conversion state as a data 
> member.
If this member is removed in StringLiteralParser, we will need to pass the 
State to multiple functions in StringLiteralParser like init(). Would this 
solution be preferable to keeping a data member?


================
Comment at: clang/lib/Driver/ToolChains/Clang.cpp:5969-5977
+  // Pass all -fexec-charset options to cc1.
+  std::vector<std::string> vList =
+      Args.getAllArgValues(options::OPT_fexec_charset_EQ);
+  // Set the default fexec-charset as the system charset.
+  CmdArgs.push_back("-fexec-charset");
+  CmdArgs.push_back(Args.MakeArgString(Triple.getSystemCharset()));
+  for (auto it = vList.begin(), ie = vList.end(); it != ie; ++it) {
----------------
tahonermann wrote:
> abhina.sreeskantharajan wrote:
> > tahonermann wrote:
> > > I think it would be preferable to diagnose an unrecognized character 
> > > encoding name here if possible.  The current changes will result in an 
> > > unrecognized name (as opposed to one that is unsupported for the target) 
> > > being diagnosed for each compiler instance.
> > Since we do not know what charsets are supported by the iconv library on 
> > the target platform, we don't know what charsets are actually invalid until 
> > we try creating a CharSetConverter.
> Understood, but what would be the harm in performing a lookup (constructing a 
> `CharSetConverter`) here?
I initially thought it will be a performance issue if we are creating the 
Converter twice, once here and once in the Preprocessor. But I do think its a 
good idea to diagnose this early. I've modified the code to diagnose and error 
here.


================
Comment at: clang/lib/Frontend/CompilerInvocation.cpp:3573
+    StringRef Value = ExecCharset->getValue();
+    Opts.ExecCharset = (std::string)Value;
+  }
----------------
tahonermann wrote:
> abhina.sreeskantharajan wrote:
> > tahonermann wrote:
> > > I wouldn't expect the cast to `std::string` to be needed here.
> > Without that cast, I get the following build error:
> > ```
> >  error: no viable overloaded '='
> > ```
> Ok, rather than a cast, I suggest:
>   Opts.ExecCharset = Value.str();
Thanks, I've applied this change.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:231-239
+  if (Translate && Converter) {
+    // ResultChar is either UTF-8 or ASCII literal and can only be converted
+    // to EBCDIC on z/OS if the character can be represented in one byte.
+    if (ResultChar < 0x100) {
+      SmallString<8> ResultCharConv;
+      Converter->convert(StringRef((char *)&ResultChar), ResultCharConv);
+      void *Pointer = &ResultChar;
----------------
tahonermann wrote:
> rsmith wrote:
> > Is it correct, in general, to do character-at-a-time translation here, when 
> > processing a string literal? I would expect there to be some (stateful) 
> > target character sets where that's not correct.
> For stateful encodings, I can imagine that state would have to be 
> transitioned to the initial state before translating the escape sequence.  I 
> suspect support for stateful encodings is not a goal at this time.
Right, stateful encodings may be a problem we will need to revisit later as 
well.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:234
+    // to EBCDIC on z/OS if the character can be represented in one byte.
+    if (ResultChar < 0x100) {
+      SmallString<8> ResultCharConv;
----------------
tahonermann wrote:
> What should happen if `ResultChar` >= 0x100?  IBM-1047 does have 
> representation for other UTF-8 characters.  Regardless, it seems `ResultChar` 
> should be converted to something.
This is no longer valid, thanks for catching that. We were initially 
translating to ASCII instead of UTF-8 so we needed to guard against larger 
characters. I've removed this guard since the internal charset is UTF-8.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:236
+      SmallString<8> ResultCharConv;
+      Converter->convert(StringRef((char *)&ResultChar), ResultCharConv);
+      void *Pointer = &ResultChar;
----------------
rsmith wrote:
> Reinterpreting an `unsigned*` as a `char*` like this is not correct on 
> big-endian, and is way too "clever" on little-endian. Please create an actual 
> `char` object to hold the value and pass that in instead.
Thanks, I've created a char instead.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:238
+      void *Pointer = &ResultChar;
+      memcpy(Pointer, ResultCharConv.data(), sizeof(unsigned));
+    }
----------------
rsmith wrote:
> What should happen if the result doesn't fit into an `unsigned`? This also 
> appears to be making problematic assumptions about the endianness of the 
> host. If we really want to pack multiple bytes of encoded output into a 
> single `unsigned` result value (which itself seems dubious), we should do so 
> with an endianness that doesn't depend on the host.
This may be a problem we need to revisit since ResultChar is expecting a char.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1742-1745
           EncodeUCNEscape(ThisTokBegin, ThisTokBuf, ThisTokEnd,
                           ResultPtr, hadError,
                           FullSourceLoc(StringToks[i].getLocation(), SM),
                           CharByteWidth, Diags, Features);
----------------
tahonermann wrote:
> UCNs will require conversion here.
I've added code to translate UCN characters and have updated the testcase as 
well.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93031/new/

https://reviews.llvm.org/D93031

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D93031: Enable fexec-charset option

Reply via email to