abhina.sreeskantharajan marked 11 inline comments as done. abhina.sreeskantharajan added inline comments.
================ Comment at: clang/include/clang/Driver/Options.td:3583-3584 +def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<codepage>">, + HelpText<"Set the execution <codepage> for string and character literals">; def target_cpu : Separate<["-"], "target-cpu">, ---------------- tahonermann wrote: > abhina.sreeskantharajan wrote: > > tahonermann wrote: > > > How about substituting "character set", "character encoding", or > > > "charset" for "codepage"? > > > > > > This doesn't state what names are recognized. The ones provided by the > > > system iconv() implementation (as is the case for gcc)? Or all names and > > > aliases specified by the [[ > > > https://www.iana.org/assignments/character-sets/character-sets.xhtml | > > > IANA character set registry ]]? > > > > > > The set of recognized names can be a superset of the names that are > > > actually supported. > > I've updated the description from codepage to charset. > > > > It's hard to specify what charsets are supported because iconv library > > differs between targets, so the list will not be the same on every platform. > Being dependent on the host iconv library seems fine by me; that is the case > for gcc today. I suggest making that explicit here: > def fexec_charset : Separate<["-"], "fexec-charset">, > MetaVarName<"<charset>">, > HelpText<"Set the execution <charset> for string and character literals. > Supported character encodings include XXX and those supported by the host > iconv library.">; I've updated the HelpText with your suggested description. ================ Comment at: clang/include/clang/Lex/LiteralSupport.h:192 + ConversionState translationState = TranslateToExecCharset); + ConversionState TranslationState; ---------------- tahonermann wrote: > Does the conversion state need to be persisted as a data member? The literal > is consumed in the constructor. Thanks, I've removed this. ================ Comment at: clang/include/clang/Lex/LiteralSupport.h:244 bool Pascal; + ConversionState TranslationState; + ---------------- tahonermann wrote: > Same concern here with respect to persisting the conversion state as a data > member. If this member is removed in StringLiteralParser, we will need to pass the State to multiple functions in StringLiteralParser like init(). Would this solution be preferable to keeping a data member? ================ Comment at: clang/lib/Driver/ToolChains/Clang.cpp:5969-5977 + // Pass all -fexec-charset options to cc1. + std::vector<std::string> vList = + Args.getAllArgValues(options::OPT_fexec_charset_EQ); + // Set the default fexec-charset as the system charset. + CmdArgs.push_back("-fexec-charset"); + CmdArgs.push_back(Args.MakeArgString(Triple.getSystemCharset())); + for (auto it = vList.begin(), ie = vList.end(); it != ie; ++it) { ---------------- tahonermann wrote: > abhina.sreeskantharajan wrote: > > tahonermann wrote: > > > I think it would be preferable to diagnose an unrecognized character > > > encoding name here if possible. The current changes will result in an > > > unrecognized name (as opposed to one that is unsupported for the target) > > > being diagnosed for each compiler instance. > > Since we do not know what charsets are supported by the iconv library on > > the target platform, we don't know what charsets are actually invalid until > > we try creating a CharSetConverter. > Understood, but what would be the harm in performing a lookup (constructing a > `CharSetConverter`) here? I initially thought it will be a performance issue if we are creating the Converter twice, once here and once in the Preprocessor. But I do think its a good idea to diagnose this early. I've modified the code to diagnose and error here. ================ Comment at: clang/lib/Frontend/CompilerInvocation.cpp:3573 + StringRef Value = ExecCharset->getValue(); + Opts.ExecCharset = (std::string)Value; + } ---------------- tahonermann wrote: > abhina.sreeskantharajan wrote: > > tahonermann wrote: > > > I wouldn't expect the cast to `std::string` to be needed here. > > Without that cast, I get the following build error: > > ``` > > error: no viable overloaded '=' > > ``` > Ok, rather than a cast, I suggest: > Opts.ExecCharset = Value.str(); Thanks, I've applied this change. ================ Comment at: clang/lib/Lex/LiteralSupport.cpp:231-239 + if (Translate && Converter) { + // ResultChar is either UTF-8 or ASCII literal and can only be converted + // to EBCDIC on z/OS if the character can be represented in one byte. + if (ResultChar < 0x100) { + SmallString<8> ResultCharConv; + Converter->convert(StringRef((char *)&ResultChar), ResultCharConv); + void *Pointer = &ResultChar; ---------------- tahonermann wrote: > rsmith wrote: > > Is it correct, in general, to do character-at-a-time translation here, when > > processing a string literal? I would expect there to be some (stateful) > > target character sets where that's not correct. > For stateful encodings, I can imagine that state would have to be > transitioned to the initial state before translating the escape sequence. I > suspect support for stateful encodings is not a goal at this time. Right, stateful encodings may be a problem we will need to revisit later as well. ================ Comment at: clang/lib/Lex/LiteralSupport.cpp:234 + // to EBCDIC on z/OS if the character can be represented in one byte. + if (ResultChar < 0x100) { + SmallString<8> ResultCharConv; ---------------- tahonermann wrote: > What should happen if `ResultChar` >= 0x100? IBM-1047 does have > representation for other UTF-8 characters. Regardless, it seems `ResultChar` > should be converted to something. This is no longer valid, thanks for catching that. We were initially translating to ASCII instead of UTF-8 so we needed to guard against larger characters. I've removed this guard since the internal charset is UTF-8. ================ Comment at: clang/lib/Lex/LiteralSupport.cpp:236 + SmallString<8> ResultCharConv; + Converter->convert(StringRef((char *)&ResultChar), ResultCharConv); + void *Pointer = &ResultChar; ---------------- rsmith wrote: > Reinterpreting an `unsigned*` as a `char*` like this is not correct on > big-endian, and is way too "clever" on little-endian. Please create an actual > `char` object to hold the value and pass that in instead. Thanks, I've created a char instead. ================ Comment at: clang/lib/Lex/LiteralSupport.cpp:238 + void *Pointer = &ResultChar; + memcpy(Pointer, ResultCharConv.data(), sizeof(unsigned)); + } ---------------- rsmith wrote: > What should happen if the result doesn't fit into an `unsigned`? This also > appears to be making problematic assumptions about the endianness of the > host. If we really want to pack multiple bytes of encoded output into a > single `unsigned` result value (which itself seems dubious), we should do so > with an endianness that doesn't depend on the host. This may be a problem we need to revisit since ResultChar is expecting a char. ================ Comment at: clang/lib/Lex/LiteralSupport.cpp:1742-1745 EncodeUCNEscape(ThisTokBegin, ThisTokBuf, ThisTokEnd, ResultPtr, hadError, FullSourceLoc(StringToks[i].getLocation(), SM), CharByteWidth, Diags, Features); ---------------- tahonermann wrote: > UCNs will require conversion here. I've added code to translate UCN characters and have updated the testcase as well. Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D93031/new/ https://reviews.llvm.org/D93031 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits