On Tue, May 18, 2021 at 5:31 AM Iain Buclaw <ibuc...@gdcproject.org> wrote:
>
> Excerpts from Lewis Hyatt via Gcc-patches's message of January 29, 2021 4:46 pm:
> > Q1: What is the input charset?
> > A1:
> >
> >     libcpp: Whatever was passed to -finput-charset (note that
> >     -finput-charset is not supported for Fortran, though).
> >
> >     go: Assume UTF-8.
> >
> >     D: UTF-8, UTF-16, or UTF-32 (the latter two in either
> >        endianness); determined by inspecting the first bytes of the file.
> >
> > Q2: How should a UTF-8 BOM be handled?
> > A2:
> >
> >     libcpp: Treat the file entirely as if the BOM were not present at
> >     all. So a diagnostic referring to the first non-BOM character in
> >     the file will point to column 1, not to column 4.
> >
> >     go: Treat it like whitespace, ignored for parsing purposes, but still
> >     logically part of the file. A diagnostic referring to the first non-BOM
> >     character in the file will point to column 4.
> >
> >     D: Same as libcpp.
> >
> > So input.c's current behavior (with or without my patch) actually matches
> > the "go" frontend exactly and does the right thing for it. As you
> > thought, though, for D it would be desirable to skip over the UTF-8 BOM
> > too.
> >
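A side note to make "skipping the BOM" concrete: the UTF-8 BOM is the three
bytes EF BB BF, so skipping it simply means dropping those bytes before the
text reaches the diagnostics machinery, which is why the first real character
then lands on column 1 rather than column 4. A minimal sketch of the idea
(hypothetical helper, not code from libcpp or from this patch):

    #include <stddef.h>

    /* Illustration only: return the number of leading bytes to drop so
       that a UTF-8 BOM does not shift diagnostic columns.  */
    static size_t
    count_utf8_bom_bytes (const unsigned char *buf, size_t len)
    {
      if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
      return 0;
    }
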
> > D adds an additional wrinkle: instead of using -finput-charset, it uses
> > its own detection logic. Based on the knowledge that the first code point
> > in the file must be ASCII, it can deduce the encoding from the first few
> > bytes of the file. This means that in D, each source file can have a
> > different encoding, which is not currently expressible in libcpp-based
> > frontends, and hence the notion of a global input_cpp_context with a
> > constant charset is not sufficient to capture this.
> >
>
> Yes, that is correct for D.
>
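To make that wrinkle concrete, the heuristic amounts to something like the
sketch below. This is hypothetical code, not the dmd implementation (the
detection in dmd/dmodule.c is the authoritative version), and the function
name and charset strings are just stand-ins: because the first code point must
be ASCII, the BOM bytes and the pattern of zero bytes at the start of the file
pin down the encoding.

    #include <stddef.h>

    /* Hypothetical sketch of D-style charset detection; not the dmd code.
       Relies on the rule that the first code point of the file is ASCII.  */
    static const char *
    guess_d_source_charset (const unsigned char *buf, size_t len)
    {
      /* BOMs first; check UTF-32 before UTF-16, since FF FE 00 00 would
         otherwise match the UTF-16LE BOM.  */
      if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE
          && buf[2] == 0 && buf[3] == 0)
        return "UTF-32LE";
      if (len >= 4 && buf[0] == 0 && buf[1] == 0
          && buf[2] == 0xFE && buf[3] == 0xFF)
        return "UTF-32BE";
      if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";
      if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
      /* No BOM: an ASCII first code point shows up as a distinctive
         pattern of zero bytes in the wider encodings.  */
      if (len >= 4 && buf[0] == 0 && buf[1] == 0 && buf[2] == 0)
        return "UTF-32BE";
      if (len >= 4 && buf[1] == 0 && buf[2] == 0 && buf[3] == 0)
        return "UTF-32LE";
      if (len >= 2 && buf[0] == 0)
        return "UTF-16BE";
      if (len >= 2 && buf[1] == 0)
        return "UTF-16LE";
      return "UTF-8";
    }

In the patch as posted, nothing like this is wired up on the D side; the
callback in the hunk quoted further down just returns NULL, i.e. the input is
assumed to be UTF-8 already.
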
> > In this iteration of the patch, I replaced the input_cpp_context with a more
> > general input_context that can handle all of these cases. The context now
> > consists of a callback function and a bool, which the frontend is supposed
> > to configure. The bool determines whether or not to skip the BOM. The
> > callback, when passed a filename, returns the input charset needed to
> > convert that file. For libcpp, the filename is not used, since the charset
> > is determined by -finput-charset for all files. For D, the callback
> > function is currently a stub, but it could be expanded to open the file,
> > read the first few bytes, and determine the charset to be used. I have not
> > implemented that for now, because it seems inelegant to duplicate the
> > detection logic D already has; that logic is part of the dmd/ tree, which
> > I think is maintained externally to GCC, so it didn't seem right to attempt
> > to refactor it. I also figured there may not be much interest in this
> > feature (diagnostics are already unusable for UTF-16 and UTF-32 sources in
> > D), so this patch makes no change there. It does fix the handling of a
> > UTF-8 BOM in D diagnostics, however.
> >
>
> And yes, dmd/ is maintained in a separate upstream repository.
>
> I wouldn't worry about UTF-16 and UTF-32.  I expect all the sources
> written in those charsets are found only in the D testsuite.
>
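For reference, the shape of the interface I described above is roughly as
follows. This is only a paraphrase for illustration -- the actual declarations
live in the input.c/input.h part of the patch (not quoted here), and the names
below are my own stand-ins, not copied from it:

    #include <stdbool.h>

    /* Paraphrased sketch of the input context described above; not the
       real declarations from the patch.  The frontend supplies a per-file
       charset callback plus a flag saying whether to skip a UTF-8 BOM.  */
    struct input_context_sketch
    {
      /* Given a filename, return the charset to convert from, or NULL if
         no conversion is needed (the file is already UTF-8).  */
      const char *(*charset_callback) (const char *filename);
      /* Whether a leading UTF-8 BOM should be skipped entirely.  */
      bool should_skip_bom;
    };

    static struct input_context_sketch the_context;

    static void
    sketch_initialize_context (const char *(*cb) (const char *),
                               bool skip_bom)
    {
      the_context.charset_callback = cb;
      the_context.should_skip_bom = skip_bom;
    }

A libcpp-based frontend would pass a callback that ignores the filename and
returns the -finput-charset value for every file; the D frontend passes the
stub callback shown in the hunk below, with BOM skipping enabled.
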
> > gcc/d/ChangeLog:
> >
> >       PR other/93067
> >       * d-lang.cc (d_input_charset_callback): New function.
> >       (d_init): Call new function input_initialize_context().
> >
>
> > diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc
> > index 0fd207da7f3..9f170c09886 100644
> > --- a/gcc/d/d-lang.cc
> > +++ b/gcc/d/d-lang.cc
> > @@ -50,6 +50,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "output.h"
> >  #include "print-tree.h"
> >  #include "debug.h"
> > +#include "input.h"
> >
> >  #include "d-tree.h"
> >  #include "id.h"
> > @@ -362,6 +363,19 @@ d_option_lang_mask (void)
> >    return CL_D;
> >  }
> >
> > +/* Implements input charset and BOM skipping configuration for
> > +   diagnostics.  */
> > +static const char *d_input_charset_callback (const char * /*filename*/)
> > +{
> > +  /* TODO: The input charset is automatically determined by code in
> > +     dmd/dmodule.c based on the contents of the file.  If this detection
> > +     logic were factored out and could be reused here, then we would
> > +     be able to return UTF-16 or UTF-32 as needed.  For now, we always
> > +     return NULL, which means no conversion is necessary, i.e. the
> > +     input is assumed to be UTF-8 when diagnostics read this file.  */
> > +  return NULL;
> > +}
> > +
>
> OK from me.
>
> I can have a look into this, but it will have to wait until after I do
> an impending switch-over of the dmd/ sources from C++ to D.  Then I'd
> have to convince upstream to be open to the change (refactoring has been
> getting a bad rep as of late for sneaking regressions into upstream D).
>
> Iain.

Thank you very much for taking a look at it. I am not sure yet whether
David likes the approach of this patch, but if he does approve and we go
this way (or something similar), I'll keep the D part as it is.
I think you must be right that UTF-16 and UTF-32 are not widely used; at
the very least, anyone trying to use them now would receive garbled
diagnostics. But with this patch, almost all the pieces needed to support
them in diagnostics are in place.

-Lewis


