[Ada] Properly recognize NEL as end of line in UTF-8 encoding mode

Arnaud Charlet Mon, 21 Nov 2011 03:52:22 -0800

The NEL (NEXT LINE) control code (16#85#) is now properly recognized as
a line terminator, and hence can be used to terminate a source line when
the source program is encoded in UTF-8 mode.


Note that NEL is not recognized as a line terminator in normal 8-bit
mode. For an extensive explanation of why this is, see the section in
the Sinput spec on "Handling of Source Line Terminators".

The first test program here is to be compiled in normal mode with
no wide character switch, so it is normal Latin-1, in which the
code NEL is not recognized as ending a line:

     1. with Ada.Text_IO; use Ada.Text_IO;
     2. procedure Test_NEL is
     3. begin
     4.    -- Ends with VT so should print?   Put_Line("VT works");
     5.    -- Ends with FF so should print?   Put_Line("FF works");
     6.    -- Ends with NEL so should print?  Put_Line("NEL works");
     7. end Test_NEL;

In the above program
   ? in line 4 is the code 16#0B#, VT
   ? in line 5 is the code 16#0C#, FF
   ? in line 6 is the code 16#85#, NEL

Since NEL is not recognized as ending a line in this mode, the output
of this program is two lines:

VT works
FF works

The second program is to be compiled with the -gnatW8 mode specifying
that it is encoded in UTF-8, where the NEL code is recognized.

     1. with Ada.Text_IO; use Ada.Text_IO;
     2. procedure Test_NEL2 is
     3. begin
     4.    -- Ends with VT so should print?   Put_Line("VT works");
     5.    -- Ends with FF so should print?   Put_Line("FF works");
     6.    -- Ends with NEL so should print?? Put_Line("NEL works");
     7. end Test_NEL2;

In the above program
   ? in line 4 is the code 16#0B#, VT
   ? in line 5 is the code 16#0C#, FF
   ?? in line 6 is the sequence 16#C2# 16#85# (UTF-8 for 16#85#/NEL)

Since NEL is recognized in UTF-8 mode, the output is three lines:

VT works
FF works
NEL works

Tested on x86_64-pc-linux-gnu, committed on trunk

2011-11-21  Robert Dewar  <de...@adacore.com>

        * s-utf_32.adb (Is_UTF_32_Line_Terminator): Recognize NEL as
        line terminator.
        * sinput.ads: Add section on Handling of Source Line Terminators.
        * types.ads (Line_Terminator): Adjust comments.

Index: types.ads
===================================================================
--- types.ads   (revision 181556)
+++ types.ads   (working copy)
@@ -102,12 +102,8 @@
    --  Graphic characters, as defined in ARM
 
    subtype Line_Terminator is Character range ASCII.LF .. ASCII.CR;
-   --  Line terminator characters (LF, VT, FF, CR)
-   --
-   --  This definition is dubious now that we have two more wide character
-   --  sequences that constitute a line terminator. Every reference to this
-   --  subtype needs checking to make sure the wide character case is handled
-   --  appropriately. ???
+   --  Line terminator characters (LF, VT, FF, CR). For further details,
+   --  see the extensive discussion of line termination in the Sinput spec.
 
    subtype Upper_Half_Character is
      Character range Character'Val (16#80#) .. Character'Val (16#FF#);
Index: sinput.ads
===================================================================
--- sinput.ads  (revision 181556)
+++ sinput.ads  (working copy)
@@ -6,7 +6,7 @@
 --                                                                          --
 --                                 S p e c                                  --
 --                                                                          --
---          Copyright (C) 1992-2010, Free Software Foundation, Inc.         --
+--          Copyright (C) 1992-2011, Free Software Foundation, Inc.         --
 --                                                                          --
 -- GNAT is free software;  you can  redistribute it  and/or modify it under --
 -- terms of the  GNU General Public License as published  by the Free Soft- --
@@ -43,7 +43,7 @@
 --    described in RM 2.2 (13). Any of the characters FF, LF, CR or VT or any
 --    wide character that is a Line or Paragraph Separator acts as an end of
 --    logical line in this sense, and it is essentially irrelevant whether one
---    or more appears in sequence (since if sequence of such characters is
+--    or more appears in sequence (since if a sequence of such characters is
 --    regarded as separate ends of line, then the intervening logical lines
 --    are null in any case).
 
@@ -451,6 +451,75 @@
                            Internal_Source'Unrestricted_Access;
    --  Pointer to internal source buffer
 
+   -----------------------------------------
+   -- Handling of Source Line Terminators --
+   -----------------------------------------
+
+   --  In this section we discuss in detail the issue of terminators used to
+   --  terminate source lines. The RM says that one or more format effectors
+   --  (other than horizontal tab) end a source line, and defines the set of
+   --  such format effectors, but does not talk about exactly how they are
+   --  represented in the source program (since in general the RM is not in
+   --  the business of specifying source program formats).
+
+   --  The type Types.Line_Terminator is defined as a subtype of Character
+   --  that includes CR/LF/VT/FF. The most common line enders in practice
+   --  are CR (some MAC systems), LF (Unix systems), and CR/LF (DOS/Windows
+   --  systems). Any of these sequences is recognized as ending a physical
+   --  source line, and if multiple such terminators appear (e.g. LF/LF),
+   --  then we consider we have an extra blank line.
+
+   --  VT and FF are recognized as terminating source lines, but they are
+   --  considered to end a logical line instead of a physical line, so that
+   --  the line numbering ignores such terminators. The use of VT and FF is
+   --  mandated by the standard, and correctly handled in a conforming manner
+   --  by GNAT, but their use is not recommended.
+
+   --  In addition to the set of characters defined by the type in Types, in
+   --  wide character encoding, then the codes returning True for a call to
+   --  System.UTF_32.Is_UTF_32_Line_Terminator are also recognized as ending
+   --  a physical source line. This includes the standard codes defined above
+   --  in addition to NEL (NEXT LINE), LINE SEPARATOR and PARAGRAPH SEPARATOR.
+   --  Again, as in the case of VT and FF, the standard requires we recognize
+   --  these as line terminators, but we consider them to be logical line
+   --  terminators. The only physical line terminators recognized are the
+   --  standard ones (CR, LF, or CR/LF).
+
+   --  However, we do not recognize the NEL (16#85#) character as having the
+   --  significance of an end of line character when operating in normal 8-bit
+   --  Latin-n input mode for the compiler. Instead the rule in this mode is
+   --  that all upper half control codes (16#80# .. 16#9F#) are illegal if they
+   --  occur in program text, and are ignored if they appear in comments.
+
+   --  First, note that this behavior is fully conforming with the standard.
+   --  The standard has nothing whatever to say about source representation
+   --  and implementations are completely free to make there own rules. In
+   --  this case, in 8-bit mode, GNAT decides that the 16#0085# character is
+   --  not a representation of the NEL character, even though it looks like it.
+   --  If you have NEL's in your program, which you expect to be treated as
+   --  end of line characters, you must use a wide character encoding such as
+   --  UTF-8 for this code to be recognized.
+
+   --  Second, an explanation of why we take this slightly surprising choice.
+   --  We have never encountered anyone actually using the NEL character to
+   --  end lines. One user raised the issue as a result of some experiments,
+   --  but no one has ever submitted a program encoded this way, in any of
+   --  the possible encodings. It seems that even when using wide character
+   --  codes extensively, the normal approach is to use standard line enders
+   --  (LF or CR/LF). So the failure to recognize NEL in this mode seems to
+   --  have no practical downside.
+
+   --  Moreover, what we have seen in a significant number of programs from
+   --  multiple sources is the practice of writing all program text in lower
+   --  half (ASCII) form, but using UTF-8 encoded wide characters freely in
+   --  comments, where the comments are terminated by normal line endings
+   --  (LF or CR/LF). The comments do not contain NEL codes, but they can and
+   --  do contain other UTF-8 encoding sequences where one of the bytes is the
+   --  NEL code. Now such programs can of course be compiled in UTF-8 mode,
+   --  but in practice they also compile fine in standard 8-bit mode without
+   --  specifying a character encoding. Since this is common practice, it would
+   --  be a signficant upwards incompatibility to recognize NEL in 8-bit mode.
+
    -----------------
    -- Subprograms --
    -----------------
Index: s-utf_32.adb
===================================================================
--- s-utf_32.adb        (revision 181556)
+++ s-utf_32.adb        (working copy)
@@ -6,7 +6,7 @@
 --                                                                          --
 --                                 B o d y                                  --
 --                                                                          --
---          Copyright (C) 2005-2010, Free Software Foundation, Inc.         --
+--          Copyright (C) 2005-2011, Free Software Foundation, Inc.         --
 --                                                                          --
 -- GNAT is free software;  you can  redistribute it  and/or modify it under --
 -- terms of the  GNU General Public License as published  by the Free Soft- --
@@ -6182,6 +6182,7 @@
    function Is_UTF_32_Line_Terminator (U : UTF_32) return Boolean is
    begin
       return U in 10 .. 13     -- Ascii.LF Ascii.VT Ascii.FF Ascii.CR
+        or else U = 16#00085#  -- NEL
         or else U = 16#02028#  -- LINE SEPARATOR
         or else U = 16#02029#; -- PARAGRAPH SEPARATOR
    end Is_UTF_32_Line_Terminator;

[Ada] Properly recognize NEL as end of line in UTF-8 encoding mode

Reply via email to