Re: special characters in filenames in error messages

Bruno Haible Sun, 14 Dec 2008 12:58:13 -0800

Karl Berry wrote:
> What I don't understand with your proposal is how this magical url
> vs. file bit is known.


When producing/printing an error message, this information has to come
from the program. The program certainly knows whether it has used a function
like fopen() or open_url() to access the contents it is complaining about.

When parsing an error message, such as in Emacs' 'grep' mode, a heuristic
is indeed required, unless the program is known to inspect only local files
or only URLs:
> For example, if someone runs Henri's validator 
> in Emacs, it seems to me that next-error is going to have heuristically
> guess whether it is a url to know how to interpret %'s.

Yes, it has to heuristically guess that. This heuristic is well-known:
it happens any time a user enters a file name into a browser's URL field.
For example, KDE 3 konqueror implements this heuristic by checking whether
a file with that name exists in the current directory:
  $ konqueror 'file:/foo.txt'
tests whether the current directory has a subdirectory 'file:' that contains
a file 'foo.txt'. If so, this file is opened. Otherwise the string is
interpreted as an URL and canonicalized to "file:///foo.txt".
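
A minimal Python sketch of this kind of "does the file exist?" heuristic
follows. It is only an illustration, not konqueror's actual code: the name
classify_location and the base_dir parameter are made up for the example,
POSIX-style paths are assumed, and the canonicalization of the URL is left out.

```python
import os


def classify_location(text, base_dir="."):
    """Guess whether `text` names a local file or an URL.

    Hypothetical helper: if `text`, taken as a path relative to
    `base_dir`, names something that exists, treat it as a file.
    Otherwise treat it as an URL and return it unchanged.
    """
    candidate = os.path.join(base_dir, text)
    if os.path.exists(candidate):
        return "file", candidate
    return "url", text
```

With a subdirectory 'file:' containing 'foo.txt', the string 'file:/foo.txt'
is classified as a file; without it, the same string is classified as an URL.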

If you find that the need for a heuristic is a problem, then the solution that
comes to mind is to use different markers to distinguish the two cases: Use
"..." to enclose a file name, but <...> to enclose an URL. Then the parsing is
unambiguous. (Using '<' and '>' to delimit an URL is a common convention, see
RFC 2396 section 2.4.3.) I'm still in favour of using the same escape syntax
for file names, namely '%nn', so that end users see only one escape syntax.
This heuristic-avoiding proposal looks like this:

==============================================================================
Proposal B:

   - For output of a filename in an error message:
     The escaped syntax is required if the filename contains a ':' or
     newline, or starts with a '"' or '<'. It may also be used for other
     filenames.
     In escaped syntax, a '"' is output, then the filename is output, with
     escaping:
       - Occurrences of '"' and '%' and newline are escaped as
         %22 and %25 and %0A, respectively,
       - Other US-ASCII characters may be escaped in %nn syntax as well,
         where nn is the hexadecimal notation (case insignificant)
         of the byte value in the US-ASCII encoding.
     Finally a '"' is output.
     Otherwise, the filename is output literally, without modifications.

   - For output of an URL in an error message:
     Remember that an URL or URI always has %nn escaping already applied
     (see RFC 2396, section 2.4.2) and therefore does not contain the
     characters '"', '<', '>' (see RFC 2396, section 2.4.3).
     A '<' is output, then the URL or URI is output literally, then a '>'
     is output.

   - For parsing:
     - If the first character is a '<', then it's an URL or URI. It
       ends at the next '>' character. The URL or URI is the substring
       from the leading '<' (exclusive) to the next '>' (exclusive).
     - If the first character is a '"', then it's a filename in escaped
       syntax. It ends at the next '"' character. Unescaping converts
       each %nn to the US-ASCII character with byte value nn.
     - Otherwise, it's a filename, and the filename ends at the first ':'
       or end of line.
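
As an illustration, the output and parsing rules above can be sketched in
Python roughly as follows. The function names (format_filename, format_url,
parse_location) are made up for this example. Two choices are assumptions
rather than part of the proposal: a %nn escape with nn >= 80 (hex) is left
untouched, since unescaping is limited to US-ASCII, and a missing closing
delimiter raises ValueError. The input line is assumed to have no trailing
newline.

```python
import re

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def format_filename(name):
    """Render a filename for an error message (output rule for filenames)."""
    needs_escape = (":" in name or "\n" in name
                    or name.startswith('"') or name.startswith("<"))
    if not needs_escape:
        return name  # output literally, without modifications
    # Escape '%' first so the '%' added by the later escapes stays intact.
    escaped = (name.replace("%", "%25")
                   .replace('"', "%22")
                   .replace("\n", "%0A"))
    return '"' + escaped + '"'


def format_url(url):
    """Render an URL, which is assumed to be %nn-escaped already."""
    return "<" + url + ">"


def _unescape(text):
    """Convert each %nn to the US-ASCII character with byte value nn."""
    def repl(match):
        value = int(match.group(1), 16)
        # Assumption: byte values outside US-ASCII are left escaped.
        return chr(value) if value < 0x80 else match.group(0)
    return _ESCAPE.sub(repl, text)


def parse_location(line):
    """Split a line into (kind, location, rest), kind is 'url' or 'file'."""
    if line.startswith("<"):
        end = line.index(">", 1)
        return "url", line[1:end], line[end + 1:]
    if line.startswith('"'):
        end = line.index('"', 1)
        return "file", _unescape(line[1:end]), line[end + 1:]
    end = line.find(":")
    if end == -1:
        end = len(line)
    return "file", line[:end], line[end:]
```

A filename such as 'foo.c' is output unchanged, while 'a:b.c' is output as
"a:b.c" with the quotes; parse_location() recovers the original name in both
cases without having to guess whether it is looking at a filename or an URL.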

Properties of this proposal:
  - The proposal handles both filenames and URLs or URIs.
  - The user only sees one escape syntax, namely the %nn notation defined
    by RFC 2396. (In other proposals, different escape syntaxes were used
    for filenames and for URLs.)
  - The user is already familiar with the '<...>' notation for URLs.
  - The user can copy&paste URLs from the output in all cases. (Just the
    part inside '<...>'.)
  - In most cases, filenames are output literally and can therefore be
    copy&pasted by the user. This holds both for filenames on Unix
    ("/f/00/bar" syntax) and filenames on Windows ("c:\f\00\bar" syntax).
  - The output does not contain newlines; therefore a program that parses
    the output can proceed by reading line by line.
  - The output can be parsed without prior knowledge of whether a
    location is a filename or an URL. It's explicit.
  - The output can be parsed without reference to a particular encoding
    for non-ASCII characters: Unescaping is limited to US-ASCII characters
    inside filenames.

==============================================================================

==============================================================================
Proposal A:

   - For output of a filename in an error message:
     The escaped syntax is required if the filename contains a ':' or
     newline, or starts with a '"' or '<'. It may also be used for other
     filenames.
     In escaped syntax, a '"' is output, then the filename is output, with
     escaping:
       - Occurrences of '"' and '%' and newline are escaped as
         %22 and %25 and %0A, respectively,
       - Other US-ASCII characters may be escaped in %nn syntax as well,
         where nn is the hexadecimal notation (case insignificant)
         of the byte value in the US-ASCII encoding.
     Finally a '"' is output.
     Otherwise, the filename is output literally, without modifications.

   - For output of an URL in an error message:
     Remember that an URL or URI always has %nn escaping already applied
     (see RFC 2396, section 2.4.2) and therefore does not contain the
     characters '"', '<', '>' (see RFC 2396, section 2.4.3).
     A '"' is output, then the URL or URI is output literally, then a '"'
     is output.

   - For parsing:
     - If the first character is a '"', then it's a filename or URL in
       escaped syntax. It ends at the next '"' character. Unescaping converts
       each %nn to the US-ASCII character with byte value nn. Some heuristic
       is used to distinguish between filename (with unescaping) and URL
       (no unescaping performed).
     - Otherwise, it's a filename, and the filename ends at the first ':'
       or end of line.
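
For comparison, here is a rough Python sketch of the parsing rules above.
The proposal does not say which heuristic to use, so looks_like_url() below
is purely an assumption made for this example (it accepts anything that
starts with "scheme://"); the names parse_location_a and is_url are
likewise made up. As before, %nn escapes outside US-ASCII are left untouched.

```python
import re

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://")


def looks_like_url(text):
    """Hypothetical heuristic: the text starts with 'scheme://'."""
    return _SCHEME.match(text) is not None


def _unescape(text):
    """Convert each %nn to the US-ASCII character with byte value nn."""
    def repl(match):
        value = int(match.group(1), 16)
        return chr(value) if value < 0x80 else match.group(0)
    return _ESCAPE.sub(repl, text)


def parse_location_a(line, is_url=looks_like_url):
    """Split a line into (kind, location, rest) using a heuristic."""
    if line.startswith('"'):
        end = line.index('"', 1)
        body, rest = line[1:end], line[end + 1:]
        if is_url(body):
            return "url", body, rest          # no unescaping performed
        return "file", _unescape(body), rest  # filename: unescape %nn
    end = line.find(":")
    if end == -1:
        end = len(line)
    return "file", line[:end], line[end:]
```

Note that with this particular heuristic a file 'x' inside a directory named
'http:' (output as "http://x") would be misread as an URL; any heuristic
will have some such case.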

Properties of this proposal:
  - The proposal handles both filenames and URLs or URIs.
  - The user only sees one escape syntax, namely the %nn notation defined
    by RFC 2396. (In other proposals, different escape syntaxes were used
    for filenames and for URLs.)
  - The user can copy&paste URLs from the output in all cases. (Just the
    part inside '"..."'.)
  - In most cases, filenames are output literally and can therefore be
    copy&pasted by the user. This holds both for filenames on Unix
    ("/f/00/bar" syntax) and filenames on Windows ("c:\f\00\bar" syntax).
  - The output does not contain newlines; therefore a program that parses
    the output can proceed by reading line by line.
  - The output can be parsed without reference to a particular encoding
    for non-ASCII characters: Unescaping is limited to US-ASCII characters
    inside filenames.

==============================================================================

> Anyway, this decision seems like a judgement to me, not something that
> can be considered provably better.

Avoiding the heuristic is undoubtedly an advantage. I prefer proposal B over A
myself. The use of '<' and '>' as delimiters for an URL also matches the
convention used in email and elsewhere on the web.

> And rms is the one whose judgement counts.  I'll write him about it unless
> you want to do so. 

You're welcome to write to him. Send either proposals A and B, or only B, or
others, as you like. I think he will need the "Properties" sections of the
proposals - it's not obvious to someone who has not participated in this
discussion which proposal has which properties.

Bruno
