Karl Berry wrote:
> What I don't understand with your proposal is how this magical url
> vs. file bit is known.

When producing/printing an error message, this information has to come
from the program. The program certainly knows whether it has used a
function like fopen() or open_url() to access the contents it is
complaining about.

When parsing an error message, such as in Emacs' 'grep' mode, then -
unless the program is known to inspect only local files or only URLs -
a heuristic is indeed required:

> For example, if someone runs Henri's validator
> in Emacs, it seems to me that next-error is going to have heuristically
> guess whether it is a url to know how to interpret %'s.

Yes, it has to heuristically guess that. This heuristic is well-known:
it happens any time a user enters a file name into a browser's URL field.
For example, KDE 3 konqueror implements this heuristic by checking whether
a file with that name exists in the current directory:

  $ konqueror 'file:/foo.txt'

tests whether the current directory has a subdirectory 'file:' and whether
that subdirectory has a file 'foo.txt'. If yes, this file is opened.
Otherwise the string is interpreted as an URL and canonicalized to
"file:///foo.txt".

If you find that the need for a heuristic is a problem, then the solution
that comes to mind is to use different markers to distinguish the two
cases: Use "..." to enclose a file name, but <...> to enclose an URL.
Then the parsing is unambiguous. (The use of '<' '>' to mark an URL is
widespread, see RFC 2396 section 2.4.3.) I'm still in favour of using
the same escape syntax for the file names, namely '%nn', so that end users
see only one escape syntax.

This heuristic-avoiding proposal looks like this:

==============================================================================
Proposal B:

  - For output of a filename in an error message:
    The escaped syntax is required if the filename contains a ':' or
    newline, or starts with a '"' or '<'. It may also be used for other
    filenames.
    In escaped syntax, a '"' is output,
    then the filename is output, with escaping:
      - Occurrences of '"' and '%' and newline are escaped as %22 and %25
        and %0A, respectively,
      - Other US-ASCII characters may be escaped in %nn syntax as well,
        where nn is the hexadecimal notation (case insignificant) of the
        byte value in the US-ASCII encoding.
    Finally a '"' is output.
    Otherwise, the filename is output literally, without modifications.

  - For output of an URL in an error message:
    Remember that an URL or URI always has %nn escaping already enabled
    (see RFC 2396, section 2.4.2) and therefore does not contain the
    characters '"', '<', '>' (see RFC 2396, section 2.4.3).
    A '<' is output, then the URL or URI is output literally, then a '>'
    is output.

  - For parsing:
      - If the first character is a '<', then it's an URL or URI. It ends
        at the next '>' character. The URL or URI is the substring from
        the leading '<' (exclusive) to the next '>' (exclusive).
      - If the first character is a '"', then it's a filename in escaped
        syntax. It ends at the next '"' character. Unescaping converts
        each %nn to the US-ASCII character with byte value nn.
      - Otherwise, it's a filename, and the filename ends at the first ':'
        or end of line.

Properties of this proposal:

  - The proposal handles both filenames and URLs or URIs.
  - The user only sees one escape syntax, namely the %nn notation defined
    by RFC 2396. (In other proposals, different escape syntaxes were used
    for filenames and for URLs.)
  - The user is already familiar with the '<...>' notation for URLs.
  - The user can copy&paste URLs from the output in all cases. (Just the
    part inside '<...>'.)
  - In most cases, filenames are output literally and can therefore be
    copy&pasted by the user. This holds both for filenames on Unix
    ("/f/00/bar" syntax) and filenames on Windows ("c:\f\00\bar" syntax).
  - The output does not contain newlines; therefore a program that parses
    the output can proceed by reading line by line.
  - The output can be parsed without prior knowledge of whether a
    location is a filename or an URL. It's explicit.
  - The output can be parsed without reference to a particular encoding
    for non-ASCII characters: Unescaping is limited to US-ASCII characters
    inside filenames.
==============================================================================

==============================================================================
Proposal A:

  - For output of a filename in an error message:
    The escaped syntax is required if the filename contains a ':' or
    newline, or starts with a '"' or '<'. It may also be used for other
    filenames.
    In escaped syntax, a '"' is output, then the filename is output, with
    escaping:
      - Occurrences of '"' and '%' and newline are escaped as %22 and %25
        and %0A, respectively,
      - Other US-ASCII characters may be escaped in %nn syntax as well,
        where nn is the hexadecimal notation (case insignificant) of the
        byte value in the US-ASCII encoding.
    Finally a '"' is output.
    Otherwise, the filename is output literally, without modifications.

  - For output of an URL in an error message:
    Remember that an URL or URI always has %nn escaping already enabled
    (see RFC 2396, section 2.4.2) and therefore does not contain the
    characters '"', '<', '>' (see RFC 2396, section 2.4.3).
    A '"' is output, then the URL or URI is output literally, then a '"'
    is output.

  - For parsing:
      - If the first character is a '"', then it's a filename or URL in
        escaped syntax. It ends at the next '"' character. Unescaping
        converts each %nn to the US-ASCII character with byte value nn.
        Some heuristic is used to distinguish between filename (with
        unescaping) and URL (no unescaping performed).
      - Otherwise, it's a filename, and the filename ends at the first ':'
        or end of line.

Properties of this proposal:

  - The proposal handles both filenames and URLs or URIs.
  - The user only sees one escape syntax, namely the %nn notation defined
    by RFC 2396.
    (In other proposals, different escape syntaxes were used for filenames
    and for URLs.)
  - The user can copy&paste URLs from the output in all cases. (Just the
    part inside '"..."'.)
  - In most cases, filenames are output literally and can therefore be
    copy&pasted by the user. This holds both for filenames on Unix
    ("/f/00/bar" syntax) and filenames on Windows ("c:\f\00\bar" syntax).
  - The output does not contain newlines; therefore a program that parses
    the output can proceed by reading line by line.
  - The output can be parsed without reference to a particular encoding
    for non-ASCII characters: Unescaping is limited to US-ASCII characters
    inside filenames.
==============================================================================

> Anyway, this decision seems like a judgement to me, not something that
> can be considered provably better.

Avoiding the heuristic is undoubtedly a good point. I prefer proposal B
over A myself. The use of '<' and '>' as delimiters for an URL also matches
the convention used in email and elsewhere on the web.

> And rms is the one whose judgement counts. I'll write him about it unless
> you want to do so.

You're welcome to write to him. Either proposals A and B, or only B, or
others, as you like. I think he will need the "Properties" sections of
the proposals - it's not obvious to someone who has not participated
in this discussion which proposal has which properties.

Bruno
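
For illustration only, the output and parsing rules of Proposal B could be
sketched in Python roughly as follows. This is a minimal sketch, not part
of the proposal itself: the function names (format_filename, format_url,
parse_location) are hypothetical, the optional escaping of other US-ASCII
characters is not used, raising ValueError on a missing closing delimiter
is an assumption, and so is leaving %nn sequences with nn >= 80 untouched
(based on the remark that unescaping is limited to US-ASCII).

```python
import re


def format_filename(name: str) -> str:
    """Render a filename for an error message (Proposal B output rule)."""
    needs_quotes = ":" in name or "\n" in name or name.startswith(('"', "<"))
    if not needs_quotes:
        return name  # literal output, no modifications
    # Escape '%' first so the '%' added by the later escapes is not re-escaped.
    escaped = name.replace("%", "%25").replace('"', "%22").replace("\n", "%0A")
    return '"' + escaped + '"'


def format_url(url: str) -> str:
    """Render a URL; it is assumed to be %nn-escaped already (RFC 2396)."""
    return "<" + url + ">"


# Assumption: only %00..%7F are unescaped, i.e. US-ASCII byte values.
_ASCII_ESCAPE = re.compile(r"%([0-7][0-9A-Fa-f])")


def _unescape(text: str) -> str:
    return _ASCII_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def parse_location(line: str):
    """Split a message line into (kind, location, rest_of_line)."""
    if line.startswith("<"):
        end = line.find(">", 1)
        if end == -1:
            raise ValueError("unterminated '<...>' location")
        return "url", line[1:end], line[end + 1:]  # URL is not unescaped
    if line.startswith('"'):
        end = line.find('"', 1)
        if end == -1:
            raise ValueError("unterminated quoted filename")
        return "file", _unescape(line[1:end]), line[end + 1:]
    end = line.find(":")
    if end == -1:
        return "file", line, ""  # filename runs to the end of the line
    return "file", line[:end], line[end:]
```

With these definitions, format_filename("a%b:c") produces "a%25b:c"
(including the quotes), and parse_location applied to that string followed
by ":12: error" recovers the original name a%b:c without needing any
heuristic to tell filenames from URLs.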