Paul Eggert wrote:
> > /* The functions defined in this file assume the "C" locale and a character
> >    set without diacritics (ASCII-US or EBCDIC-US or something like that).
> >    Even if the "C" locale on a particular system is an extension of the ASCII
> >    character set (like on BeOS, where it is UTF-8, or on AmigaOS, where it
> >    is ISO-8859-1), the functions in this file recognize only the ASCII
> >    characters.  More precisely, one of the string arguments must be an ASCII
> >    string with additional restrictions.  */
>
> The intent here is to act like the "C", where all single bytes count
> as characters, ...

The "C" locale is not always a unibyte locale. On some systems, like BeOS or
MacOS X, even the C locale is a multibyte locale (with UTF-8 encoding).
Therefore most of our "c-*" modules would be better called "ascii-*" or
"unibyte-*".

> even when some other locale is in effect, right?

The purpose is either to provide the semantics of a unibyte locale without
actually switching locales, or to provide the correct locale-dependent
semantics through a speedier algorithm. I now see where the confusion comes
from: the first paragraph of comments highlights the first purpose; the second
highlights the second purpose; and they contradict each other.

> > This function is safe to be called, even in a multibyte locale, if NEEDLE
> > ...
>
> I think this claim isn't true for some weird non-ASCII encoding
> schemes like DBCS-Host.

Are these used as locale encodings? Many of these so-called DBCS encodings
are stateful and therefore not usable as locale encodings.

Encodings that are not nearly ASCII compatible don't appear in the world
where GNU programs are deployed. I added a check to gperf with the effect
that if a gperf-generated program is compiled in an environment with an
encoding that is not nearly ASCII compatible (testing only the printable
characters, not the control characters), the compilation fails and asks for
a bug report. No such bug report has ever been filed.

> Also, it wouldn't be true if someone introduced a new encoding that
> varies from ASCII in some other way.

This is true, but the pace of creation of new encodings has slowed down a lot
in recent years. The most recently created encoding scheme is GB-18030, and
that was 6 years ago. I expect that from now on, only minor variations of
existing encodings will be created.

> How about changing the wording to be:
>
>   In all practical encodings that we know of that are extensions or
>   near-extensions of ASCII, this function is safe to be called, even
>   in a multibyte locale, if NEEDLE ...

The "nearly an ASCII extension" assumption is ubiquitous: think of
(c >= '0') tests and similar. Do you really find it worth mentioning?

> Another possibility would be to remove the claim entirely

But it's important to know that c_strstr (s, "x") is not safe and
c_strstr (s, "123") is also not safe. The programmer needs to have the
precise criteria.

> > foundneedle:
> >   return (char*) haystack;
>
> The usual GNU style puts a space before the "*".

Yes. Fixed.

How about these comments? They don't talk about the C locale any more.

Bruno

======================== lib/strstr.h ==============================
/* Searching in a string.
   Copyright (C) 2001-2003, 2006 Free Software Foundation, Inc.

   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2, or (at your option)
   any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program; if not, write to the Free Software Foundation,
   Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.  */


/* The functions defined in this file assume a nearly ASCII compatible
   character set.  */


#ifdef __cplusplus
extern "C" {
#endif

/* Find the first occurrence of NEEDLE in HAYSTACK.
   This function is safe to be called, even in a multibyte locale, if NEEDLE
     1. consists solely of printable ASCII characters excluding '\\' and '~'
        [this restriction is needed because of Shift_JIS and JOHAB]
        or of the control ASCII characters '\a' '\b' '\f' '\n' '\r' '\t' '\v'
        [this restriction is needed because of VISCII], and
     2. has at least length 2
        [this restriction is needed because of BIG5, BIG5-HKSCS, GBK, GB18030,
         Shift_JIS, JOHAB], and
     3. does not consist entirely of decimal digits, or has at least length 4
        [this restriction is needed because of GB18030].
   This function is also safe to be called, even in a multibyte locale, if
   HAYSTACK and NEEDLE are known to both consist solely of printable ASCII
   characters excluding '\\' and '~'.  */
extern char *c_strstr (const char *haystack, const char *needle);

#ifdef __cplusplus
}
#endif
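
To illustrate why the programmer needs the precise criteria, the three
conditions on NEEDLE in the comment above can be checked mechanically. The
following is only a sketch under that reading of the comment; the helper name
needle_is_safe is hypothetical and is not part of lib/strstr.h or gnulib.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (an assumption, not part of gnulib): return true if
   NEEDLE meets the three criteria documented for c_strstr above.  */
static bool
needle_is_safe (const char *needle)
{
  /* Printable ASCII except '\\' and '~', plus the seven control
     characters listed in criterion 1.  */
  static const char allowed[] =
    " !\"#$%&'()*+,-./0123456789:;<=>?@"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`"
    "abcdefghijklmnopqrstuvwxyz{|}"
    "\a\b\f\n\r\t\v";
  size_t len = strlen (needle);

  /* Criterion 1: every character belongs to the allowed set.  */
  if (strspn (needle, allowed) != len)
    return false;

  /* Criterion 2: length at least 2.  */
  if (len < 2)
    return false;

  /* Criterion 3: not entirely decimal digits, unless length is at least 4.  */
  if (len < 4 && strspn (needle, "0123456789") == len)
    return false;

  return true;
}
```

With this sketch, needle_is_safe ("x") and needle_is_safe ("123") are both
false, matching the two unsafe calls mentioned earlier, while
needle_is_safe ("1234") and needle_is_safe ("12a") are true.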