Re: new module c-strstr

Bruno Haible Fri, 18 Aug 2006 11:48:08 -0700

Paul Eggert wrote:
> > /* The functions defined in this file assume the "C" locale and a character
> >    set without diacritics (ASCII-US or EBCDIC-US or something like that).
> >    Even if the "C" locale on a particular system is an extension of the ASCII
> >    character set (like on BeOS, where it is UTF-8, or on AmigaOS, where it
> >    is ISO-8859-1), the functions in this file recognize only the ASCII
> >    characters.  More precisely, one of the string arguments must be an ASCII
> >    string with additional restrictions.  */
> 
> The intent here is to act like the "C" locale, where all single bytes count
> as characters, ...


The "C" locale is not always a unibyte locale. On some systems, like
BeOS or MacOS X, even the C locale is a multibyte locale (with UTF-8
encoding). Therefore most of our "c-*" modules would be better named
"ascii-*" or "unibyte-*".

> even when some other locale is in effect, right?

The purpose is either to provide the semantics of a unibyte locale without
actually switching locales, or to provide the correct locale dependent
semantics through a speedier algorithm. I now see where the confusion comes
from: the first paragraph of comments highlights the first purpose; the
second highlights the second purpose; and they contradict each other.

> >    This function is safe to be called, even in a multibyte locale, if NEEDLE
> >    ...
> 
> I think this claim isn't true for some weird non-ASCII encoding
> schemes like DBCS-Host.

Are these used as locale encodings? Many of these so-called DBCS encodings
are stateful and therefore not usable as locale encodings.

Non-nearly-ASCII-compatible encodings don't appear in the world where GNU
programs are deployed. I added a check to gperf with the effect that if a
gperf-generated program is compiled in an environment with an encoding
that is not nearly ASCII compatible (testing only the printable characters,
not the control characters), it will lead to a compilation failure, and
ask for a bug report. No such bug report has ever been filed.
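
A compile-time check of this kind can be sketched with the preprocessor. This
is a shortened illustration, not gperf's actual output: it tests only a handful
of printable characters, and the macro name CHARSET_IS_NEARLY_ASCII is
hypothetical.

```c
/* Shortened illustration: fail the build unless a few printable
   characters have their ASCII codes.  Control characters are
   deliberately not tested.  */
#if !(' ' == 32 && '0' == 48 && 'A' == 65 && 'a' == 97 && '}' == 125)
# error "Character set is not nearly ASCII compatible; please report a bug."
#endif

/* Hypothetical marker macro, defined only when the check passed.  */
#define CHARSET_IS_NEARLY_ASCII 1
```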

> Also, it wouldn't be true if someone introduced a new encoding that
> varies from ASCII in some other way.

This is true, but the pace of creation of new encodings has slowed down
a lot in recent years. The most recently created encoding scheme is
GB18030, and that was 6 years ago. I expect that from now on, only minor
variations of existing encodings will be created.

> How about changing the wording to be: 
> 
>    In all practical encodings that we know of that are extensions or
>    near-extensions of ASCII, this function is safe to be called, even
>    in a multibyte locale, if NEEDLE ...

The "nearly an ASCII extension" assumption is ubiquitous; think of
(c >= '0') tests and similar. Do you really think it's worth mentioning?

> Another possibility would be to remove the claim entirely

But it's important to know that   c_strstr (s, "x")  is not safe and
c_strstr (s, "123")  is also not safe. The programmer needs to have the
precise criteria.
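
To illustrate why a one-byte needle is unsafe, here is a small sketch using
plain strstr as a stand-in for a bytewise search. It assumes the haystack is
BIG5 text holding the single character U+4E00, encoded as the bytes 0xA4 0x40;
the trail byte 0x40 equals ASCII '@'. The names big5_haystack and
bytewise_match are made up for this example.

```c
#include <string.h>

/* One BIG5 character (U+4E00): lead byte 0xA4, trail byte 0x40.
   The haystack contains no '@' character, only a byte that looks
   like one.  */
static const char big5_haystack[] = "\xA4\x40";

/* Return 1 if a bytewise search finds NEEDLE in the haystack.  */
static int
bytewise_match (const char *needle)
{
  return strstr (big5_haystack, needle) != NULL;
}
```

Here bytewise_match ("@") reports a match, although the match lies in the
middle of a multibyte character.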

> > foundneedle:
> >   return (char*) haystack;
> 
> The usual GNU style puts a space before the "*".

Yes. Fixed.

How about these comments? They don't talk about the C locale any more.

Bruno


======================== lib/strstr.h ==============================
/* Searching in a string.
   Copyright (C) 2001-2003, 2006 Free Software Foundation, Inc.

   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2, or (at your option)
   any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program; if not, write to the Free Software Foundation,
   Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.  */


/* The functions defined in this file assume a nearly ASCII compatible
   character set.  */


#ifdef __cplusplus
extern "C" {
#endif

/* Find the first occurrence of NEEDLE in HAYSTACK.
   This function is safe to be called, even in a multibyte locale, if NEEDLE
     1. consists solely of printable ASCII characters excluding '\\' and '~'
        [this restriction is needed because of Shift_JIS and JOHAB]
        or of the control ASCII characters '\a' '\b' '\f' '\n' '\r' '\t' '\v'
        [this restriction is needed because of VISCII], and
     2. has at least length 2
        [this restriction is needed because of BIG5, BIG5-HKSCS, GBK, GB18030,
         Shift_JIS, JOHAB], and
     3. does not consist entirely of decimal digits, or has at least length 4
        [this restriction is needed because of GB18030].
   This function is also safe to be called, even in a multibyte locale, if
   HAYSTACK and NEEDLE are known to both consist solely of printable ASCII
   characters excluding '\\' and '~'.  */
extern char *c_strstr (const char *haystack, const char *needle);

#ifdef __cplusplus
}
#endif
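
The three conditions on NEEDLE in the comment above can be sketched as a
predicate. This is only an illustration, not part of the proposed module: the
names needle_char_ok and needle_is_safe are hypothetical. It assumes an ASCII
execution character set, counts the space character as printable, and reads
condition 1 as "every character is either an allowed printable character or
one of the listed control characters".

```c
#include <stdbool.h>
#include <string.h>

/* Condition 1, per character: printable ASCII (0x20..0x7E) other than
   '\\' and '~', or one of the seven listed control characters.  */
static bool
needle_char_ok (unsigned char c)
{
  if (c >= 0x20 && c <= 0x7E)
    return c != '\\' && c != '~';
  switch (c)
    {
    case '\a': case '\b': case '\f': case '\n':
    case '\r': case '\t': case '\v':
      return true;
    default:
      return false;
    }
}

/* Return true if NEEDLE meets conditions 1 to 3 of the comment.  */
static bool
needle_is_safe (const char *needle)
{
  size_t len = strlen (needle);
  bool all_digits = true;
  size_t i;

  if (len < 2)                  /* condition 2 */
    return false;
  for (i = 0; i < len; i++)
    {
      unsigned char c = (unsigned char) needle[i];
      if (!needle_char_ok (c))  /* condition 1 */
        return false;
      if (!(c >= '0' && c <= '9'))
        all_digits = false;
    }
  return !all_digits || len >= 4;   /* condition 3 */
}
```

With this sketch, needle_is_safe ("x") and needle_is_safe ("123") are both
false, while needle_is_safe ("1234") and needle_is_safe ("ab") are true.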
