On 01/20/2016 11:59 AM, Aharon Robbins wrote:
I don't see that. In regex.h:
| struct re_pattern_buffer
| {
| /* Space that holds the compiled pattern. It is declared as
| `unsigned char *' because its elements are sometimes used as
| array indexes. */
| unsigned char *__REPB_PREFIX(buffer);
Ah, that's probably the problem, and is why your version needs extra
casts between unsigned char * and re_dfa_t *. In gnulib, that last line
reads 'struct re_dfa_t *__REPB_PREFIX(buffer);', which is what the
buffer really is, and avoids the need for casts later.
Since malloc (0) is well-defined to return either NULL or a valid pointer
to a zero-byte object that can be freed, the code is working as-is.
Not on older systems. Although yes, I'm not sure I still support such
systems. I don't know which checker complained. I think it was valgrind
at some point in the past.
Those older systems are long dead, as this requirement was standardized
in C89 and even the oldest systems we formerly supported (SunOS 4, if I
recall) conformed to the standard long ago. valgrind 3.11 does not
complain about simple uses like free (malloc (0)) on my platform (Fedora
23 x86-64). Perhaps older versions of valgrind complain on some
platforms, but I'd rather not worry about that; it's really a valgrind bug.
I'd rather look for a solution that involves silencing the checking
without making the code bulkier and typically slower.
This is in compilation, not execution; not sure the difference
is really noticable.
There is some runtime slowdown and code bloat. You're right, it's not
significant, but it's the principle of the thing: I'd rather not bloat
the code just to pacify a busted checker.
If a byte value given inside [...] is greater than 127, it should
be left alone to be matched as-is (no arbitrary limits).
Hah! We recently ran into a similar problem with 'grep'. Previously,
part of the grep code assumed that unibyte locales cannot have encoding
errors, so in a unibyte locale one can just use the traditional
byte-oriented algorithm without worrying about mbrlen and the like.
Other parts of the grep code checked for encoding errors in the usual
way, e.g., with mbrlen. Unfortunately, the assumption about unibyte
locales is incorrect, and the inconsistencies between the different
parts of 'grep' caused confusing and arguably incorrect behavior which
took a bit of hackery to resolve; see the thread starting at
<http://bugs.gnu.org/20526#86>.
As far as 'grep' is concerned, it'll trust what regcomp does here, so we
do have some freedom to change the code in this area. However, it looks
to me like your patch would do the wrong thing for unibyte locales where
btowc (b) returns a value that neither b nor WEOF. Also, the rest the
code assumes that if btowc returns WEOF in a multibyte locale then there
won't be a match (see the setup code in init_dfa, and I have the nagging
feeling that this assumption is embedded elsewhere). So, how about the
attached more-conservative patch instead?
/* Build single byte matching table for this equivalence class. */
+ char_buf[1] = (unsigned char) '\0';
This should be unnecessary, as the rest of the code shouldn't be looking
at that byte. Is this something to pacify a lint checker?
Or valgrind at some point. I think a static checker that someone submitted
a report from.
It might be helpful to get at the bottom of this. If it's some buggy
static checker then I would rather not worry about it.
- wc = wc2;
+ wc = (wint_t) wc2;
wc and wc2 are both integers so the cast is unnecessary. Similarly elsewhere.
There was a compiler (maybe VMS) for which this was necessary.
Again, it'd be helpful to know what the problem actually was. I expect
it was just a warning, which is fine to ignore. It's valid standard C
code, for what it's worth.
From 8f17f9b379db88ce39c79e9bb4184a37bd04b363 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Thu, 21 Jan 2016 09:06:20 -0800
Subject: [PATCH] regex: treat [x] as x if x is a unibyte encoding error
Problem reported by Aharon Robbins in:
http://lists.gnu.org/archive/html/bug-gnulib/2016-01/msg00091.html
* lib/regcomp.c (parse_byte) [_LIBC && RE_ENABLE_I18N]: New function.
(build_range_exp) [_LIBC && RE_ENABLE_I18N]: Use it.
---
ChangeLog | 8 ++++++++
lib/regcomp.c | 17 +++++++++++++++--
2 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index 181f709..a870e86 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@
+2016-01-21 Paul Eggert <eggert@cs.ucla.edu>
+
+ regex: treat [x] as x if x is a unibyte encoding error
+ Problem reported by Aharon Robbins in:
+ http://lists.gnu.org/archive/html/bug-gnulib/2016-01/msg00091.html
+ * lib/regcomp.c (parse_byte) [_LIBC && RE_ENABLE_I18N]: New function.
+ (build_range_exp) [_LIBC && RE_ENABLE_I18N]: Use it.
+
2016-01-21 Bruno Haible <bruno@clisp.org>
hash-pjw-bare: fix comment
diff --git a/lib/regcomp.c b/lib/regcomp.c
index f86ce06..2ba99d7 100644
--- a/lib/regcomp.c
+++ b/lib/regcomp.c
@@ -2696,6 +2696,19 @@ parse_dup_op (bin_tree_t *elem, re_string_t *regexp, re_dfa_t *dfa,
#define BRACKET_NAME_BUF_SIZE 32
#ifndef _LIBC
+
+# ifdef RE_ENABLE_I18N
+/* Convert the byte B to the corresponding wide character. In a
+ unibyte locale, treat B as itself if it is an encoding error.
+ In a multibyte locale, return WEOF if B is an encoding error. */
+static wint_t
+parse_byte (unsigned char b, re_charset_t *mbcset)
+{
+ wint_t wc = __btowc (b);
+ return wc == WEOF && !mbcset ? b : wc;
+}
+#endif
+
/* Local function for parse_bracket_exp only used in case of NOT _LIBC.
Build the range expression which starts from START_ELEM, and ends
at END_ELEM. The result are written to MBCSET and SBCSET.
@@ -2747,9 +2760,9 @@ build_range_exp (const reg_syntax_t syntax,
: ((end_elem->type == COLL_SYM) ? end_elem->opr.name[0]
: 0));
start_wc = ((start_elem->type == SB_CHAR || start_elem->type == COLL_SYM)
- ? __btowc (start_ch) : start_elem->opr.wch);
+ ? parse_byte (start_ch, mbcset) : start_elem->opr.wch);
end_wc = ((end_elem->type == SB_CHAR || end_elem->type == COLL_SYM)
- ? __btowc (end_ch) : end_elem->opr.wch);
+ ? parse_byte (end_ch, mbcset) : end_elem->opr.wch);
if (start_wc == WEOF || end_wc == WEOF)
return REG_ECOLLATE;
else if (BE ((syntax & RE_NO_EMPTY_RANGES) && start_wc > end_wc, 0))
--
2.5.0