It's 2020.
GREP really should support Unicode. (UTF-16, UTF-8, with and without signature)
Format recognition wouldn't have to be automatic; command line switches would
be sufficient.
I am using version Git for Windows v2.25.0
Kind regards
: GREP does not support Unicode
Hi Carlo!
On Sat, 8 Aug 2020 15:13:40 +0200
wrote:
> Its 2020.
>
> GREP really should support Unicode. (UTF-16, UTF-8, with and without
> signature) Format recognition wouldn't have to be automatic; command
> line switches would be sufficie
The following patch increases performance of grep when looking at
binary data, without any side effects:
Summary
'cd grep; ./src/grep -Pc foo
/Users/carlo/Downloads/FreeBSD-13.0-BETA2-amd64.vhd' ran
1.77 ± 0.02 times faster than 'cd grep.base; ./src/grep -Pc foo
/Users/
And of course it has side effects (as shown by the test suite), and
would only help (if fixed) when the needle is a fixed string, which is
3x slower than doing -F, -G or -E.
Apologies for the distraction.
Carlo
, and JIT might be able to run the alteration fast enough
for most cases.
Hopefully this tiny change is better than the status quo, though.
Carlo
0001-pcre-allow-more-than-1-regular-expression.patch
Description: Binary data
On Sat, Oct 16, 2021 at 12:50 AM Paul Eggert wrote:
>
> On 10/16/21 12:00 AM, Carlo Arenas wrote:
> > With this patch, multiple expressions (from -e or -f) are now
> > acceptable with -P for easier side by side comparison with the other
> > supported engines.
>
>
On Sun, Nov 7, 2021 at 4:30 PM Paul Eggert wrote:
>
> On 11/7/21 11:26, Carlo Marcelo Arenas Belón wrote:
> > Mostly a bug by bug translation of the original code to the PCRE2 API.
> > but includes a couple of fixes as well that might be worth doing in
> > independen
On Mon, Nov 8, 2021 at 11:53 AM Paul Eggert wrote:
>
> On 11/8/21 01:47, Carlo Arenas wrote:
> > On Sun, Nov 7, 2021 at 4:30 PM Paul Eggert wrote:
>
> > Let me know how to help otherwise.
>
> The main thing from my point of view is that I'd like to know what tho
No
PCRE2 uses size_t, and it is the same (or a similar) unsigned type when
passed to sljit, so no Undefined Behaviour or overflow.
We might keep the limit in PCRE2 though, as it should be IMHO far
smaller anyway.
Carlo
Car
On Tue, Nov 9, 2021 at 10:28 AM Paul Eggert wrote:
>
> Than
On Tue, Nov 9, 2021 at 4:40 PM Paul Eggert wrote:
>
> On 11/9/21 11:04, Carlo Marcelo Arenas Belón wrote:
> > Severity: wishlist
> >
> > There are times, when the expression is too simple or will not be used too
> > often to justify the extra time in -P that i
On Sun, Nov 14, 2021 at 12:45 PM Paul Eggert wrote:
>
> On 11/9/21 02:58, Carlo Marcelo Arenas Belón wrote:
> > Sadly, hadn't been able to generate a release,
>
> Does this mean you're having trouble running 'make dist'? If so, what's
> the troub
On Sun, Nov 14, 2021 at 2:45 PM Jeffrey Walton wrote:
>
> On Sun, Nov 14, 2021 at 5:26 PM Carlo Arenas wrote:
> > On Sun, Nov 14, 2021 at 12:45 PM Paul Eggert wrote:
> > > ...
> > using idx_t instead of size_t should be fine (if only halves the max
> > size
On Sun, Nov 14, 2021 at 3:18 PM Carlo Arenas wrote:
> On Sun, Nov 14, 2021 at 2:45 PM Jeffrey Walton wrote:
> > On Sun, Nov 14, 2021 at 5:26 PM Carlo Arenas wrote:
> > > On Sun, Nov 14, 2021 at 12:45 PM Paul Eggert wrote:
> > > > ...
> > > using idx_t in
On Sun, Nov 14, 2021 at 7:18 PM Paul Eggert wrote:
> On 11/14/21 14:25, Carlo Arenas wrote:
> > using idx_t instead of size_t should be fine (if only halves the max
> > size of the objects managed), but I am concerned that assuming
> > PCRE2_SIZE_MAX is always equivalent to
you want
for your usecase and why it would be better if you quote it.
time echo "axyz" | grep '[abcd]xyz'
should behave as you expect, regardless of what the current directory has.
Carlo
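To illustrate the quoting advice above, here is a minimal sketch. The scratch directory and the file name `bxyz` are assumptions made up for the demonstration; they are not from the original report. With a matching file name in the current directory, the shell rewrites the unquoted pattern before grep ever sees it.

```shell
# Work in a scratch directory that holds a file whose name matches the glob.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch bxyz

# Unquoted: the shell expands [abcd]xyz to the file name bxyz before grep
# runs, so grep searches for "bxyz" and finds nothing in the input.
unquoted=$(echo "axyz" | grep [abcd]xyz || true)

# Quoted: grep receives the bracket expression itself and matches "axyz".
quoted=$(echo "axyz" | grep '[abcd]xyz' || true)

printf 'unquoted: "%s"\nquoted: "%s"\n' "$unquoted" "$quoted"
cd / && rm -rf "$dir"
```

Quoting the pattern makes the result independent of whatever happens to be in the current directory.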
Reported to PCRE[1] with mention of GNU grep being also affected.
[1] https://github.com/PCRE2Project/pcre2/issues/185
From c2d4a43b5b15df7c8853d591bf6ae872c602ed14 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?=
Date: Fri, 6 Jan 2023 19:34:56 -0800
Subject
Noticed while testing the previous patch, and which resulted in tests
being skipped for the wrong reason.
Carlo
0001-pcre-only-use-UTF-when-available-in-the-library.patch
Description: Binary data
introduce any changing behaviour or even code changes (because of the
expected optimization), but agree might have been too clever without a
corresponding explanation.
Carlo
Your suggested code doesn't address
that, it merely changes the error message with one that would be IMHO
even less clear and worsens the problem.
Using a non Unicode PCRE library is perfectly fine, and there is no
"undefined behavior" risk, and indeed `grep -P` without the UTF flag
is exactly what the alternate path uses and what is recommended for
speed, so?
Carlo
unicode is missing, and take into consideration
those tests that set multibyte locale were successful after my change,
so they will also need changes as they would misbehave silently
otherwise.
Carlo
gly; the loop is broken if any character is added to any of
the `()` branches which might mean that this is also unlikely to
happen in well formed expressions.
Carlo
PS. -P doesn't loop and neither does `echo a | grep -E '((a|())|())+'`
nor `'(()|(a|()))+'` nor `'(()|(()|a))+'`
On Mon, Apr 3, 2023 at 2:50 PM Paul Eggert wrote:
>
>* Disable PCRE2_UCP unless PCRE2 10.35 or higher.
this is because of a bug in JIT, alternatively JIT could be disabled
>* If ignoring case and PCRE2_MATCH_INVALID_UTF is defined, then
> enable PCRE2_NO_START_OPTIMIZE unless PCRE2 10.36
the next PCRE2 release.
Presume PCRE2 is a typo and should have been "git" here?
FWIW the PCRE2 fix[1] has been released already with 10.35 and
backporting to the Ubuntu 20.04 package that crashed in the original
report would also solve the crash with 10.34.
Carlo
[1] https://gith
On Mon, Apr 3, 2023 at 11:23 PM Paul Eggert wrote:
>
> On 2023-04-03 23:17, Carlo Arenas wrote:
> > On Mon, Apr 3, 2023 at 2:50 PM Paul Eggert wrote:
> >>
> >> * Disable PCRE2_UCP unless PCRE2 10.35 or higher.
> >
> > this is because of a bug i
therefore `\d` meaning `[0-9]` seems
"normal".
Carlo
CC: changed to the real email address for PCRE2 development, for full
context on this thread use [4]
[1] https://github.com/PCRE2Project/pcre2/pull/186
[2] https://unicode.org/reports/tr18/
[3] https://regex101.com/r/S5RW4c/1
[4] htt
t PCRE2 already does not implement every recommended aspect
> of UTS#18 syntax. PCRE2 also doesn't match Perl, which does support
> "\p{gc=Decimal_Number}".
Not sure I follow the whole logic here, but PCRE2[3] (search for
"general category" which is what the "
The original code was done in a way that would be useful during
porting, but that would hinder future work unnecessarily.
Carlo
0001-pcre-correct-overpessimistic-error-checking-of-pcre2.patch
Description: Binary data
On Tue, Apr 11, 2023 at 3:11 PM Paul Eggert wrote:
>
> On 4/10/23 23:47, Carlo Arenas wrote:
> > The original code was done in a way that would be useful during
> > porting, but that would hinder future work unnecessarily.
>
> Thanks, but wouldn't the attached patch
You can do that already with PCRE2 and a lookbehind:
echo abcedc|ggrep --color -P '(?=b)c'
On Tue, Apr 11, 2023 at 11:51 PM Carlo Arenas wrote:
>
> echo abcedc|ggrep --color -P '(?=b)c'
typo:
echo abcedc|ggrep --color -P '(?<=b)c'
`ggrep` would be called `grep` in your environment
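A small sketch comparing the two patterns from this exchange, assuming a grep built with PCRE2 support (`-P`) and the `-o` option; substitute `ggrep` if that is what GNU grep is called on your system. The variable names are illustrative only.

```shell
# Assumes grep supports -P (PCRE2) and -o (print only the matched text).
input=abcedc

# Lookahead (?=b)c: the next character would have to be both "b" and "c",
# which is impossible, so nothing is printed.
lookahead=$(echo "$input" | grep -oP '(?=b)c' || true)

# Lookbehind (?<=b)c: matches a "c" only when it directly follows a "b".
lookbehind=$(echo "$input" | grep -oP '(?<=b)c' || true)

printf 'lookahead: "%s"\nlookbehind: "%s"\n' "$lookahead" "$lookbehind"
```

Only the first `c` in `abcedc` follows a `b`, so the lookbehind form selects that one and skips the `c` after `d`.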
Just some nitpicking, but could we use single quotes around the '𝄞'
character in pcre-utf8-bug224 instead of double quotes?
Carlo
On Sat, May 13, 2023 at 7:48 AM Andreas Schwab wrote:
>
> On Mai 13 2023, Carlo Marcelo Arenas Belón wrote:
>
> > on linux m68k.
>
> ???
Well; the report didn't provide much information, so I made an educated guess.
Would you provide a more accurate description?
Al
That is a test for a bug that your system image has but that is not
relevant to grep (mbrlen doesn't correctly handle a call with a len of
0).
Carlo
On Fri, May 19, 2023 at 12:43 PM Carlo Marcelo Arenas Belón
wrote:
>
> On Thu, May 18, 2023 at 10:09:38PM +0200, Jim Meyering wrote:
> > On Thu, May 18, 2023 at 2:44 PM Carlo Marcelo Arenas Belón
> > wrote:
> > > On Wed, May 17, 2023 at 09:09:02PM
On Fri, Jun 9, 2023 at 12:06 AM Jaroslav Škarvada wrote:
> diff: in: Value too large for defined data type
This has nothing to do with the new glibc, but with the fact that your
diff is affected by bug#63492.
Upgrading to diffutils 3.10 should address that.
Carlo
ing is the solution, but grep already has a feature that could be
used to provide a solution as shown by the following scriptlet
(including a scaled-down data file):
$ cat > c.csv
USER,TIP
john,0
jane,10
carenas,100
$ ( grep -m1 USER && grep carenas ) < c.csv
USER,TIP
carenas,100
Carlo
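The scriptlet above can be wrapped for reuse. This is only a sketch: `grep_keep_header` is a hypothetical helper name, and it assumes GNU grep reading from a regular file, since it depends on grep leaving standard input positioned just after the line selected by `-m1`.

```shell
# Hypothetical helper: print the first line matching the header pattern,
# then only the later rows that match the row pattern.
# Relies on GNU grep leaving a regular-file stdin positioned just after the
# line selected by -m1, so the second grep resumes from there.
grep_keep_header() {
  header=$1 pattern=$2 file=$3
  ( grep -m1 -- "$header" && grep -- "$pattern" ) < "$file"
}

# Recreate the sample data from the scriptlet in a temporary file.
tmp=$(mktemp)
printf '%s\n' 'USER,TIP' 'john,0' 'jane,10' 'carenas,100' > "$tmp"

result=$(grep_keep_header USER carenas "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

The input has to be redirected from a file, as in the original scriptlet; with a pipe the first grep may consume more than the header line.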
Enable the PCRE2 flag that will be released with 10.43 to keep
[[:digit:]] ASCII just like it was done already for `\d`.
Carlo
0001-pcre-make-d-and-digit-consistent-in-UCP-mode.patch
Description: Binary data
sets
a strict minimum of 10.34 as that is required to pass all tests, even
if the issues are minimal and likely to be real bugs that the old PCRE
just hid; there is likely more work pending in this area.
Performance seems equivalent, and it also seems functionally complete.
Signed-off-by: Carlo
example:
/\A(?m:\s*^(?:#\w+.*\s*|extern\s+.+)$)*+(?<namespace>\s*namespace(?:\s+utTestNamespace\s*(?>(?<block>{(?:[^{}]*(?&block)*)*}))|(\s*[\w:]*\s*{)(?&namespace)\s*}))\s*\z/
Carlo
[1] https://www.pcre.org/current/doc/html/pcre2pattern.html#internaloptions
like value by
sljit.
Alternatively, a smaller maximum could be selected as it has been
documented[1] that more than 1MB would be unrealistic.
[1] https://www.pcre.org/original/doc/html/pcrejit.html#SEC8
Signed-off-by: Carlo Marcelo Arenas Belón
---
src/pcresearch.c | 4
1 file changed, 4
in #51710[1]
Carlo
[1] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=51710
>From 29c2f2238ed58ceb4101687f3aae7265f6839025 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?=
Date: Mon, 8 Nov 2021 21:27:03 -0800
Subject: [PATCH v2] pcre: migrate to pcre2
MIME-Version: 1.
rom caeca5e806fe1b2e368833f05bb4cfb75763d1b3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?=
Date: Sat, 16 Oct 2021 01:38:11 -0700
Subject: [PATCH] pcre: add a flag to disable JIT
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8
expected
LF characters, but a full fix will have to wait until PCRE2.
Signed-off-by: Carlo Marcelo Arenas Belón
---
tests/pcre-context | 40 ++--
1 file changed, 22 insertions(+), 18 deletions(-)
diff --git a/tests/pcre-context b/tests/pcre-context
index
On Mon, Nov 15, 2021 at 08:17:02AM -0800, Paul Eggert wrote:
> On 11/14/21 20:44, Carlo Arenas wrote:
>
> > > This shouldn't be a problem in practice. Surely PCRE2_SIZE_MAX is for
> > > forward compatibility to a potential future version of PCRE2 that may
> > &
On Mon, Nov 15, 2021 at 03:24:41PM -0800, Paul Eggert wrote:
> On 11/15/21 12:49, Carlo Marcelo Arenas Belón wrote:
>
> > Apologies, I realize it is difficult to talk about code in abstract when
> > not inlined, but I think it will better addressed by "fixing" it
instead.
Alternatively JIT could be disabled instead, but the option selected has
less of an impact on performance.
Carlo
>From 9194c8e9f9ca7315c2e8c25a7986d0690fb31d7c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?=
Date: Thu, 20 Apr 2023 18:37:20 -0700
Subject: [PA
On Fri, Apr 21, 2023 at 11:42:50AM -0700, Paul Eggert wrote:
> On 2023-04-20 19:04, Carlo Marcelo Arenas Belón wrote:
> > All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF had a bug on
> > its JIT implementation that results in failure to match for the negative
> > pe
Building against a different version of PCRE2 than the one that is provided
with the system is complicated by the fact that unlike what is advertised,
if a pkg-config module for libpcre2-8 is found, it will override the values
that were provided with PCRE_CFLAGS and PCRE_LIBS.
Carlo
>F
Would the attached work around the issue?
Carlo
>From 1fb2147cead1d201b64f4b17154181cd6278eb7f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?=
Date: Sat, 13 May 2023 07:28:35 -0700
Subject: [PATCH] tests: skip y2038 test upon compare failure
* tests/y2038-vs
Could you apply the attached patch?
Carlo
>From b19df9fa4402349e8ae3c35f0e3738f66d354d59 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?=
Date: Sat, 13 May 2023 07:28:35 -0700
Subject: [PATCH v2] tests: protect y2038 against diff failures
* tests/y2038-vs-32-
would work around the diffutils bug in the test, and show that
grep is working.
Carlo
[1] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=63492
>From 635b53c17492dbf0233c9b803e5a21c82e36d7f5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?=
Date: Sat, 13 May 2023
see this is part of the gnulib tests.
Carlo
>From d1adf4035c89d4f215ccff48643df7784fbde5ba Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?=
Date: Tue, 16 May 2023 00:11:24 -0700
Subject: [PATCH] gnulib: avoid mbrlen-tests
Since e319a8 (grep: improve perfor
On Thu, May 18, 2023 at 10:09:38PM +0200, Jim Meyering wrote:
> On Thu, May 18, 2023 at 2:44 PM Carlo Marcelo Arenas Belón
> wrote:
> > On Wed, May 17, 2023 at 09:09:02PM -0400, Caleb Zulawski wrote:
> > >
> > > Isn’t this test too strict, then?
> >
> &g
53 matches