One more bug:

The call to pcre2_compile_32 should be changed from:

   code = pcre2_compile_32(pattern_ucs, pattern.size(),
                           PCRE2_NO_UTF_CHECK | flags, &error_code,
                           &error_offset, 0);

To:

   code = pcre2_compile_32(pattern_ucs, pattern.size(),
                           *PCRE2_UTF | **PCRE2_UCP* | flags, &error_code,
                           &error_offset, 0);

Without *PCRE2_UTF*, proper Unicode semantics will not be applied (such as
properly handling case matching for non-ASCII characters).

*PCRE2_UCP*, is a little less obvious. I think it would make sense to
enable it, since we care more for correctness than performance. Here's what
the documentation has to say about it:

*“This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
\w, and some of the POSIX character classes. By default, only ASCII
characters are recognized, but if PCRE2_UCP is set, Unicode properties are
used instead to classify characters. More details are given in the section
on generic character types in the pcre2pattern page. If you set PCRE2_UCP,
matching one of the items it affects takes much longer.”*


Finally, I don't think it makes sense to use *PCRE2_NO_UTF_CHECK* since at
best it's a no-op (since we're using UTF-32) and at worst it can cause a
crash when trying to match an invalid string. That's not worth what little
performance benefit there is to gain from it.

Regards,
Elias

On 9 October 2017 at 11:12, Elias Mårtenson <loke...@gmail.com> wrote:

> I found another bug. ↓ is used to indicate that string indexes are
> requested, but the error message when multiple output types are requested
> is wrong:
>
> *      "foo" ⎕RE["⊂↓"] "bar"*
> DOMAIN ERROR+
>       'foo' ⎕RE['⊂↓']'bar'
>       ^             ^
> *      )more*
> Multiple ⎕RE output flags: '⊂↓'. Output flags are: ⊂⍳/
>
> Note the ⍳ in the error message instead of ↓.
>
> Regards,
> Elias
>
> On 9 October 2017 at 10:45, Elias Mårtenson <loke...@gmail.com> wrote:
>
>> I fixed the problem by adding a static_cast<PCRE2_SIZE>(len), but I
>> found another issue: The testcases file is missing.
>>
>> Regards,
>> Elias
>>
>> On 9 October 2017 at 10:41, Elias Mårtenson <loke...@gmail.com> wrote:
>>
>>> Thank you.
>>>
>>> There are some errors when compiling on my Arch system:
>>>
>>> g++ -DHAVE_CONFIG_H -I. -I..    -Wall -I sql -Wold-style-cast -Werror
>>> -I/usr/include -I/usr/include  -rdynamic -g -O2 -MT apl-Quad_RE.o -MD -MP
>>> -MF .deps/apl-Quad_RE.Tpo -c -o apl-Quad_RE.o `test -f 'Quad_RE.cc' || echo
>>> './'`Quad_RE.cc
>>> Quad_RE.cc: In static member function ‘static Value_P
>>> Quad_RE::partition_result(const Regexp&, const Quad_RE::Flags&, const
>>> UCS_string&)’:
>>> Quad_RE.cc:211:42: error: comparison between signed and unsigned integer
>>> expressions [-Werror=sign-compare]
>>>     for (ShapeItem match_id = 1; B_offset < len; match_id +=
>>> match_id_inc)
>>>                                  ~~~~~~~~~^~~~~
>>> cc1plus: all warnings being treated as errors
>>> make[3]: *** [Makefile:2725: apl-Quad_RE.o] Error 1
>>> make[3]: Leaving directory '/home/emartenson/src/apl/src'
>>> make[2]: *** [Makefile:3333: all-recursive] Error 1
>>> make[2]: Leaving directory '/home/emartenson/src/apl/src'
>>> make[1]: *** [Makefile:514: all-recursive] Error 1
>>> make[1]: Leaving directory '/home/emartenson/src/apl'
>>> make: *** [Makefile:401: all] Error 2
>>>
>>> Regards,
>>> Elias
>>>
>>> On 9 October 2017 at 00:47, Juergen Sauermann <
>>> juergen.sauerm...@t-online.de> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have merged Elias' *⎕RE* implementation into GNU APL.
>>>> Thanks, Elias, for contributing it. See *'info apl**'* for a
>>>> description
>>>> and *src/testcases/Q**uad_RE.tc* for examples of how to use *⎕RE*.
>>>>
>>>> *SVN 1012*.
>>>>
>>>> Enjoy,
>>>> /// Jürgen
>>>>
>>>>
>>>
>>
>

Reply via email to