[Python-Dev] hierarchicial named groups extension to the re library

2005-04-02 Thread ottrey

I've written an extension to the re library, to provide a more
complete matching of hierarchical named groups in regular expressions.

I've set up a sourceforge project for it:

  http://pyre2.sourceforge.net/

re2 extracts a hierarchy of named groups matches from a string,
rather than the flat, incomplete dictionary that the
standard re module returns.

(i.e. the re library only returns the ~last~ match for named groups, not
a list of ~all~ the matches for the named groups.  And the hierarchy of
those named groups is non-existent in the flat dictionary of matches
that results.)

eg.

>>> import re
>>> buf='12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
>>> regex='^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'
>>> pat1=re.compile(regex)
>>> m=pat1.match(buf)
>>> m.groupdict()
{'verse': '10 lords a-leaping', 'number': '10',
'activity': 'lords a-leaping'}

>>> import re2
>>> buf='12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
>>> regex='^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'
>>> pat2=re2.compile(regex)
>>> x=pat2.extract(buf)
>>> x
{'verse': [{'number': '12', 'activity': 'drummers drumming'},
           {'number': '11', 'activity': 'pipers piping'},
           {'number': '10', 'activity': 'lords a-leaping'}]}



(See http://pyre2.sourceforge.net/ for more details.)
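
For comparison, here is a minimal sketch of how the same list of verses can be collected with the standard re module alone. It is not how re2 works internally; it simply moves the repetition out of the pattern and into finditer. The names verse_re and extract_verses are made up for this illustration.

```python
import re

buf = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

# Pattern for a single verse only; the repetition is handled by finditer
# rather than by a '*' quantifier around the group.
verse_re = re.compile(r'(?P<number>\d+) (?P<activity>[^,]+)')

def extract_verses(text):
    """Return one dict per verse instead of only the last match."""
    return {'verse': [m.groupdict() for m in verse_re.finditer(text)]}

result = extract_verses(buf)
print(result)
```

This works for one level of repetition, but every extra level of nesting needs another hand-written loop, which is the bookkeeping re2 is meant to take over.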


I am wondering what would be the best direction to take this project in.

Firstly, is it (or can it be made) useful enough to be included in the
Python stdlib?  (i.e. should I bother writing a PEP for it?)

And if so, would it be best to merge its functionality in with the re
library, or to leave it as a separate module?

Also, are there any suggestions/criticisms on the library itself?
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] hierarchicial named groups extension to the re library

2005-04-02 Thread ottrey

Nicolas Fleury wrote:
>
> ottrey at py.redsoft.be wrote:
> > >>> import re2
> > >>> buf='12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
> > >>> regex='^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'
> > >>> pat2=re2.compile(regex)
> > >>> x=pat2.extract(buf)
> > >>> x
> > {'verse': [{'number': '12', 'activity': 'drummers drumming'},
> >            {'number': '11', 'activity': 'pipers piping'},
> >            {'number': '10', 'activity': 'lords a-leaping'}]}
>
> Is a dictionary the good container or should another class be used?
> Because in the example the content of the "verse" group is lost,
> excluding its sub-groups.  Something like a hierarchic MatchObject could
> provide access to both information, the sub-groups and the group itself.

Yes, very good point.
Actually it ~is~ a container (that uses dict as its base class).
(I probably should add the following lines to the example.)

>>> type(x)

>>> x._value
'12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
>>> x.verse[0]._value
'12 drummers drumming'
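
A rough sketch of the kind of container described above: a dict subclass that also remembers the text its group matched and allows attribute-style access to sub-groups. The class name MatchNode and its constructor are invented for this illustration and are not re2's actual implementation.

```python
class MatchNode(dict):
    """Hypothetical stand-in: a dict of sub-groups that keeps the matched text."""

    def __init__(self, value, groups=()):
        super().__init__(groups)
        self._value = value  # full text matched by this group

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so _value
        # (a real instance attribute) never goes through here.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# Only the first verse is built here, to keep the example short.
verse0 = MatchNode('12 drummers drumming',
                   {'number': '12', 'activity': 'drummers drumming'})
x = MatchNode('12 drummers drumming, 11 pipers piping, 10 lords a-leaping',
              {'verse': [verse0]})

print(x.verse[0]._value)
print(x.verse[0].number)
```

Because it is still a dict, x['verse'][0]['number'] works as well, so the group's own text and its sub-groups are both reachable.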


Josiah Carlson jcarlson at uci.edu wrote:
> If one wanted to match the API of the re module, one should use
> pat2.findall(buf), which would return a list of 'hierarchical match
> objects'

Well, that would be something I'd want to discuss here.
As I'm not sure if I actually ~want~ to match the API of the re module.

> Also, should it be limited to named groups?

I have given that some thought as well.
Internally un-named groups are recursively given the names _group0,
_group1 etc as they are found.  And then those groups are recursively
matched. And in the final step the resulting _Match object is compressed
and those un-named groups are discarded.

IMO if you don't bother to name a group then you probably aren't going
to be interested in it anyway, so why keep a reference to it?

eg.
If you only wanted to extract the numbers from those verses...

>>> regex='^(((?P<number>\d+) ([^,]+))(, )?)*$'
>>> pat2=re2.compile(regex)
>>> x=pat2.extract(buf)
>>> x
{'number': ['12', '11', '10']}

Before the compression stage the _Match object actually looked like this:

{'_group0': {
    '_value': '12 drummers drumming, 11 pipers piping, 10 lords a-leaping',
    '_group0': [
        {'_value': '12 drummers drumming, ',
         '_group1': ', ',
         '_group0': {'_value': '12 drummers drumming',
                     '_group1': 'drummers drumming',
                     'number': '12'}},
        {'_value': '11 pipers piping, ',
         '_group1': ', ',
         '_group0': {'_value': '11 pipers piping',
                     '_group1': 'pipers piping',
                     'number': '11'}},
        {'_value': '10 lords a-leaping',
         '_group0': {'_value': '10 lords a-leaping',
                     '_group1': 'lords a-leaping',
                     'number': '10'}}]}}

But the compression algorithm collected the named groups and brought
them to the surface, to return the much nicer looking:

{'number': ['12', '11', '10']}
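
The compression step can be sketched roughly as follows. This is only a guess at the idea (walk the tree, skip '_value' and '_groupN' wrappers, hoist whatever named groups are found), not re2's actual algorithm. The function names are invented, and the rule that a name seen once collapses to a single value is an assumption.

```python
def compress(node):
    """Drop '_value' and '_groupN' keys, hoisting named groups to the top."""
    found = {}
    _collect(node, found)
    # Assumption: a name seen once maps to a single value,
    # a name seen several times keeps its list.
    return {name: vals[0] if len(vals) == 1 else vals
            for name, vals in found.items()}

def _collect(node, found):
    if isinstance(node, list):
        for item in node:
            _collect(item, found)
    elif isinstance(node, dict):
        for key, child in node.items():
            if key == '_value':
                continue
            if key.startswith('_group'):
                _collect(child, found)   # un-named: look inside, keep nothing
            else:
                found.setdefault(key, []).append(_named(child))

def _named(child):
    # A named group with named sub-groups keeps them; otherwise keep its text.
    if isinstance(child, dict):
        return compress(child) or child.get('_value')
    return child

raw = {'_group0': {
    '_value': '12 drummers drumming, 11 pipers piping, 10 lords a-leaping',
    '_group0': [
        {'_value': '12 drummers drumming, ', '_group1': ', ',
         '_group0': {'_value': '12 drummers drumming',
                     '_group1': 'drummers drumming', 'number': '12'}},
        {'_value': '11 pipers piping, ', '_group1': ', ',
         '_group0': {'_value': '11 pipers piping',
                     '_group1': 'pipers piping', 'number': '11'}},
        {'_value': '10 lords a-leaping',
         '_group0': {'_value': '10 lords a-leaping',
                     '_group1': 'lords a-leaping', 'number': '10'}}]}}

print(compress(raw))
```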


NB. There are also a few other tricks up the sleeve of re2.

eg.
It allows for named groups to be repeated in different branches of a
named group hierarchy, without the name redefinition error that the re
library will complain about.

eg.
>>> pat1=re2.compile(
...   '(?P<parents>(?P<mother>(?P<name>[\w ]+)),(?P<father>(?P<name>[\w ]+)))'
... )
>>> pat1.extract('Mum,Dad')
{'parents': {'father': {'name': 'Dad'}, 'mother': {'name': 'Mum'}}}
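
To illustrate the limitation being worked around: the standard re module refuses to compile a pattern that uses the same group name twice. The sketch below shows the error and one manual workaround with unique names. The workaround is only an illustration of what one has to do without re2, not how re2 does it.

```python
import re

# The standard re module allows each group name only once per pattern.
try:
    re.compile(r'(?P<mother>(?P<name>[\w ]+)),(?P<father>(?P<name>[\w ]+))')
    duplicate_ok = True
except re.error as exc:
    duplicate_ok = False
    print('re rejects the pattern:', exc)

# Manual workaround: use unique names, then rebuild the hierarchy by hand.
pat = re.compile(r'(?P<mother>[\w ]+),(?P<father>[\w ]+)')
m = pat.match('Mum,Dad')
parents = {'parents': {role: {'name': text}
                       for role, text in m.groupdict().items()}}
print(parents)
```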


> I find the feature very interesting, but being used to live without it,
> I have difficulty evaluating its usefulness.

Yes - this is a good point too, because it ~is~ different from the re
library.  re2 aims to do all that searching, grouping, iterating and
collecting and constructing work for you.

> However, it reminds me how much at first I found strange that only the
> last match was kept, so I think, FWIW, that from a purist point of view the
> functionality would make sense in the stdlib in some way or another.

Actually that "last match only" confusion was part of the motivation for
writing it in the first place.


> For .verse[1] or .verse[2] to make sense, it implies that the pattern is
> something like...
> ((?P... )(?P...))
> ... which it isn't.

Good pickup!
You've seen through my smoke and mirrors.  ;-)
That

Re: [Python-Dev] hierarchicial named groups extension to the re library

2005-04-03 Thread ottrey

Hi Gustavo!,

On 4/4/2005, "Gustavo Niemeyer" wrote:
>> Well, that would be something I'd want to discuss here.  As I'm not
>> sure if I actually ~want~ to match the API of the re module.
>
>If this feature is considered a good addition for the standard
>library, integrating it on re would be an interesting option.
>But given what you say above, I'm not sure if *you* want to
>make it a part of re itself.
>

After taking in the great comments made in this discussion, I'm now
thinking that it ~would~ be best to try and integrate the new
functionality with the existing re library (matching the current API),
as there is (at least some) re2 functionality that I think could fit
neatly into the existing re API.

As you say:
> This would avoid backward compatibility problems, would give each
> regular expression a single meaning, and would allow interleaving
> hierarchical/non-hierarchical groups.

>If we're going to introduce new features, we should try
>to do that without breaking the current well known meanings they
>have.

Agreed.

>I'm not in favor of that specific implementation.
>
>I'm open to discuss that further.

And I'm happy to work on a proposal that attempts to implement the new
functionality in a backward-compatible, integrated way.

> I offer myself to integrate the change

Thanx!  That'd be great.

> once we decide on the right way to implement it,
> and achieve consensus on its adoption.

Great.
So I'll conclude from this discussion that (some implementation of) re2
is indeed worth adding to the re library (once we achieve consensus).

And as for creating a PEP...

>Josiah Carlson wrote:
>In general, if developers can readily agree that a functionality should
>be added (i.e. it is "obvious" for some reason), it is added right away.
>Otherwise, a PEP should be written, and reviewed by the community

I'd like to call the current functionality a "work in progress".
ie. I'd like to work on it more, taking on board the comments made here.

I'd also like to take this discussion off the python-dev list now and
shift it to pyre2.  (possibly to come back with a more polished proposal.)

We've set up a development wiki here:

  http://py.redsoft.be/pyre2/wiki/

(feel free to add any more suggestions.)

And there is also a mailing list, if anyone is interested and would like
to subscribe:

  http://lists.sourceforge.net/lists/listinfo/pyre2-devel


Regards.

Chris.