[Python-Dev] Regular expressions, Unicode etc.
I have needed to push my stack to teach REs (don't ask), and am
taking a look at the RE code. I may be able to extend it to support
RFE 694374 and (more importantly) atomic groups and possessive
quantifiers. While I regard such things as revolting beyond belief,
they make a HELL of a difference to the efficiency of recognising
things like HTML tags in a morass of mixed text.

The other approach, which is to stick to true regular expressions,
and wholly or partially convert to DFAs, has already been rendered
impossible by even the limited Perl/PCRE extensions that Python
has adopted.

My first question is whether this would clash with any ongoing
work, including being superseded by any changes in Python 3000.

Note that I am NOT proposing to do a fixed task, but will produce
a proper proposal only when I know what I can achieve for a small
amount of work. If the SRE engine turns out to be unsuitable to
extend in these ways, I shall quietly abandon the project.

My second one is about Unicode. I really, but REALLY regard it as
a serious defect that there is no escape for printing characters.
Any code that checks arbitrary text is likely to need them - yes,
I know why Perl and hence PCRE doesn't have that, but let's skip
that. That is easy to add, though choosing a letter is tricky.
Currently \c and \C, for 'character' (I would prefer 'text' or
'printable', but \t is obviously insane and \P is asking for
incompatibility with Perl and Java).

But attempting to rebuild the Unicode database hasn't worked.
Tools/unicode is, er, a trifle incomplete and out of date. The
only file I need to change is Objects/unicodetype_db.h, but the
init attempts to run Tools/unicode/makeunicodedata.py have not
been successful.

I may be able to reverse engineer the mechanism enough to get
the files off the Unicode site and run it, but I don't want to
spend forever on it. Any clues?
Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email: [EMAIL PROTECTED]
Tel.: +44 1223 334761    Fax: +44 1223 334679
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
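
To make the atomic-group point above concrete: an atomic group matches its contents once and never gives characters back when the rest of the pattern fails. The sketch below emulates that with a lookahead plus backreference, `(?=(...))\1`, on the assumption that the `re` engine does not backtrack into a lookahead once it has succeeded. Python 3.11 and later also accept `(?>...)` directly. The patterns and the sample text are illustrative only and are not taken from the message.

```python
import re

TEXT = "aaab"

# Ordinary greedy quantifier: "a+" first takes "aaa", then backtracks by
# one character so that the trailing "ab" can match.
backtracking = re.compile(r"a+ab")

# Emulated atomic group: the lookahead captures the longest run of "a",
# the backreference consumes exactly that run, and nothing is given back,
# so the following "a" can never be satisfied.
atomic_like = re.compile(r"(?=(a+))\1ab")

# Same emulation where the atomic run is followed by something it did
# not consume, so the match still succeeds.
atomic_ok = re.compile(r"(?=(a+))\1b")

print(backtracking.search(TEXT) is not None)
print(atomic_like.search(TEXT) is not None)
print(atomic_ok.search(TEXT).group(0))
```

The efficiency gain mentioned in the message comes from the second case: a failed match is rejected without retrying every shorter run of the repeated part.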
Re: [Python-Dev] Regular expressions, Unicode etc.
Further to the above, I found the Unicode sources, have rebuilt
the files, but it involved some fairly serious hacking to the
building mechanism and I have had to disable the Unicode 3.2
support. And, of course, that means that 4 of the tests fail.

This area needs addressing, not least because Python should
clearly be upgraded to Unicode 5.0.0 (which is what I am using)
at some stage.

I am not sure how best to report a bug that essentially says
"The build mechanisms for Unicode have suffered bit-rot, no longer
work and need redesigning." I could certainly do that, but it's
not helpful - people already know that, from the comments :-(

Regards,
Nick Maclaren
Re: [Python-Dev] Regular expressions, Unicode etc.
Nick Maclaren wrote:
> Further to the above, I found the Unicode sources, have rebuilt
> the files, but it involved some fairly serious hacking to the
> building mechanism and I have had to disable the Unicode 3.2
> support. And, of course, that means that 4 of the tests fail.
>
> This area needs addressing, not least because Python should
> clearly be upgraded to Unicode 5.0.0 (which is what I am using)
> at some stage.
>
> I am not sure how best to report a bug that essentially says
> "The build mechanisms for Unicode have suffered bit-rot, no longer
> work and need redesigning." I could certainly do that, but it's
> not helpful - people already know that, from the comments :-(

FWIW, there is a patch on the tracker at python.org/sf/1571184 that
may be helpful to you.

Georg

--
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.
Re: [Python-Dev] Regular expressions, Unicode etc.
> My second one is about Unicode. I really, but REALLY regard it as
> a serious defect that there is no escape for printing characters.
> Any code that checks arbitrary text is likely to need them - yes,
> I know why Perl and hence PCRE doesn't have that, but let's skip
> that. That is easy to add, though choosing a letter is tricky.
> Currently \c and \C, for 'character' (I would prefer 'text' or
> 'printable', but \t is obviously insane and \P is asking for
> incompatibility with Perl and Java).

Before discussing the escape, I'd like to see a specification of
it first - what characters precisely would classify as "printing"?

> But attempting to rebuild the Unicode database hasn't worked.
> Tools/unicode is, er, a trifle incomplete and out of date. The
> only file I need to change is Objects/unicodetype_db.h, but the
> init attempts to run Tools/unicode/makeunicodedata.py have not
> been successful.
>
> I may be able to reverse engineer the mechanism enough to get
> the files off the Unicode site and run it, but I don't want to
> spend forever on it. Any clues?

I see that you managed to do something here, so I'm not sure what
kind of help you still need.

Regards,
Martin
Re: [Python-Dev] Regular expressions, Unicode etc.
> Further to the above, I found the Unicode sources, have rebuilt
> the files, but it involved some fairly serious hacking to the
> building mechanism and I have had to disable the Unicode 3.2
> support. And, of course, that means that 4 of the tests fail.
>
> This area needs addressing, not least because Python should
> clearly be upgraded to Unicode 5.0.0 (which is what I am using)
> at some stage.

I recommend you use the 4.1 version of the database; this should
work out of the box, with no change to the build environment at
all.

As for updating it - that has to wait until the next release
of Python. At that point, 5.1 might be released, so 5.0 might
get skipped altogether.

> I am not sure how best to report a bug that essentially says
> "The build mechanisms for Unicode have suffered bit-rot, no longer
> work and need redesigning." I could certainly do that, but it's
> not helpful - people already know that, from the comments :-(

I would likely close such a report as "works for me" (after testing
it does - it did when I last ran it, which was before the release
of Python 2.5).

It did not suffer from bit-rot - it still works just fine for
the version of the database that is supported.

As for the need for redesigning - I don't see that need. What specific
aspect do you think needs redesigning? If you merely meant to say
"I don't understand the code" - this is not enough reason, I
remember it took me some time to understand it as well, but now
I see that it does precisely what it needs to do, and precisely
in the way it needs to do that.

Regards,
Martin
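
Since much of this exchange turns on which version of the Unicode database a given build supports, a quick check from the interpreter may help. This is a minimal sketch against the `unicodedata` module of a current CPython; the version strings it prints depend on the interpreter and are not taken from the thread. The `ucd_3_2_0` object is the frozen 3.2.0 snapshot kept alongside the main database (used, as far as I know, by `stringprep`).

```python
import unicodedata

# Version of the Unicode Character Database compiled into this interpreter.
current_version = unicodedata.unidata_version
print(current_version)

# Frozen Unicode 3.2.0 snapshot that sits next to the current database.
legacy_version = unicodedata.ucd_3_2_0.unidata_version
print(legacy_version)
```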
Re: [Python-Dev] Regular expressions, Unicode etc.
On 8-Aug-07, at 2:28 AM, Nick Maclaren wrote:

> I have needed to push my stack to teach REs (don't ask), and am
> taking a look at the RE code. I may be able to extend it to support
> RFE 694374 and (more importantly) atomic groups and possessive
> quantifiers. While I regard such things as revolting beyond belief,
> they make a HELL of a difference to the efficiency of recognising
> things like HTML tags in a morass of mixed text.

+1. I would use such a feature.

> The other approach, which is to stick to true regular expressions,
> and wholly or partially convert to DFAs, has already been rendered
> impossible by even the limited Perl/PCRE extensions that Python
> has adopted.

Impossible? Surely, a sufficiently-competent re engine could detect
when a DFA is possible to construct?

-Mike
Re: [Python-Dev] Regular expressions, Unicode etc.
[ I would appreciate not getting private copies as well. ]

"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> Before discussing the escape, I'd like to see a specification of
> it first - what characters precisely would classify as "printing"?

For basic ASCII and locale-based testing, whatever isprint() says.
Just as for isalpha().

For Unicode, whatever people agree! I use the criterion that it
has a defined category that doesn't start with 'C' - which is what
I think that most people will accept.

Regards,
Nick Maclaren
[Python-Dev] cc: "Martin v. Löwis"
Re: [Python-Dev] Regular expressions, Unicode etc.

"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> I recommend you use the 4.1 version of the database; this should
> work out of the box, with no change to the build environment at
> all.

I tried that, of course. See below.

> As for updating it - that has to wait until the next release
> of Python. At that point, 5.1 might be released, so 5.0 might
> get skipped altogether.

Very true.

> I would likely close such a report as "works for me" (after testing
> it does - it did when I last ran it, which was before the release
> of Python 2.5).

I think that you will find that you are using a non-standard
environment and set of Python sources. I started off with the
standard distribution.

> It did not suffer from bit-rot - it still works just fine for
> the version of the database that is supported.

Really? I have just checked 2.5.1, and the same defects are there.

> As for the need for redesigning - I don't see that need. What specific
> aspect do you think needs redesigning? If you merely meant to say
> "I don't understand the code" - this is not enough reason, I
> remember it took me some time to understand it as well, but now
> I see that it does precisely what it needs to do, and precisely
> in the way it needs to do that.

Well, here are a selection of the issues that I found:

The Makefile includes the command:
    ncftpget -R ftp.unicode.org . Public/MAPPINGS
Not merely is ncftpget not a standard utility, the current mappings
are no longer at that location. Indeed, I can see nothing useful in
that directory at present, though I haven't searched it in depth!

Looking through www.unicode.org, I could find the relevant files
for 5.0.0, but for no other version. No, I am NOT going to type
in over a megabyte of data from the PDF!

makeunicodedata.py has a reference to the Unicode 3.2 files, but
they are not present in the standard distribution, the Makefile
doesn't fetch them, and I can't find them.

makeunicodedata.py refers to (for example) UnicodeData.txt and
Modules/unicodedata_db.h as such, which rather requires it to be
run in a particular directory. I can find nothing in any file
even referring to this.

Having run it, running 'make all' does not rebuild Python correctly.
I couldn't be bothered to work out why, so I hit it with the usual
trick, 'make distclean'.

And, of course, it SHOULD be possible to upgrade the Unicode data
without having to change version of Python!

Regards,
Nick Maclaren
Re: [Python-Dev] Regular expressions, Unicode etc.
I am not on "Python 3000", so am restricting.

Mike Klaas <[EMAIL PROTECTED]> wrote:
>
> > I have needed to push my stack to teach REs (don't ask), and am
> > taking a look at the RE code. I may be able to extend it to support
> > RFE 694374 and (more importantly) atomic groups and possessive
> > quantifiers. While I regard such things as revolting beyond belief,
> > they make a HELL of a difference to the efficiency of recognising
> > things like HTML tags in a morass of mixed text.
>
> +1. I would use such a feature.

I think that I am getting somewhere, but I really dislike the style
of _sre.c. It has a very complex semi-stack, semi-finite-state design
and no comments on how it is supposed to work. And its memory
management looks like a recipe for leaks, so I may well introduce
some of them.

> > The other approach, which is to stick to true regular expressions,
> > and wholly or partially convert to DFAs, has already been rendered
> > impossible by even the limited Perl/PCRE extensions that Python
> > has adopted.
>
> Impossible? Surely, a sufficiently-competent re engine could detect
> when a DFA is possible to construct?

I doubt it. While it isn't equivalent to the halting problem, it IS
an intractable one! There are two problems:

Firstly, things like backreferences are an absolute no-no. They
are not regular, and REs with them in cannot be converted to DFAs.
That could be 'solved' by a parser that kicked out such constructions,
but it would get screams from many users.

Secondly, anything involving explicit or implicit negation can lead
to (if I recall) a super-exponential explosion in the size of the
DFA. That could be 'solved' by imposing a limit, but few people
would be able to predict when it would bite.

Thirdly, I would require notice of the question of whether capturing
parentheses could be supported, and what semantic changes would be
to which were set and how.

Regards,
Nick Maclaren
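
The size concern raised above can be seen even without negation. The sketch below is not the negation case from the message; it is the textbook example of a small NFA, for "the n-th symbol from the end is 'a'" (roughly `(a|b)*a(a|b){n-1}`), whose equivalent DFA needs 2**n states. The function name and the NFA encoding are made up for illustration; the method is the ordinary subset construction.

```python
def dfa_state_count(n):
    """Count DFA states reachable by subset construction from an NFA for
    'the n-th symbol from the end is a' over the alphabet {a, b}.

    NFA states are 0..n: state 0 loops on every symbol and also moves to
    state 1 on 'a'; states 1..n-1 advance on any symbol; state n accepts.
    """
    def step(states, symbol):
        nxt = set()
        for s in states:
            if s == 0:
                nxt.add(0)
                if symbol == "a":
                    nxt.add(1)
            elif s < n:
                nxt.add(s + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    todo = [start]
    while todo:
        current = todo.pop()
        for symbol in "ab":
            target = step(current, symbol)
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return len(seen)


for n in (1, 3, 10):
    print(n, dfa_state_count(n))
```

An NFA with n + 1 states turns into a DFA with 2**n states here, which is why a size limit would be hard for users to predict from the pattern text alone.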
[Python-Dev] Please help verify SF data dump imported into (future) new tracker
We are getting very close to moving over to the new tracker (hopefully
by the end of the month; no firm date yet, though, as we are still
planning things out)! Part of the transition is taking a data dump
provided by SourceForge and loading it into our Roundup instance. But
we need to make some effort to make sure SF's data dump is accurate and
that our import is good.

If you can, please go to SourceForge and choose some issue (bug, patch,
whatever), and then look up the corresponding issue at
http://bugs.python.org/ . If there is any discrepancy, please report it
at http://psf.upfronthosting.co.za/roundup/meta (the link is also listed
at the new tracker as where to report tracker problems) or to this
email.

-Brett

P.S.: If you want to help with the transition in other ways, you can
also help with the tracker docs at http://wiki.python.org/moin/TrackerDocs.
Re: [Python-Dev] Regular expressions, Unicode etc.
On 8-Aug-07, at 12:47 PM, Nick Maclaren wrote:
>
>>> The other approach, which is to stick to true regular expressions,
>>> and wholly or partially convert to DFAs, has already been rendered
>>> impossible by even the limited Perl/PCRE extensions that Python
>>> has adopted.
>>
>> Impossible? Surely, a sufficiently-competent re engine could detect
>> when a DFA is possible to construct?
>
> I doubt it. While it isn't equivalent to the halting problem, it IS
> an intractable one! There are two problems:
>
> Firstly, things like backreferences are an absolute no-no. They
> are not regular, and REs with them in cannot be converted to DFAs.
> That could be 'solved' by a parser that kicked out such constructions,
> but it would get screams from many users.
>
> Secondly, anything involving explicit or implicit negation can lead
> to (if I recall) a super-exponential explosion in the size of the
> DFA. That could be 'solved' by imposing a limit, but few people
> would be able to predict when it would bite.

Right. The analysis I envisioned would be more along the lines of
"if troublesome RE extensions are used, do not attempt to construct
a DFA". It could even be exposed via an alternate api
(re.compile_dfa()) that admitted a subset of the usual grammar.

> Thirdly, I would require notice of the question of whether capturing
> parentheses could be supported, and what semantic changes would be
> to which were set and how.

Capturing groups are rather integral to the python regex api and, as
you say, a major difficulty for DFA-based implementations. Sounds
like a task best left to a third-party package.

-Mike
Re: [Python-Dev] Regular expressions, Unicode etc.
>> Before discussing the escape, I'd like to see a specification of
>> it first - what characters precisely would classify as "printing"?
>
> For basic ASCII and locale-based testing, whatever isprint() says.
> Just as for isalpha().

In the medium term, locale-based testing will go away/be not
implementable (in particular, Py3k won't have a byte-oriented
character string type, so we can't use isprint). In general,
isprint is unsuitable since it doesn't support multi-byte
character sets.

> For Unicode, whatever people agree! I use the criterion that it
> has a defined category that doesn't start with 'C' - which is what
> I think that most people will accept.

-1. There must be a better specification than that.

Can you please explain the concept of "printing character"? If
you have a Unicode code point, how do you determine whether it
is printing? If rendering it would generate black pixels on white
background?

Regards,
Martin
Re: [Python-Dev] Regular expressions, Unicode etc.
"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> >> Before discussing the escape, I'd like to see a specification of
> >> it first - what characters precisely would classify as "printing"?
> >
> > For basic ASCII and locale-based testing, whatever isprint() says.
> > Just as for isalpha().
>
> In the medium term, locale-based testing will go away/be not
> implementable (in particular, Py3k won't have a byte-oriented
> character string type, so we can't use isprint). In general,
> isprint is unsuitable since it doesn't support multi-byte
> character sets.

Well, iswprint isn't so restricted :-) I don't see the relevance
of this, as EXACTLY the same problem applies to isalnum and \w.
If you can solve one problem (and you have to solve the latter),
you can solve the other.

> > For Unicode, whatever people agree! I use the criterion that it
> > has a defined category that doesn't start with 'C' - which is what
> > I think that most people will accept.
>
> -1. There must be a better specification than that.
>
> Can you please explain the concept of "printing character"? If
> you have a Unicode code point, how do you determine whether it
> is printing? If rendering it would generate black pixels on white
> background?

Eh? This is a character set we are talking about. The proposed
extensions to include font and colour are an aberration that I shall
thankfully be long retired before they hit.

Unicode has a two letter classification of each character, with the
main category being in upper case and the subsidiary one in lower.
Let's ignore the latter, as it is irrelevant. The main categories
are 'Z' (spaces), 'L' (letters), 'N' (numbers), 'S' (symbols),
'P' (punctuation), 'M' (marks) and 'C' (control characters). There
are some pretty weird entries in 'L' and 'N' and the difference
between 'S', 'P' and 'M' is arcane, to a degree.

But all of the categories except 'C' are things that display, and
'C' is mainly the ASCII controls we know and, er, love - with some
similar extras.

Obviously, unclassified characters should not be called printing,
and equally obviously controls shouldn't. There is no clear reason
why the others should not be - especially as the difference between
a modifying accent and a free-standing one is something so obscure
that most people don't even know that there IS one.

The point about an escape for printing characters is to check
for bad characters in text input, and the rule I mentioned is
fine for that. What's the problem with it?

Regards,
Nick Maclaren
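
The rule described above (a character is "printing" if its general category does not start with 'C') can be sketched with the `unicodedata` module. The function names are hypothetical and only illustrate the criterion; this is not a proposal for how an `re` escape would be implemented. Unassigned code points report category 'Cn', so they are rejected by the same test.

```python
import unicodedata


def is_printing(ch):
    """True if ch has a general category that does not start with 'C'."""
    return not unicodedata.category(ch).startswith("C")


def bad_characters(text):
    """Return (index, character) pairs for every non-printing character."""
    return [(i, ch) for i, ch in enumerate(text) if not is_printing(ch)]


print(is_printing("a"), is_printing(" "), is_printing("\n"))
print(bad_characters("ab\x00c"))
```

Note that under this rule spaces (category 'Zs') and combining marks count as printing. Python 3's own `str.isprintable()` uses a somewhat different rule, so the two should not be assumed to agree.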
Re: [Python-Dev] Regular expressions, Unicode etc.
>> In the medium term, locale-based testing will go away/be not
>> implementable (in particular, Py3k won't have a byte-oriented
>> character string type, so we can't use isprint). In general,
>> isprint is unsuitable since it doesn't support multi-byte
>> character sets.
>
> Well, iswprint isn't so restricted :-)
Yes. However, it is even more difficult to convert from
Py_UNICODE to wchar_t in general.
> I don't see the relevance
> of this, as EXACTLY the same problem applies to isalnum and \w.
There is no problem for isalnum: it will just go away if
byte-oriented characters go away. Fortunately, we have a
replacement for the Unicode case.
The relevance is that your specification of "printing character"
as "isprint returns true" is nearly useless, as it only applies
to byte-oriented characters.
> If you can solve one problem (and you have to solve the latter),
> you can solve the other.
Unicode-isalnum is defined as isalpha|isdecimal|isdigit|isnumeric.
isalpha means categories Ll, Lu, Lt, Lo, Lm. isdecimal means the
character has the decimal property. isdigit means the character has
the digit property. isnumeric means the character has the numeric
property.
>> Can you please explain the concept of "printing character"? If
>> you have a Unicode code point, how do you determine whether it
>> is printing? If rendering it would generate black pixels on white
>> background?
>
> Eh? This is a character set we are talking about. The proposed
> extensions to include font and colour are an aberration that I shall
> thankfully be long retired before they hit.
It was a proposal for a definition. English is not my native
language, and "printing character" means nothing to me. So
I kindly asked for a definition, and suggested one possibility.
I would not have guessed that you consider white-space characters
as "printing", as they don't actually print anything.
> The point about an escape for printing characters is to check
> for bad characters in text input, and the rule I mentioned is
> fine for that. What's the problem with it?
The problem is that you did not quite mention a rule, or else
I missed it.
You seem to be asking for being able to express "not a control
character". I propose that this is best done with UTS#18,
in which you would write
[\P{C}] # or \P{Other}
If this is what you want, I'm all in favor of having it
implemented.
Regards,
Martin
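
The definition of Unicode isalnum given in the message above (isalpha | isdecimal | isdigit | isnumeric) can be checked against the string methods of a current Python 3, which expose the same four predicates. This is a minimal sketch; the helper name and the sample characters are chosen here for illustration and do not come from the thread.

```python
def unicode_isalnum(ch):
    """Alphanumeric as the union of the four per-character predicates."""
    return ch.isalpha() or ch.isdecimal() or ch.isdigit() or ch.isnumeric()


# One sample per predicate, plus a character that satisfies none of them.
samples = {
    "a": "letter",
    "\u0663": "ARABIC-INDIC DIGIT THREE (decimal)",
    "\u00b2": "SUPERSCRIPT TWO (digit, not decimal)",
    "\u00bd": "VULGAR FRACTION ONE HALF (numeric only)",
    "_": "underscore (none of the four)",
}

for ch, label in samples.items():
    print(repr(ch), label, unicode_isalnum(ch))
```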
Re: [Python-Dev] cc: "Martin v. Löwis"
>> I would likely close such a report as "works for me" (after testing
>> it does - it did when I last ran it, which was before the release
>> of Python 2.5).
>
> I think that you will find that you are using a non-standard
> environment and set of Python sources.

Please trust me that I didn't. See below.

> Well, here are a selection of the issues that I found:
>
> The Makefile includes the command:
>     ncftpget -R ftp.unicode.org . Public/MAPPINGS
> Not merely is ncftpget not a standard utility, the current mappings
> are no longer at that location. Indeed, I can see nothing useful in
> that directory at present, though I haven't searched it in depth!

Ah, the makefile. I don't think you use it to create the Unicode
database. It's only good for generating the codecs (Lib/encodings).
AFAICT, the mappings are still where they always were: at the
location given in the Makefile.
(e.g. ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-15.TXT )

For generating the Unicode database, you need to download the files
manually.

> Looking through www.unicode.org, I could find the relevant files
> for 5.0.0, but for no other version. No, I am NOT going to type
> in over a megabyte of data from the PDF!

And nobody asks you to. Just use

http://www.unicode.org/Public/4.1.0/ucd/

(also available through ftp)

Did you really believe the Unicode consortium doesn't have the
old versions of the character database online? Do you think
they are complete fools?

> makeunicodedata.py has a reference to the Unicode 3.2 files, but
> they are not present in the standard distribution, the Makefile
> doesn't fetch them, and I can't find them.

Googling for "unicode 3.2 ucd" gives me

http://unicode.org/Public/3.2-Update/

as the top hit (of course, you have to know that they call
the character database "ucd" to invoke that query).

> makeunicodedata.py refers to (for example) UnicodeData.txt and
> Modules/unicodedata_db.h as such, which rather requires it to be
> run in a particular directory. I can find nothing in any file
> even referring to this.

Yes, that's something you have to know. Put the files into the
root directory of the source tree, then run makeunicodedata.py.

> And, of course, it SHOULD be possible to upgrade the Unicode data
> without having to change version of Python!

Well.

Regards,
Martin
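
The manual download step described above could be scripted roughly as follows. This is only a sketch in modern Python: the helper names are made up, the URL layout is the one quoted in the message for 4.1.0 (the 3.2 files live under a differently named directory, as noted there), and UnicodeData.txt is the only data file the thread actually names, so any further files the script needs would have to be read out of makeunicodedata.py itself.

```python
import os
import urllib.request

# Base URL as quoted in the message; the "/<version>/ucd/" layout is the
# one shown there for 4.1.0 and is assumed, not checked, for other versions.
UCD_BASE = "http://www.unicode.org/Public"


def ucd_url(version, filename):
    """Build the download URL for one Unicode Character Database file."""
    return f"{UCD_BASE}/{version}/ucd/{filename}"


def fetch_ucd_file(version, filename, source_root="."):
    """Download one UCD file into the root of the Python source tree."""
    target = os.path.join(source_root, filename)
    urllib.request.urlretrieve(ucd_url(version, filename), target)
    return target


print(ucd_url("4.1.0", "UnicodeData.txt"))
```

After fetching the files into the root of the source tree, the next step per the message is to run Tools/unicode/makeunicodedata.py from that same directory.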
