[Python-Dev] Regular expressions, Unicode etc.

2007-08-08 Thread Nick Maclaren
I have needed to push my stack to teach REs (don't ask), and am
taking a look at the RE code.  I may be able to extend it to support
RFE 694374 and (more importantly) atomic groups and possessive
quantifiers.  While I regard such things as revolting beyond belief,
they make a HELL of a difference to the efficiency of recognising
things like HTML tags in a morass of mixed text.
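
(To make the efficiency point concrete - a minimal sketch, with a made-up
pattern and input, of the catastrophic backtracking that atomic groups and
possessive quantifiers are designed to cut off:)

    import re, time

    # A nested quantifier: when the match must fail, the engine tries every
    # way of splitting the run of word characters between the inner and
    # outer '+', which is exponential in the length of the run.
    pat = re.compile(r"<(\w+\s*)+>")
    text = "<" + "a" * 24        # no closing '>', so the search must fail

    start = time.time()
    print(pat.search(text))      # None, but only after millions of retries
    print("%.1f seconds" % (time.time() - start))

    # An atomic group ("(?>\w+\s*)+" in Perl/PCRE spelling) or a possessive
    # quantifier ("(\w+\s*)++") would discard the saved backtracking
    # positions and fail immediately; that is the extension being proposed -
    # the stock engine has no such construct.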

The other approach, which is to stick to true regular expressions,
and wholly or partially convert to DFAs, has already been rendered
impossible by even the limited Perl/PCRE extensions that Python
has adopted.

My first question is whether this would clash with any ongoing
work, including being superseded by any changes in Python 3000.

Note that I am NOT proposing to do a fixed task, but will produce
a proper proposal only when I know what I can achieve for a small
amount of work.  If the SRE engine turns out to be unsuitable to
extend in these ways, I shall quietly abandon the project.



My second one is about Unicode.  I really, but REALLY regard it as
a serious defect that there is no escape for printing characters.
Any code that checks arbitrary text is likely to need them - yes,
I know why Perl and hence PCRE doesn't have that, but let's skip
that.  That is easy to add, though choosing a letter is tricky.
Currently I am using \c and \C, for 'character' (I would prefer 'text' or
'printable', but \t is obviously insane and \P is asking for
incompatibility with Perl and Java).

But attempting to rebuild the Unicode database hasn't worked.
Tools/unicode is, er, a trifle incomplete and out of date.  The
only file I need to change is Objects/unicodetype_db.h, but the
initial attempts to run Tools/unicode/makeunicodedata.py have not
been successful.

I may be able to reverse engineer the mechanism enough to get
the files off the Unicode site and run it, but I don't want to
spend forever on it.  Any clues?


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  [EMAIL PROTECTED]
Tel.:  +44 1223 334761    Fax:  +44 1223 334679


Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-08 Thread Nick Maclaren
Further to the above, I found the Unicode sources, have rebuilt
the files, but it involved some fairly serious hacking to the
building mechanism and I have had to disable the Unicode 3.2
support.  And, of course, that means that 4 of the tests fail.

This area needs addressing, not least because Python should
clearly be upgraded to Unicode 5.0.0 (which is what I am using)
at some stage.

I am not sure how best to report a bug that essentially says
"The build mechanisms for Unicode have suffered bit-rot, no longer
work and need redesigning."  I could certainly do that, but it's
not helpful - people already know that, from the comments :-(


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  [EMAIL PROTECTED]
Tel.:  +44 1223 334761    Fax:  +44 1223 334679


Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-08 Thread Georg Brandl
Nick Maclaren schrieb:
> Further to the above, I found the Unicode sources, have rebuilt
> the files, but it involved some fairly serious hacking to the
> building mechanism and I have had to disable the Unicode 3.2
> support.  And, of course, that means that 4 of the tests fail.
> 
> This area needs addressing, not least because Python should
> clearly be upgraded to Unicode 5.0.0 (which is what I am using)
> at some stage.
> 
> I am not sure how best to report a bug that essentially says
> "The build mechanisms for Unicode have suffered bit-rot, no longer
> work and need redesigning."  I could certainly do that, but it's
> not helpful - people already know that, from the comments :-(

FWIW, there is a patch on the tracker at python.org/sf/1571184 that may be
helpful to you.

Georg


-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.



Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-08 Thread Martin v. Löwis
> My second one is about Unicode.  I really, but REALLY regard it as
> a serious defect that there is no escape for printing characters.
> Any code that checks arbitrary text is likely to need them - yes,
> I know why Perl and hence PCRE doesn't have that, but let's skip
> that.  That is easy to add, though choosing a letter is tricky.
> Currently I am using \c and \C, for 'character' (I would prefer 'text' or
> 'printable', but \t is obviously insane and \P is asking for
> incompatibility with Perl and Java).

Before discussing the escape, I'd like to see a specification of
it first - what characters precisely would classify as "printing"?

> But attempting to rebuild the Unicode database hasn't worked.
> Tools/unicode is, er, a trifle incomplete and out of date.  The
> only file I need to change is Objects/unicodetype_db.h, but the
> initial attempts to run Tools/unicode/makeunicodedata.py have not
> been successful.
> 
> I may be able to reverse engineer the mechanism enough to get
> the files off the Unicode site and run it, but I don't want to
> spend forever on it.  Any clues?

I see that you managed to do something here, so I'm not sure
what kind of help you still need.

Regards,
Martin


Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-08 Thread Martin v. Löwis
> Further to the above, I found the Unicode sources, have rebuilt
> the files, but it involved some fairly serious hacking to the
> building mechanism and I have had to disable the Unicode 3.2
> support.  And, of course, that means that 4 of the tests fail.
> 
> This area needs addressing, not least because Python should
> clearly be upgraded to Unicode 5.0.0 (which is what I am using)
> at some stage.

I recommend you use the 4.1 version of the database; this should
work out of the box, with no change to the build environment at
all.

As for updating it - that has to wait until the next release
of Python. At that point, 5.1 might be released, so 5.0 might
get skipped altogether.

> I am not sure how best to report a bug that essentially says
> "The build mechanisms for Unicode have suffered bit-rot, no longer
> work and need redesigning."  I could certainly do that, but it's
> not helpful - people already know that, from the comments :-(

I would likely close such a report as "works for me" (after testing
it does - it did when I last ran it, which was before the release
of Python 2.5).
It did not suffer from bit-rot - it still works just fine for
the version of the database that is supported.

As for the need for redesigning - I don't see that need. What specific
aspect do you think needs redesigning? If you merely meant to say
"I don't understand the code" - this is not enough reason, I
remember it took me some time to understand it as well, but now
I see that it does precisely what it needs to do, and precisely
in the way it needs to do that.

Regards,
Martin


Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-08 Thread Mike Klaas
On 8-Aug-07, at 2:28 AM, Nick Maclaren wrote:

> I have needed to push my stack to teach REs (don't ask), and am
> taking a look at the RE code.  I may be able to extend it to support
> RFE 694374 and (more importantly) atomic groups and possessive
> quantifiers.  While I regard such things as revolting beyond belief,
> they make a HELL of a difference to the efficiency of recognising
> things like HTML tags in a morass of mixed text.

+1.  I would use such a feature.

> The other approach, which is to stick to true regular expressions,
> and wholly or partially convert to DFAs, has already been rendered
> impossible by even the limited Perl/PCRE extensions that Python
> has adopted.

Impossible?  Surely, a sufficiently-competent re engine could detect  
when a DFA is possible to construct?

-Mike


Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-08 Thread Nick Maclaren
[ I would appreciate not getting private copies as well. ]

"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> Before discussing the escape, I'd like to see a specification of
> it first - what characters precisely would classify as "printing"?

For basic ASCII and locale-based testing, whatever isprint() says.
Just as for isalpha().

For Unicode, whatever people agree!  I use the criterion that it
has a defined category that doesn't start with 'C' - which is what
I think that most people will accept.
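
(A minimal sketch of that criterion with today's unicodedata module; the
helper name is invented, nothing like it exists in the stdlib:)

    import unicodedata

    def is_printing(ch):
        # Sketch of the criterion described above: a character counts as
        # "printing" if its Unicode general category does not start with
        # 'C' (Cc, Cf, Cs, Co, Cn).  Unassigned code points come back as
        # 'Cn', so they are excluded automatically.
        return not unicodedata.category(ch).startswith("C")

    for ch in (u"A", u"\u00e9", u"\x07"):
        print("%r -> %s" % (ch, is_printing(ch)))
    # u'A' -> True, u'\xe9' -> True, u'\x07' -> False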


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  [EMAIL PROTECTED]
Tel.:  +44 1223 334761    Fax:  +44 1223 334679


[Python-Dev] cc: "Martin v. Löwis"

2007-08-08 Thread Nick Maclaren
Re: [Python-Dev] Regular expressions, Unicode etc.
"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> I recommend you use the 4.1 version of the database; this should
> work out of the box, with no change to the build environment at
> all.

I tried that, of course.  See below.

> As for updating it - that has to wait until the next release
> of Python. At that point, 5.1 might be released, so 5.0 might
> get skipped altogether.

Very true.

> I would likely close such a report as "works for me" (after testing
> it does - it did when I last ran it, which was before the release
> of Python 2.5).

I think that you will find that you are using a non-standard
environment and set of Python sources.  I started off with the
standard distribution.

> It did not suffer from bit-rot - it still works just fine for
> the version of the database that is supported.

Really?  I have just checked 2.5.1, and the same defects are there.

> As for the need for redesigning - I don't see that need. What specific
> aspect do you think needs redesigning? If you merely meant to say
> "I don't understand the code" - this is not enough reason, I
> remember it took me some time to understand it as well, but now
> I see that it does precisely what it needs to do, and precisely
> in the way it needs to do that.

Well, here are a selection of the issues that I found:

The Makefile includes the command:
ncftpget -R ftp.unicode.org . Public/MAPPINGS
Not merely is ncftpget not a standard utility, the current mappings
are no longer at that location.  Indeed, I can see nothing useful in
that directory at present, though I haven't searched it in depth!

Looking through www.unicode.org, I could find the relevant files
for 5.0.0, but for no other version.  No, I am NOT going to type
in over a megabyte of data from the PDF!

makeunicodedata.py has a reference to the Unicode 3.2 files, but
they are not present in the standard distribution, the Makefile
doesn't fetch them, and I can't find them.

makeunicodedata.py refers to (for example) UnicodeData.txt and
Modules/unicodedata_db.h as such, which rather requires it to be
run in a particular directory.  I can find nothing in any file
even referring to this.

Having run it, running 'make all' does not rebuild Python correctly.
I couldn't be bothered to work out why, so I hit it with the usual
trick, 'make distclean'.

And, of course, it SHOULD be possible to upgrade the Unicode data
without having to change version of Python!


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  [EMAIL PROTECTED]
Tel.:  +44 1223 334761    Fax:  +44 1223 334679


Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-08 Thread Nick Maclaren
I am not on the "Python 3000" list, so am restricting this reply to python-dev.

Mike Klaas <[EMAIL PROTECTED]> wrote:
>
> > I have needed to push my stack to teach REs (don't ask), and am
> > taking a look at the RE code.  I may be able to extend it to support
> > RFE 694374 and (more importantly) atomic groups and possessive
> > quantifiers.  While I regard such things as revolting beyond belief,
> > they make a HELL of a difference to the efficiency of recognising
> > things like HTML tags in a morass of mixed text.
> 
> +1.  I would use such a feature.

I think that I am getting somewhere, but I really dislike the style
of _sre.c.  It has a very complex semi-stack, semi-finite-state
design and no comments on how it is supposed to work.

And its memory management looks like a recipe for leaks, so I may
well introduce some of them.

> > The other approach, which is to stick to true regular expressions,
> > and wholly or partially convert to DFAs, has already been rendered
> > impossible by even the limited Perl/PCRE extensions that Python
> > has adopted.
> 
> Impossible?  Surely, a sufficiently-competent re engine could detect  
> when a DFA is possible to construct?

I doubt it.  While it isn't equivalent to the halting problem, it IS
an intractable one!  There are three problems:

Firstly, things like backreferences are an absolute no-no.  They
are not regular, and REs with them in cannot be converted to DFAs.
That could be 'solved' by a parser that kicked out such constructions,
but it would get screams from many users.

Secondly, anything involving explicit or implicit negation can lead
to (if I recall) a super-exponential explosion in the size of the
DFA.  That could be 'solved' by imposing a limit, but few people
would be able to predict when it would bite.

Thirdly, I would require notice of the question of whether capturing
parentheses could be supported, and what the semantics would be as
to which groups were set and how.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  [EMAIL PROTECTED]
Tel.:  +44 1223 334761    Fax:  +44 1223 334679


[Python-Dev] Please help verify SF data dump imported into (future) new tracker

2007-08-08 Thread Brett Cannon
We are getting very close to moving over to the new tracker (hopefully
by the end of the month; no firm date yet, though, as we are still
planning things out)!

Part of the transition is taking a data dump provided by SourceForge
and loading it into our Roundup instance.  But we need to make some
effort to make sure SF's data dump is accurate and that our import is
good.

If you can, please go to SourceForge and choose some issue (bug,
patch, whatever), and then look up the corresponding issue at
http://bugs.python.org/ .  If there is any discrepancy, please report
it at http://psf.upfronthosting.co.za/roundup/meta (the link is also
listed at the new tracker as where to report tracker problems) or to
this email.

-Brett

P.S.: If you want to help with the transition in other ways, you can
also help with the tracker docs at
http://wiki.python.org/moin/TrackerDocs.


Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-08 Thread Mike Klaas
On 8-Aug-07, at 12:47 PM, Nick Maclaren wrote:

>
>>> The other approach, which is to stick to true regular expressions,
>>> and wholly or partially convert to DFAs, has already been rendered
>>> impossible by even the limited Perl/PCRE extensions that Python
>>> has adopted.
>>
>> Impossible?  Surely, a sufficiently-competent re engine could detect
>> when a DFA is possible to construct?
>
> I doubt it.  While it isn't equivalent to the halting problem, it IS
> an intractable one!  There are three problems:
>
> Firstly, things like backreferences are an absolute no-no.  They
> are not regular, and REs with them in cannot be converted to DFAs.
> That could be 'solved' by a parser that kicked out such constructions,
> but it would get screams from many users.
>
> Secondly, anything involving explicit or implicit negation can lead
> to (if I recall) a super-exponential explosion in the size of the
> DFA.  That could be 'solved' by imposing a limit, but few people
> would be able to predict when it would bite.

Right.  The analysis I envisioned would be more along the lines of
"if troublesome RE extensions are used, do not attempt to construct a
DFA".  It could even be exposed via an alternate API (re.compile_dfa())
that admitted a subset of the usual grammar.
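
(A minimal sketch of the kind of up-front screening such an API might do;
the helper name and the list of disqualifying constructs are purely
illustrative, not an existing interface:)

    import re

    def dfa_compatible(pattern):
        # Purely illustrative: a pattern is a candidate for DFA construction
        # only if it avoids constructs a DFA cannot express - backreferences
        # (\1 .. \9, (?P=name)) and lookaround ((?=, (?!, (?<=, (?<!).
        return re.search(r"\\[1-9]|\(\?P=|\(\?<?[=!]", pattern) is None

    print(dfa_compatible(r"<[a-zA-Z][^>]*>"))   # True
    print(dfa_compatible(r"(\w+) \1"))          # False - backreference
    print(dfa_compatible(r"foo(?!bar)"))        # False - lookahead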

> Thirdly, I would require notice of the question of whether capturing
> parentheses could be supported, and what the semantics would be as
> to which groups were set and how.

Capturing groups are rather integral to the Python regex API and, as
you say, a major difficulty for DFA-based implementations.  Sounds
like a task best left to a third-party package.

-Mike


Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-08 Thread Martin v. Löwis
>> Before discussing the escape, I'd like to see a specification of
>> it first - what characters precisely would classify as "printing"?
> 
> For basic ASCII and locale-based testing, whatever isprint() says.
> Just as for isalpha().

In the medium term, locale-based testing will go away or not be
implementable (in particular, Py3k won't have a byte-oriented
character string type, so we can't use isprint). In general,
isprint is unsuitable since it doesn't support multi-byte
character sets.

> For Unicode, whatever people agree!  I use the criterion that it
> has a defined category that doesn't start with 'C' - which is what
> I think that most people will accept.

-1. There must be a better specification than that.

Can you please explain the concept of "printing character"? If
you have a Unicode code point, how do you determine whether it
is printing? If rendering it would generate black pixels on white
background?

Regards,
Martin


Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-08 Thread Nick Maclaren
"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> 
> >> Before discussing the escape, I'd like to see a specification of
> >> it first - what characters precisely would classify as "printing"?
> > 
> > For basic ASCII and locale-based testing, whatever isprint() says.
> > Just as for isalpha().
> 
> In the medium term, locale-based testing will go away or not be
> implementable (in particular, Py3k won't have a byte-oriented
> character string type, so we can't use isprint). In general,
> isprint is unsuitable since it doesn't support multi-byte
> character sets.

Well, iswprint isn't so restricted :-)  I don't see the relevance
of this, as EXACTLY the same problem applies to isalnum and \w.
If you can solve one problem (and you have to solve the latter),
you can solve the other.

> > For Unicode, whatever people agree!  I use the criterion that it
> > has a defined category that doesn't start with 'C' - which is what
> > I think that most people will accept.
> 
> -1. There must be a better specification than that.
> 
> Can you please explain the concept of "printing character"? If
> you have a Unicode code point, how do you determine whether it
> is printing? If rendering it would generate black pixels on white
> background?

Eh?  This is a character set we are talking about.  The proposed
extensions to include font and colour are an aberration that I shall
thankfully be long retired before they hit.

Unicode has a two letter classification of each character, with
the main category being in upper case and the subsidiary one in
lower.  Let's ignore the latter, as it is irrelevant.  The main
categories are 'Z' (spaces), 'L' (letters), 'N' (numbers),
'S' (symbols), 'P' (punctuation), 'M' (marks) and 'C' (control
characters).

There are some pretty weird entries in 'L' and 'N' and the
difference between 'S', 'P' and 'M' is arcane, to a degree.  But
all of the categories except 'C' are things that display, and
'C' is mainly the ASCII controls we know and, er, love - with
some similar extras.

Obviously, unclassified characters should not be called printing,
and equally obviously controls shouldn't.  There is no clear
reason why the others should not be - especially as the difference
between a modifying accent and a free-standing one is something
so obscure that most people don't even know that there IS one.

The point about an escape for printing characters is to check
for bad characters in text input, and the rule I mentioned is
fine for that.  What's the problem with it?
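
(The split described above can be seen directly from unicodedata; a quick,
purely illustrative tally over the BMP, with counts depending on the UCD
version Python was built against:)

    import unicodedata

    counts = {}
    for cp in range(0x10000):                 # BMP only, for speed
        major = unicodedata.category(unichr(cp))[0]
        counts[major] = counts.get(major, 0) + 1
    for major in sorted(counts):
        print("%s: %d" % (major, counts[major]))
    # 'C' collects the controls, format characters, surrogates and
    # unassigned code points; everything else is what the proposed escape
    # would treat as "printing".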


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  [EMAIL PROTECTED]
Tel.:  +44 1223 334761    Fax:  +44 1223 334679


Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-08 Thread Martin v. Löwis
>> In the medium term, locale-based testing will go away or not be
>> implementable (in particular, Py3k won't have a byte-oriented
>> character string type, so we can't use isprint). In general,
>> isprint is unsuitable since it doesn't support multi-byte
>> character sets.
> 
> Well, iswprint isn't so restricted :-) 

Yes. However, it is even more difficult to convert from
Py_UNICODE to wchar_t in general.

> I don't see the relevance
> of this, as EXACTLY the same problem applies to isalnum and \w.

There is no problem for isalnum: it will just go away if
byte-oriented characters go away. Fortunately, we have a
replacement for the Unicode case.

The relevance is that your specification of "printing character"
as "isprint returns true" is nearly useless, as it only applies
to byte-oriented characters.

> If you can solve one problem (and you have to solve the latter),
> you can solve the other.

Unicode-isalnum is defined as isalpha|isdecimal|isdigit|isnumeric.
isalpha means categories Ll, Lu, Lt, Lo, Lm. isdecimal means the
character has the decimal property. isdigit means the character has
the digit property. isnumeric means the character has the numeric
property.

>> Can you please explain the concept of "printing character"? If
>> you have a Unicode code point, how do you determine whether it
>> is printing? If rendering it would generate black pixels on white
>> background?
> 
> Eh?  This is a character set we are talking about.  The proposed
> extensions to include font and colour are an aberration that I shall
> thankfully be long retired before they hit.

It was a proposal for a definition. English is not my native
language, and "printing character" means nothing to me. So
I kindly asked for a definition, and suggested one possibility.
I would not have guessed that you consider white-space characters
as "printing", as they don't actually print anything.

> The point about an escape for printing characters is to check
> for bad characters in text input, and the rule I mentioned is
> fine for that.  What's the problem with it?

The problem is that you did not quite mention a rule, or else
I missed it.

You seem to be asking to be able to express "not a control
character". I propose that this is best done with UTS#18,
in which you would write

  [\P{C}] # or \P{Other}

If this is what you want, I'm all in favor of having it
implemented.
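
(The stdlib re module does not implement \p{...} properties; the third-party
"regex" package, which appeared well after this thread, does accept the
UTS#18 spelling - a hedged sketch assuming that package is installed:)

    import regex   # third-party package, not the stdlib re module

    print(regex.match(r"\P{C}", u"A"))       # matches: 'A' is not in category C
    print(regex.match(r"\P{C}", u"\x07"))    # None: BEL is category Cc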

Regards,
Martin


Re: [Python-Dev] cc: "Martin v. Löwis"

2007-08-08 Thread Martin v. Löwis
>> I would likely close such a report as "works for me" (after testing
>> it does - it did when I last ran it, which was before the release
>> of Python 2.5).
> 
> I think that you will find that you are using a non-standard
> environment and set of Python sources.

Please trust me that I didn't. See below.

> Well, here are a selection of the issues that I found:
> 
> The Makefile includes the command:
> ncftpget -R ftp.unicode.org . Public/MAPPINGS
> Not merely is ncftpget not a standard utility, the current mappings
> are no longer at that location.  Indeed, I can see nothing useful in
> that directory at present, though I haven't searched it in depth!

Ah, the makefile. I don't think you use it to create the Unicode database.

It's only good for generating the codecs (Lib/encodings)

AFAICT, the mappings are still where they always were: at the
location given in the Makefile. (e.g.
ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-15.TXT
)

For generating the Unicode database, you need to download the
files manually

> Looking through www.unicode.org, I could find the relevant files
> for 5.0.0, but for no other version.  No, I am NOT going to type
> in over a megabyte of data from the PDF!

And nobody asks you to. Just use

http://www.unicode.org/Public/4.1.0/ucd/

(also available through ftp)

Did you really believe the Unicode consortium doesn't have the
old versions of the character database online? Do you think
they are complete fools?

> makeunicodedata.py has a reference to the Unicode 3.2 files, but
> they are not present in the standard distribution, the Makefile
> doesn't fetch them, and I can't find them.

Googling for "unicode 3.2 ucd" gives me

http://unicode.org/Public/3.2-Update/

as the top hit (of course, you have to know that they call
the character database "ucd" to invoke that query).

> makeunicodedata.py refers to (for example) UnicodeData.txt and
> Modules/unicodedata_db.h as such, which rather requires it to be
> run in a particular directory.  I can find nothing in any file
> even referring to this.

Yes, that's something you have to know. Put the files into the
root directory of the source tree, then run makeunicodedata.py

> And, of course, it SHOULD be possible to upgrade the Unicode data
> without having to change version of Python!

Well.

Regards,
Martin