Re: [Python-Dev] Unicode database

2007-08-09 Thread Nick Maclaren
"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> > I think that you will find that you are using a non-standard
> > environment and set of Python sources.
>
> Please trust me that I didn't. See below.

I always trust people as much as I trust myself, but I do tend to
check up.  See below.

> Ah, the makefile. I don't think you use it to create the Unicode database.
> 
> It's only good for generating the codecs (Lib/encodings)

Yes, but it DOES attempt to download the mappings, and is the ONLY
script which attempts to do so.

beelzebub$find Python-2.5.1 -type f | wc
   3458    3460  135981
beelzebub$find Python-2.5.1 -type f | xargs grep ftp.unicode.org
Python-2.5.1/Doc/lib/libunicodedata.tex:4.1.0 which is publicly available from \url{ftp://ftp.unicode.org/}.
grep: Python-2.5.1/Mac/Icons/Disk: No such file or directory
grep: Image.icns: No such file or directory
grep: Python-2.5.1/Mac/Icons/Python: No such file or directory
grep: Folder.icns: No such file or directory
Python-2.5.1/Misc/NEWS:  at ftp.unicode.org and contain a few updates (e.g. the Mac OS
Python-2.5.1/Tools/unicode/Makefile:# files available at ftp://ftp.unicode.org/
Python-2.5.1/Tools/unicode/Makefile:ncftpget -R ftp.unicode.org . Public/MAPPINGS
Python-2.5.1/Tools/unicode/gencodec.py:site (ftp://ftp.unicode.org/Public/MAPPINGS/) and creates Python codec
Python-2.5.1/Tools/unicode/python-mappings/TIS-620.TXT:#   ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-11.TXT the
Python-2.5.1/Tools/unicode/python-mappings/TIS-620.TXT:#   ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-11.TXT
Python-2.5.1/Tools/unicode/python-mappings/KOI8-U.TXT:#   ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
Python-2.5.1/Tools/unicode/python-mappings/CP1140.TXT:#   ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP037.TXT
Python-2.5.1/Modules/unicodedata.c:4.1.0 which is publically available from ftp://ftp.unicode.org/.\n

> AFAICT, the mappings are still where they always were: at the
> location given in the Makefile. (e.g.
> ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-15.TXT
> )

Then you DEFINITELY are using a non-standard set of files.  That
above was from the source of Python 2.5.1 that I have just downloaded.

> Did you really believe the Unicode consortium doesn't have the
> old versions of the character database online? Do you think
> they are complete fools?

Please don't be offensive.  I said that I had failed to find them,
after searching the Unicode Web site.  Now that you have given me
the actual file name, I can find them, but searching on the version
and request for that database leads to unhelpful files.

> Googling for "unicode 3.2 ucd" gives me
> 
> http://unicode.org/Public/3.2-Update/
> 
> as the top hit (of course, you have to know that they call
> the character database "ucd" to invoke that query).

Generally, I distrust Google for such things, as it is as likely
to lead you to the wrong information as to the right.  For example,
that hit you found was on a different logical server, and could
well be an incorrect version of the database.  It is VERY common
for such things to 'escape' into Google.

Have you checked whether or not that file is correct with the
Unicode consortium?


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  [EMAIL PROTECTED]
Tel.:  +44 1223 334761    Fax:  +44 1223 334679
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-09 Thread Nick Maclaren
"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> There is no problem for isalnum: it will just go away if
> byte-oriented characters go away. Fortunately, we have a
> replacement for the Unicode case.

As we do for isprint.

> The relevance is that your specification of "printing character"
> as "isprint returns true" is nearly useless, as it only applies
> to byte-oriented characters.

Eh?  That's ALL I used it to specify!  I used a Unicode-based
specification for Unicode.

> Unicode-isalnum is defined as isalpha|isdecimal|isdigit|isnumeric.
> isalpha means categories Ll, Lu, Lt, Lo, Lm. isdecimal means
> character has the decimal property. isdigit means the character has
> the digit property. isnumeric means the character has the numeric
> property.

I sincerely hope it isn't!

Using a mixture of categories and properties is truly horrible,
because it isn't unlikely that some future version of Unicode will
introduce anomalies, even if there aren't any there already.  And
the character aliases file doesn't include any properties called
'digit' or 'decimal' or anything much like them, so they need a
painful amount of reverse engineering to determine what characters
they bind to.  It LOOKS as if they are the subcategories, which
would be OK.

A much cleaner and more future-proof specification would be any
category beginning with 'L' or 'N'.  For example, Unicode doesn't
CURRENTLY have a category for indeterminate numbers or sacred
case, such as are used in some languages, but it isn't implausible
that it would add them :-)
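
To make the two definitions concrete, here is a minimal sketch in Python 3 syntax. The helper names category_alnum, method_alnum and disagreements are made up for illustration and are not part of any library. It compares the purely category-based test proposed above with the union of the four string predicates quoted from Martin, and makes no claim about which characters, if any, the two disagree on in a given Unicode version.

```python
import unicodedata


def category_alnum(ch):
    """Category-based test: general category starts with 'L' or 'N'."""
    return unicodedata.category(ch)[0] in ("L", "N")


def method_alnum(ch):
    """Union of the four predicates quoted above."""
    return ch.isalpha() or ch.isdecimal() or ch.isdigit() or ch.isnumeric()


def disagreements(limit=0x3000):
    """Code points below `limit` on which the two tests differ."""
    return [hex(cp) for cp in range(limit)
            if category_alnum(chr(cp)) != method_alnum(chr(cp))]


if __name__ == "__main__":
    # Show category and both verdicts for a few sample characters.
    for ch in ("a", "5", "\u2167", " ", "_"):
        print(repr(ch), unicodedata.category(ch),
              category_alnum(ch), method_alnum(ch))
    # Any code points listed here depend on the Unicode data shipped
    # with the interpreter that runs the sketch.
    print(disagreements())
```

Running disagreements() against a particular interpreter is one way to check whether the mixture of categories and properties really introduces anomalies for that version of the database.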

> It was a proposal for a definition. English is not my native
> language, and "printing character" means nothing to me. So
> I kindly asked for a definition, and suggested one possibility.
> I would not have guessed that you consider white-space characters
> as "printing", as they don't actually print anything.

Ah.  It's not an ordinary English term.  It's a computer language
one, so I assumed that you would know it.

It is older than C, but C standardised its use to mean any of the
characters which are intended to display (or leave a blank) with
standard, single positioning semantics.  Almost all languages
derived from C use it in the same sense, and Python has a fair
amount of C ancestry.

> The problem is that you did not quite mention a rule, or else
> I missed it.

I did, and you did!  I said that it should be any character with
a defined category that is not 'control'.

> You seem to be asking for being able to express "not a control
> character". I propose that this is best done with UTS#18,
> in which you would write
> 
>   [\P{C}] # or \P{Other}
>
> If this is what you want, I'm all in favor of having it
> implemented.

Excellent!  We are agreed.  Yes, that is equivalent.
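
As a rough illustration of that rule, the sketch below tests a character against \P{C} using the unicodedata module directly, since as far as I know the standard re module does not accept \P{...} property classes. The function names is_printing and printing_only are hypothetical. Note that the 'C' group covers Cc, Cf, Cs, Co and Cn, so unassigned code points are rejected as well, which matches "a defined category that is not 'control'".

```python
import unicodedata


def is_printing(ch):
    """True if the general category is outside the 'Other' group (C*)."""
    return not unicodedata.category(ch).startswith("C")


def printing_only(text):
    """Drop every character that \\P{C} would not match."""
    return "".join(ch for ch in text if is_printing(ch))


if __name__ == "__main__":
    # Space is category Zs, so it counts as a printing character here.
    for ch in ("a", " ", "\n", "\uffff"):
        print(repr(ch), unicodedata.category(ch), is_printing(ch))
```

Under this definition white space such as U+0020 counts as printing, while tab and newline (both Cc) do not.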

I am NOT volunteering to add the support of that to the parser,
especially now I have discovered the format of the intermediate
data :-(  It would be a foul task, and it isn't clear what syntax
to use, anyway.

There is the horrible POSIX syntax, which I blame (perhaps wrongly)
on HP-UX, and the Java one, which I believe is a modified subset
of the example in UTS#18.  But that says:

    All syntax and API presented in this document is only for the
    purpose of illustration; there is absolutely no requirement to
    follow such syntax or API.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  [EMAIL PROTECTED]
Tel.:  +44 1223 334761    Fax:  +44 1223 334679


Re: [Python-Dev] Unicode database

2007-08-09 Thread M.-A. Lemburg
Nick Maclaren wrote:
>> Ah, the makefile. I don't think you use it to create the Unicode database.
>>
>> It's only good for generating the codecs (Lib/encodings)
> 
> Yes, but it DOES attempt to download the mappings, and is the ONLY
> script which attempts to do so.

Of course it does. The Tools/unicode/Makefile is meant to simplify
recreating the codecs from the (possibly updated) mapping on the Unicode
site.

If it doesn't work for you, that may well be possible, since I wrote
the Makefile and the other related stuff in that directory to help me
with updating the codecs from the mappings. It's only checked in for
convenience.

> beelzebub$find Python-2.5.1 -type f | wc
>    3458    3460  135981
> beelzebub$find Python-2.5.1 -type f | xargs grep ftp.unicode.org
> Python-2.5.1/Doc/lib/libunicodedata.tex:4.1.0 which is publicly available 
> from \url{ftp://ftp.unicode.org/}.
> grep: Python-2.5.1/Mac/Icons/Disk: No such file or directory
> grep: Image.icns: No such file or directory
> grep: Python-2.5.1/Mac/Icons/Python: No such file or directory
> grep: Folder.icns: No such file or directory
> Python-2.5.1/Misc/NEWS:  at ftp.unicode.org and contain a few updates (e.g. 
> the Mac OS
> Python-2.5.1/Tools/unicode/Makefile:# files available at 
> ftp://ftp.unicode.org/
> Python-2.5.1/Tools/unicode/Makefile:ncftpget -R ftp.unicode.org . 
> Public/MAPPINGS
> Python-2.5.1/Tools/unicode/gencodec.py:site 
> (ftp://ftp.unicode.org/Public/MAPPINGS/) and creates Python codec
> Python-2.5.1/Tools/unicode/python-mappings/TIS-620.TXT:#   
> ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-11.TXT the
> Python-2.5.1/Tools/unicode/python-mappings/TIS-620.TXT:#   
> ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-11.TXT
> Python-2.5.1/Tools/unicode/python-mappings/KOI8-U.TXT:#   
> ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
> Python-2.5.1/Tools/unicode/python-mappings/CP1140.TXT:#   
> ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP037.TXT
> Python-2.5.1/Modules/unicodedata.c:4.1.0 which is publically available from 
> ftp://ftp.unicode.org/.\n
> 
>> AFAICT, the mappings are still where they always were: at the
>> location given in the Makefile. (e.g.
>> ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-15.TXT
>> )
> 
> Then you DEFINITELY are using a non-standard set of files.  That
> above was from the source of Python 2.5.1 that I have just downloaded.

No idea where you get that impression from, but then I'm not really
sure what you're after anyway ;-)

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Aug 09 2007)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


 Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611


[Python-Dev] Move to a "py3k" branch *DONE*

2007-08-09 Thread Guido van Rossum
Please spread the word. The py3k-struni branch is dead! Don't use it any more.

--Guido

-- Forwarded message --
From: Guido van Rossum <[EMAIL PROTECTED]>
Date: Aug 9, 2007 7:43 AM
Subject: Move to a "py3k" branch *DONE*
To: Python 3000 <[EMAIL PROTECTED]>
Cc: Neal Norwitz <[EMAIL PROTECTED]>


This is done. The new py3k branch is ready for business.

If you currently have the py3k-struni branch checked out (at its top
level), *don't update*, but issue the following commands:

  svn switch svn+ssh://[EMAIL PROTECTED]/python/branches/py3k
  svn update

Only a small amount of activity should result (unless you didn't svn
update for a long time).

For the p3yk branch, the same instructions will work, but the svn
update will update most of your tree. A "make clean" is recommended in
this case.

Left to do:

- update the wikis
- clean out the old branches
- switch the buildbot and the doc builder to use the new branch (Neal)

There are currently about 7 failing unit tests left:

test_bsddb
test_bsddb3
test_email
test_email_codecs
test_email_renamed
test_sqlite
test_urllib2_localnet

See http://wiki.python.org/moin/Py3kStrUniTests for detailed status
regarding these.

--Guido

On 8/9/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> I am starting now. Please, no more checkins to either p3yk or py3k-struni.
>
> On 8/8/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> > I would like to move to a new branch soon for all Py3k development.
> >
> > I plan to name the branch "py3k".  It will be branched from
> > py3k-struni.  I will do one last set of merges from the trunk via p3yk
> > (note typo!) and py3k-struni, and then I will *delete* the old p3yk
> > and py3k-struni branches (you will still be able to access their last
> > known good status by syncing back to a previous revision).  I will
> > temporarily shut up some unit tests to avoid getting endless spam from
> > Neal's buildbot.
> >
> > After the switch, you should be able to switch your workspaces to the
> > new branch using the "svn switch" command.
> >
> > If anyone is in the middle of something that would become painful due
> > to this changeover, let me know ASAP and I'll delay.
> >
> > I will send out another message when I start the move, and another
> > when I finish it.
> >
> > --
> > --Guido van Rossum (home page: http://www.python.org/~guido/)
> >
>
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Unicode database

2007-08-09 Thread Martin v. Löwis
>> Ah, the makefile. I don't think you use it to create the Unicode database.
>>
>> It's only good for generating the codecs (Lib/encodings)
> 
> Yes, but it DOES attempt to download the mappings, and is the ONLY
> script which attempts to do so.

Sure. But (again): you don't need to have the mappings at all for
what you want to achieve. So there is no point in downloading them.

> beelzebub$find Python-2.5.1 -type f | xargs grep ftp.unicode.org
> Python-2.5.1/Doc/lib/libunicodedata.tex:4.1.0 which is publicly available 
> from \url{ftp://ftp.unicode.org/}.
> grep: Python-2.5.1/Mac/Icons/Disk: No such file or directory
> grep: Image.icns: No such file or directory
> grep: Python-2.5.1/Mac/Icons/Python: No such file or directory
> grep: Folder.icns: No such file or directory
> Python-2.5.1/Misc/NEWS:  at ftp.unicode.org and contain a few updates (e.g. 
> the Mac OS
> Python-2.5.1/Tools/unicode/Makefile:# files available at 
> ftp://ftp.unicode.org/
> Python-2.5.1/Tools/unicode/Makefile:ncftpget -R ftp.unicode.org . 
> Public/MAPPINGS
> Python-2.5.1/Tools/unicode/gencodec.py:site 
> (ftp://ftp.unicode.org/Public/MAPPINGS/) and creates Python codec
> Python-2.5.1/Tools/unicode/python-mappings/TIS-620.TXT:#   
> ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-11.TXT the
> Python-2.5.1/Tools/unicode/python-mappings/TIS-620.TXT:#   
> ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-11.TXT
> Python-2.5.1/Tools/unicode/python-mappings/KOI8-U.TXT:#   
> ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
> Python-2.5.1/Tools/unicode/python-mappings/CP1140.TXT:#   
> ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP037.TXT
> Python-2.5.1/Modules/unicodedata.c:4.1.0 which is publically available from 
> ftp://ftp.unicode.org/.\n
> 
>> AFAICT, the mappings are still where they always were: at the
>> location given in the Makefile. (e.g.
>> ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-15.TXT
>> )
> 
> Then you DEFINITELY are using a non-standard set of files.  That
> above was from the source of Python 2.5.1 that I have just downloaded.

I don't understand. Why does this follow? What should I read out
of the grep lines above, and why does my citing of a URL prove
that I did something to my build environment?

Regards,
Martin


Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-09 Thread Martin v. Löwis
Nick Maclaren schrieb:
>> The relevance is that your specification of "printing character"
>> as "isprint returns true" is nearly useless, as it only applies
>> to byte-oriented characters.
> 
> Eh?  That's ALL I used it to specify!  I used a Unicode-based
> specification for Unicode.

Your specification was "For Unicode, whatever people agree!"

I would not call that "Unicode-based".

>> Unicode-isalnum is defined as isalpha|isdecimal|isdigit|isnumeric.
>> isalpha means categories Ll, Lu, Lt, Lo, Lm. isdecimal means
>> character has the decimal property. isdigit means the character has
>> the digit property. isnumeric means the character has the numeric
>> property.
> 
> I sincerely hope it isn't!

Please read the code.

>> It was a proposal for a definition. English is not my native
>> language, and "printing character" means nothing to me. So
>> I kindly asked for a definition, and suggested one possibility.
>> I would not have guessed that you consider white-space characters
>> as "printing", as they don't actually print anything.
> 
> Ah.  It's not an ordinary English term.  It's a computer language
> one, so I assumed that you would know it.

I know the term "printable character", which is what I read
in definitions of the isprint() routine. "printing character"
I never heard before.

Regards,
Martin



Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-09 Thread Greg Ewing
Martin v. Löwis wrote:
> I know the term "printable character", which is what I read
> in definitions of the isprint() routine. "printing character"
> I never heard before.

Hmmm... I guess this means your brain is using a
part-of-speech-sensitive word->technical_meaning
mapping.

Perhaps this will be fixed in English 3.0...

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | Carpe post meridiem!                 |
Christchurch, New Zealand          | (I'm not a morning person.)          |
[EMAIL PROTECTED]                  +--------------------------------------+


Re: [Python-Dev] Regular expressions, Unicode etc.

2007-08-09 Thread James Y Knight
On Aug 8, 2007, at 3:47 PM, Nick Maclaren wrote:
> Firstly, things like backreferences are an absolute no-no.  They
> are not regular, and REs with them in cannot be converted to DFAs.
> That could be 'solved' by a parser that kicked out such constructions,
> but it would get screams from many users.

People keep saying things like this as if GNU grep and tcl's regular  
expression matchers didn't exist.
See http://www.tcl.tk/man/tcl8.5/TclCmd/re_syntax.htm for example.

time python -c 'import re; print re.match("("+"a?"*26+"a"*26+")b\\1", "a"*26+"b"+"a"*26).group(0)'
aabaa

real    0m5.913s
user    0m5.905s
sys     0m0.006s

time echo 'aabaa' | grep -E '(a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aa)b\1'
aabaa

real    0m0.002s
user    0m0.002s
sys     0m0.000s
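
For anyone who wants to repeat the comparison on the Python side, here is a small sketch in Python 3 syntax (the command above is Python 2). The helpers pathological and time_match are names invented for this example, and n is kept small so that it finishes quickly. Absolute timings depend on the machine and the interpreter version, so none are claimed here; the point is only how the elapsed time changes as n increases.

```python
import re
import time


def pathological(n):
    """Build the pattern and subject used above, for a given n."""
    pattern = "(" + "a?" * n + "a" * n + ")b\\1"
    subject = "a" * n + "b" + "a" * n
    return pattern, subject


def time_match(n):
    """Return the matched text and the wall-clock time for one match."""
    pattern, subject = pathological(n)
    start = time.perf_counter()
    match = re.match(pattern, subject)  # includes pattern compilation
    elapsed = time.perf_counter() - start
    return match.group(0), elapsed


if __name__ == "__main__":
    for n in (8, 12, 16):
        matched, elapsed = time_match(n)
        print(n, len(matched), "%.4fs" % elapsed)
```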

James