Re: [Python-Dev] Python 3 optimizations continued...

2011-08-31 Thread Stefan Behnel

stefan brunthaler, 30.08.2011 22:41:

Ok, then there's something else you haven't told us. Are you saying
that the original (old) bytecode is still used (and hence written to
and read from .pyc files)?


Short answer: yes.
Long answer: I added an invocation counter to the code object and keep
interpreting in the usual Python interpreter until this counter
reaches a configurable threshold. When it reaches this threshold, I
create the new instruction format and interpret with this optimized
representation. All the macros look exactly the same in the source
code, they are just redefined to use the different instruction format.
I am at no point serializing this representation or the runtime
information gathered by me, as any subsequent invocation might have
different characteristics.
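The tiering scheme Stefan describes (count invocations in the regular interpreter, switch representations at a threshold, never serialize the optimized form) can be modeled in a few lines of Python. This is an illustrative sketch of the control flow only, not the actual C-level patch; `THRESHOLD` and all names are invented for the example.

```python
# Hypothetical sketch (not the actual C patch): counter-based tier
# switching. The baseline tier runs until a configurable threshold is
# reached; then an "optimized" form is built and used for all later
# calls. Nothing is ever serialized, so a fresh process always starts
# back in the baseline tier.
THRESHOLD = 3  # illustrative value; the real threshold is configurable

def tiered(func):
    count = 0
    optimized = None

    def make_optimized():
        # Stand-in for "create the new instruction format": here we
        # merely build a closure tagged as the optimized tier.
        def optimized_form(*args):
            return ("optimized", func(*args))
        return optimized_form

    def wrapper(*args):
        nonlocal count, optimized
        if optimized is not None:
            return optimized(*args)
        count += 1
        if count >= THRESHOLD:
            optimized = make_optimized()
        return ("baseline", func(*args))

    return wrapper
```

With `THRESHOLD = 3`, the first three calls are answered by the baseline tier (the third one builds the optimized form on the way out), and the fourth call onward is served by the optimized form.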


So, basically, you built a JIT compiler but don't want to call it that, 
right? Just because it compiles byte code to other byte code rather than to 
native CPU instructions does not mean it doesn't compile Just In Time.


That actually sounds like a nice feature in general. It could even replace 
(or accompany?) the existing peephole optimiser as part of a more general 
optimisation architecture, in the sense that it could apply byte code 
optimisations at runtime rather than compile time, potentially based on 
better knowledge about what's actually going on.




I will remove my development commentaries and create a private
repository at bitbucket


I agree with the others that it's best to open up your repository for 
everyone who is interested. I can see no reason why you would want to close 
it back down once it's there.


Stefan

___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Glenn Linderman

On 8/30/2011 11:03 PM, Stephen J. Turnbull wrote:

Guido van Rossum writes:
  >  On Tue, Aug 30, 2011 at 7:55 PM, Stephen J. Turnbull wrote:

  >  >  For starters, one that doesn't ever return lone surrogates, but rather
  >  >  interprets surrogate pairs as Unicode code points as in UTF-16.  (This
  >  >  is not a Unicode standard definition, it's intended to be suggestive
  >  >  of why many app writers will be distressed if they must use Python
  >  >  unicode/str in a narrow build without a fairly comprehensive library
  >  >  that wraps the arrays in operations that treat unicode/str as an array
  >  >  of code points.)
  >
  >  That sounds like a contradiction -- it wouldn't be a UTF-16 array if
  >  you couldn't tell that it was using UTF-16.

Well, that's why I wrote "intended to be suggestive".  The Unicode
Standard does not specify at all what the internal representation of
characters may be, it only specifies what their external behavior must
be when two processes communicate.  (For "process" as used in the
standard, think "Python modules" here, since we are concerned with the
problems of folks who develop in Python.)  When observing the behavior
of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or
even UTF-32 arrays; only arrays of characters.

Thus, according to the rules of handling a UTF-16 stream, it is an
error to observe a lone surrogate or a surrogate pair that isn't a
high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and
C8-C10).  That's what I mean by "can't tell it's UTF-16".  And I
understand those requirements to mean that operations on UTF-16
streams should produce UTF-16 streams, or raise an error.  Without
that closure property for basic operations on str, I think it's a bad
idea to say that the representation of text in a str in a pre-PEP-393
"narrow" build is UTF-16.  For many users and app developers, it
creates expectations that are not fulfilled.
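The conformance requirements Stephen cites (C1, C8-C10) amount to a simple well-formedness check over an array of 16-bit code units. A rough Python illustration of the rules as I read them, not code from any patch:

```python
def is_well_formed_utf16(units):
    """Return True if `units` (a sequence of 16-bit integers) is a
    well-formed UTF-16 code unit sequence: every high surrogate
    (U+D800..U+DBFF) is immediately followed by a low surrogate
    (U+DC00..U+DFFF), and no low surrogate appears on its own."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:      # high surrogate needs a partner
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:    # lone (or leading) low surrogate
            return False
        else:
            i += 1
    return True
```

The closure property discussed above is then the requirement that string operations map inputs satisfying this predicate to outputs that also satisfy it, or raise an error.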

It's true that common usage is that an array of code units that
usually conforms to UTF-16 may be called "UTF-16" without the closure
properties.  I just disagree with that usage, because there are two
camps that interpret "UTF-16" differently.  One side says, "we have an
array representation in UTF-16 that can handle all Unicode code points
efficiently, and if you think you need more, think again", while the
other says "it's too painful to have to check every result for valid
UTF-16, and we need a UTF-16 type that supports the usual array
operations on *characters* via the usual operators; if you think
otherwise, think again."

Note that despite the (presumed) resolution of the UTF-16 issue for
CPython by PEP 393, at some point a very similar discussion will take
place over "characters" anyway, because users and app developers are
going to want a type that handles composition sequences and/or
grapheme clusters for them, as well as comparison that respects
canonical equivalence, even if it is inefficient compared to str.
That's why I insisted on use of "array of code points" to describe the
PEP 393 str type, rather than "array of characters".


On topic:

So from reading all this discussion, I think this point is rather a key 
one... and it has been made repeatedly in different ways:  Arrays are 
not suitable for manipulating Unicode character sequences, and the str 
type is an array with a veneer of text manipulation operations, which do 
not, and cannot, by themselves, efficiently implement Unicode character 
sequences.


Python wants to, should, and can implement UTF-16 streams, UTF-8 
streams, and UTF-32 streams.  It should, and can implement streams using 
other encodings as well, and also binary streams.


Python wants to, should, and can implement 8-bit, 16-bit, 32-bit, and 
64-bit arrays.  These are efficient to access, index, and slice.


Python implements a veneer on some 8-bit, 16-bit, and 32-bit arrays 
called str (this will be more true post-PEP 393, although it is true 
with caveats presently), which interpret array elements as code units 
(currently) or codepoints (post-PEP), and implements operations that are 
interesting for text processing, with caveats.


There is presently no support for arrays of Unicode grapheme clusters or 
composed characters.  The Python type called str may or may not be 
properly documented (to the extent that there is confusion between the 
actual contents of the elements of the type, and the concept of 
character as defined by Unicode).  From comments Guido has made, he is 
not interested in changing the efficiency or access methods of the str 
type to raise the level of support of Unicode to the composed character, 
or grapheme cluster concepts.  The str type itself can presently be used 
to process other character encodings: if they are fixed width < 32-bit 
elements those encodings might be considered Unicode encodings, but 
there is no requirement that they are, and some operations on str may 
operate with knowledge of some Unicode semantics, so there are caveats.

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Stephen J. Turnbull
Glenn Linderman writes:

 > From comments Guido has made, he is not interested in changing the
 > efficiency or access methods of the str type to raise the level of
 > support of Unicode to the composed character, or grapheme cluster
 > concepts.

IMO, that would be a bad idea, as higher-level Unicode support should
either be a wrapper around full implementations such as ICU (or
platform support in .NET or Java), or written in pure Python at first.
Thus there is a need for an efficient array of code units type.  PEP
393 allows this to go to the level of code points, but evidently that
is inappropriate for Jython and IronPython.

 > The str type itself can presently be used to process other
 > character encodings:

Not really.  Remember, on input codecs always decode to Unicode and on
output they always encode from Unicode.  How do you propose to get
other encodings into the array of code units?

 > [A "true Unicode" type] could be based on extensions to the
 > existing str type, or it could be based on the array type, or it
 > could based on the bytes type.  It could use an internal format of
 > 32-bit codepoints, PEP 393 variable-size codepoints, or 8- or
 > 16-bit codeunits.

In theory yes, but in practice all of the string methods and libraries
like re operate on str (and often but not always bytes; in particular,
codecs always decode from byte and encode to bytes).

Why bother with anything except arrays of code points at the start?
PEP 393 makes that time-efficient and reasonably space-efficient as a
starting point and allows starting with re or MRAB's regex to get
basic RE functionality or good UTS #18 functionality respectively.
Plus str already has all the usual string operations (.startswith(),
.join(), etc), and we have modules for dealing with the Unicode
Character Database.  Why waste effort reintegrating with all that,
until we have common use cases that need more efficient representation?

There would be some issue in coming up with an appropriate UTF-16 to
code point API for Jython and IronPython, but Terry Reedy has a rather
efficient library for that already.

So this discussion of alternative representations, including use of
high bits to represent properties, is premature optimization
... especially since we don't even have a proto-PEP specifying how
much conformance we want of this new "true Unicode" type in the first
place.

We need to focus on that before optimizing anything.


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-31 Thread stefan brunthaler
> I think that you must deal with big endianness because some RISCs can't
> handle data in little-endian format at all.
>
> In WPython I wrote some macros which handle both endiannesses, but lacking
> big-endian machines I never had the opportunity to verify whether something
> was wrong.
>
I'm sorry for the lapse in not getting back to this yesterday; we
were just heading out for lunch when I remembered it, and by the time
we were back at the lab I had forgotten it again...

So, as I have already said, I evaluated my optimizations on x86
(little-endian) and PowerPC 970 (big-endian) and I did not have to
change any of my instruction decoding during interpretation. (The only
nasty bug I still remember vividly was that the data type char
defaults to signed on gcc for x86, whereas it defaults to unsigned on
PowerPC's gcc.) When I have time and access to a PowerPC machine again
(an ARM might be interesting, too), I will take a look at the
generated assembly code to figure out why this is working. (I have
some ideas why it might work without changing the code.)
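For what it's worth, one plausible explanation for why the decoding works unchanged on both architectures is that the interpreter's argument fetch reads individual bytes rather than doing a 16-bit load, and byte-addressed reads are endian-neutral. A small illustration with `struct` (my guess at the reason, not Stefan's confirmed analysis):

```python
import struct

# A three-byte instruction in the pre-3.6 CPython layout: one opcode
# byte followed by a 16-bit argument stored low byte first.
raw = bytes([0x64, 0x01, 0x00])   # opcode 0x64 with oparg 1

# Byte-addressed fetch, as the interpreter's argument macro does it:
# individual byte reads never depend on the host's endianness.
oparg_portable = raw[1] | (raw[2] << 8)

# Reinterpreting the same two bytes as a native 16-bit integer (what a
# C-level pointer cast would do) is endian-sensitive:
as_little = struct.unpack("<H", raw[1:3])[0]  # what a cast yields on x86: 1
as_big = struct.unpack(">H", raw[1:3])[0]     # what it yields on PowerPC: 256
```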

If I run into any problems, I'll gladly contact you :)

BTW: AFAIR, we emailed last year regarding wpython and IIRC your
optimizations could primarily be summarized as clever
superinstructions. I have not implemented anything in that area at all
(and have in fact not even touched the compiler and its peephole
optimizer), but if parts of my implementation get in, I am sure that you
could add some of your work on top of that, too.

Cheers,
--stefan


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-31 Thread stefan brunthaler
> So, basically, you built a JIT compiler but don't want to call it that,
> right? Just because it compiles byte code to other byte code rather than to
> native CPU instructions does not mean it doesn't compile Just In Time.
>
For me, a definition of a JIT compiler or any dynamic compilation
subsystem entails that native machine code is generated at run-time.
Furthermore, I am not compiling from bytecode to bytecode, but rather
changing the instruction encoding underneath and subsequently using
quickening to optimize interpretation. But, OTOH, I am not aware of a
canonical definition of JIT compilation, so it depends ;)


> I agree with the others that it's best to open up your repository for
> everyone who is interested. I can see no reason why you would want to close
> it back down once it's there.
>
Well, my code has primarily been a vehicle for my research in that
area and thus is not immediately suited to adoption (it does not
adhere to Python C coding standards, contains lots of private comments
about various facts, debugging hints, etc.). The explanation for this
is easy: When I started out on my research it was far from clear that
it would be successful and really that much faster. So, I would like
to clean up the private comments and some parts of the code, and
publish what I have without doing the renaming work for coding
conventions, etc., so that you can all take a look and it is clear
what it's all about. After that we can have a factual discussion about
whether it fits the bill for you, too, and if so, which changes
(naming conventions, extensive documentation, etc.) are necessary
*before* any adoption is reasonable.

That seems to be a good way to start off and get results and feedback
quickly, any ideas/complaints/comments/suggestions?

Best regards,
--stefan

PS: I am using Nick's suggested plan to incorporate my changes
directly to the most recent version, as mine is currently only running
on Python 3.1.


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Guido van Rossum
On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull wrote:
[me]
>  > That sounds like a contradiction -- it wouldn't be a UTF-16 array if
>  > you couldn't tell that it was using UTF-16.
>
> Well, that's why I wrote "intended to be suggestive".  The Unicode
> Standard does not specify at all what the internal representation of
> characters may be, it only specifies what their external behavior must
> be when two processes communicate.  (For "process" as used in the
> standard, think "Python modules" here, since we are concerned with the
> problems of folks who develop in Python.)  When observing the behavior
> of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or
> even UTF-32 arrays; only arrays of characters.

Hm, that's not how I would read "process". IMO that is an
intentionally vague term, and we are free to decide how to interpret
it. I don't think it will work very well to define a process as a
Python module; what about Python modules that agree about passing
along arrays of code units (or streams of UTF-8, for that matter)?

This is why I find the issue of Python, the language (and stdlib), as
a whole "conforming to the Unicode standard" such a troublesome
concept -- I think it is something that an application may claim, but
the language should make much more modest claims, such as "the regular
expression syntax supports features X, Y and Z from the Unicode
recommendation XXX", or "the UTF-8 codec will never emit a sequence of
bytes that is invalid according to Unicode specification YYY". (As long
as the Unicode references are also versioned or dated.)

I'm fine with saying "it is hard to write Unicode-conforming
application code for reason ZZZ" and proposing a fix (e.g. PEP 393
fixes a specific complaint about code units being inferior to code
points for most types of processing). I'm not fine with saying "the
string datatype should conform to the Unicode standard".

> Thus, according to the rules of handling a UTF-16 stream, it is an
> error to observe a lone surrogate or a surrogate pair that isn't a
> high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and
> C8-C10).  That's what I mean by "can't tell it's UTF-16".

But if you can observe (valid) surrogate pairs it is still UTF-16.

> And I
> understand those requirements to mean that operations on UTF-16
> streams should produce UTF-16 streams, or raise an error.  Without
> that closure property for basic operations on str, I think it's a bad
> idea to say that the representation of text in a str in a pre-PEP-393
> "narrow" build is UTF-16.  For many users and app developers, it
> creates expectations that are not fulfilled.

Ok, I dig this, to some extent. However saying it is UCS-2 is equally
bad. I guess this is why Java and .NET just say their string types
contain arrays of "16-bit characters", with essentially no semantics
attached to the word "character" besides "16-bit unsigned integer".

At the same time I think it would be useful if certain string
operations like .lower() worked in such a way that *if* the input were
valid UTF-16, *then* the output would also be, while *if* the input
contained an invalid surrogate, the result would simply be something
that is no worse (in particular, those are all mapped to themselves).
We could even go further and have .lower() and friends look at
graphemes (multi-code-point characters) if the Unicode std has a
useful definition of e.g. lowercasing graphemes that differed from
lowercasing code points.

An analogy is actually found in .lower() on 8-bit strings in Python 2:
it assumes the string contains ASCII, and non-ASCII characters are
mapped to themselves. If your string contains Latin-1 or EBCDIC or
UTF-8 it will not do the right thing. But that doesn't mean strings
cannot contain those encodings, it just means that the .lower() method
is not useful if they do. (Why ASCII? Because that is the system
encoding in Python 2.)
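Python 3's bytes type behaves the same way as Python 2's 8-bit str here, so the analogy can be demonstrated directly:

```python
# bytes.lower() in Python 3 behaves like str.lower() on Python 2's
# 8-bit strings: only ASCII 'A'-'Z' are changed, and every other byte
# maps to itself.
data = b"AB\xc9z"                     # 0xC9 is 'É' in Latin-1
assert data.lower() == b"ab\xc9z"     # ASCII lowered, 0xC9 untouched

# The same bytes *interpreted as Latin-1 text* do lowercase fully:
assert data.decode("latin-1").lower() == "ab\xe9z"
```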

> It's true that common usage is that an array of code units that
> usually conforms to UTF-16 may be called "UTF-16" without the closure
> properties.  I just disagree with that usage, because there are two
> camps that interpret "UTF-16" differently.  One side says, "we have an
> array representation in UTF-16 that can handle all Unicode code points
> efficiently, and if you think you need more, think again", while the
> other says "it's too painful to have to check every result for valid
> UTF-16, and we need a UTF-16 type that supports the usual array
> operations on *characters* via the usual operators; if you think
> otherwise, think again."

I think we should just document how it behaves and not get hung up on
what it is called. Mentioning UTF-16 is still useful because it
indicates that some operations may act properly on surrogate pairs.
(Also because of course character properties for BMP characters are
respected, etc.)

> Note that despite the (presumed) resolution of the UTF-16 issue for
> CPython by PEP 393, at some point a very similar discussion will take
> place over "characters" anyway.

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Guido van Rossum
On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman  wrote:
> So from reading all this discussion, I think this point is rather a key
> one... and it has been made repeatedly in different ways:  Arrays are not
> suitable for manipulating Unicode character sequences, and the str type is
> an array with a veneer of text manipulation operations, which do not, and
> cannot, by themselves, efficiently implement Unicode character sequences.

I think this is too strong. The str type is indeed an array, and you
can build useful Unicode manipulation APIs on top of it. Just like
bytes are not UTF-8, but can be used to represent UTF-8 and a
fully-compliant UTF-8 codec can be implemented on top of it.
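Concretely, in today's Python the strict UTF-8 view already sits on top of the plain byte array:

```python
# The bytes type knows nothing about UTF-8, but the strict UTF-8 codec
# layered on top of it enforces validity at the boundary.
payload = "na\u00efve".encode("utf-8")        # 5 characters, 6 bytes
assert payload == b"na\xc3\xafve"
assert payload.decode("utf-8") == "na\u00efve"

# Invalid sequences are rejected by the codec, not by the array type:
try:
    b"\xc3".decode("utf-8")                   # truncated 2-byte sequence
except UnicodeDecodeError:
    pass
else:
    raise AssertionError("a lone lead byte must not decode")
```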

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Guido van Rossum
On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman  wrote:
> The str type itself can presently be used to process other
> character encodings: if they are fixed width < 32-bit elements those
> encodings might be considered Unicode encodings, but there is no requirement
> that they are, and some operations on str may operate with knowledge of some
> Unicode semantics, so there are caveats.

Actually, the str type in Python 3 and the unicode type in Python 2
are constrained everywhere to either 16-bit or 21-bit "characters".
(Except when writing C code, which can do any number of invalid things
so is the equivalent of assuming 1 == 0.) In particular, on a wide
build, there is no way to get a code point >= 2**21, and I don't want
PEP 393 to change this. So at best we can use these types to represent
arrays of 21-bit unsigned ints. But I think it is more useful to think
of them as always representing "some form of Unicode", whether that is
UTF-16 (on narrow builds) or 21-bit code points or perhaps some
vaguely similar superset -- but for those code units/code points that
are representable *and* valid (either code points or code units)
according to (the supported version of) the Unicode standard, the
meaning of those code points/units matches that of the standard.

Note that this is different from the bytes type, where the meaning of
a byte is entirely determined by what it means in the programmer's
head.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-31 Thread Guido van Rossum
On Tue, Aug 30, 2011 at 10:04 PM, Cesare Di Mauro wrote:
> It isn't, because motivation to do something new with CPython vanishes, at
> least on some areas (virtual machine / ceval.c), even having some ideas to
> experiment with. That's why in my last talk on EuroPython I decided to move
> on other areas (Python objects).

Cesare, I'm really sorry that you became so disillusioned that you
abandoned wordcode. I agree that we were too optimistic about Unladen
Swallow. Also that the existence of PyPy and its PR machine (:-)
should not stop us from improving CPython.

I'm wondering if, with your experience in creating WPython, you could
review Stefan Brunthaler's code and approach (once he's put it up for
review) and possibly the two of you could even work on a joint
project?

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-31 Thread Guido van Rossum
On Wed, Aug 31, 2011 at 10:08 AM, stefan brunthaler wrote:
> Well, my code has primarily been a vehicle for my research in that
> area and thus is not immediately suited to adoption [...].

But if you want to be taken seriously as a researcher, you should
publish your code! Without publication of your *code* research in your
area cannot be reproduced by others, so it is not science. Please stop
being shy and open up what you have. The software engineering issues
can be dealt with separately!

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Glenn Linderman

On 8/31/2011 10:12 AM, Guido van Rossum wrote:

On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman  wrote:

So from reading all this discussion, I think this point is rather a key
one... and it has been made repeatedly in different ways:  Arrays are not
suitable for manipulating Unicode character sequences, and the str type is
an array with a veneer of text manipulation operations, which do not, and
cannot, by themselves, efficiently implement Unicode character sequences.

I think this is too strong. The str type is indeed an array, and you
can build useful Unicode manipulation APIs on top of it. Just like
bytes are not UTF-8, but can be used to represent UTF-8 and a
fully-compliant UTF-8 codec can be implemented on top of it.



This statement is a logical conclusion of arguments presented in this 
thread.


1) Applications that wish to do grapheme access, wish to do it by 
grapheme array indexing, because that is the efficient way to do it.


2) As long as str is restricted to holding Unicode code units or code 
points, then it cannot support grapheme array indexing efficiently.


I have not declared that useful Unicode manipulation APIs cannot be 
built on top of str, only that efficiency will suffer.


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Guido van Rossum
On Wed, Aug 31, 2011 at 11:51 AM, Glenn Linderman wrote:

>  On 8/31/2011 10:12 AM, Guido van Rossum wrote:
>
> On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman  
>  wrote:
>
>  So from reading all this discussion, I think this point is rather a key
> one... and it has been made repeatedly in different ways:  Arrays are not
> suitable for manipulating Unicode character sequences, and the str type is
> an array with a veneer of text manipulation operations, which do not, and
> cannot, by themselves, efficiently implement Unicode character sequences.
>
>  I think this is too strong. The str type is indeed an array, and you
> can build useful Unicode manipulation APIs on top of it. Just like
> bytes are not UTF-8, but can be used to represent UTF-8 and a
> fully-compliant UTF-8 codec can be implemented on top of it.
>
>
>
> This statement is a logical conclusion of arguments presented in this
> thread.
>
> 1) Applications that wish to do grapheme access, wish to do it by grapheme
> array indexing, because that is the efficient way to do it.
>

I don't believe that should be taken as gospel. In Perl, they don't do array
indexing on strings at all, and use regex matching instead. An API that uses
some kind of cursor on a string might work fine in Python too (for grapheme
matching).
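To make the cursor idea concrete, here is one rough shape such an API could take. It groups a base code point with the combining marks that follow it, and nothing more; real grapheme segmentation (UAX #29) has many additional rules (Hangul jamo, ZWJ sequences, regional indicators), so treat this purely as an API sketch:

```python
import unicodedata

def grapheme_cursor(s):
    """Iterate over *approximate* grapheme clusters: each base code
    point is grouped with the combining marks that follow it. Full
    UAX #29 segmentation has many more rules; this only sketches the
    cursor-style API, avoiding any claim of O(1) indexing."""
    i, n = 0, len(s)
    while i < n:
        j = i + 1
        while j < n and unicodedata.combining(s[j]):
            j += 1
        yield s[i:j]
        i = j
```

For example, `list(grapheme_cursor("e\u0301a"))` groups the 'e' with the combining acute accent, yielding two clusters rather than three code points.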

2) As long as str is restricted to holding Unicode code units or code
> points, then it cannot support grapheme array indexing efficiently.
>
> I have not declared that useful Unicode manipulation APIs cannot be built
> on top of str, only that efficiency will suffer.
>

But you have not proven it.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Glenn Linderman

On 8/31/2011 11:56 AM, Guido van Rossum wrote:
On Wed, Aug 31, 2011 at 11:51 AM, Glenn Linderman wrote:


On 8/31/2011 10:12 AM, Guido van Rossum wrote:

On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman  
  wrote:

So from reading all this discussion, I think this point is rather a key
one... and it has been made repeatedly in different ways:  Arrays are not
suitable for manipulating Unicode character sequences, and the str type is
an array with a veneer of text manipulation operations, which do not, and
cannot, by themselves, efficiently implement Unicode character sequences.

I think this is too strong. The str type is indeed an array, and you
can build useful Unicode manipulation APIs on top of it. Just like
bytes are not UTF-8, but can be used to represent UTF-8 and a
fully-compliant UTF-8 codec can be implemented on top of it.



This statement is a logical conclusion of arguments presented in
this thread.

1) Applications that wish to do grapheme access, wish to do it by
grapheme array indexing, because that is the efficient way to do it.


I don't believe that should be taken as gospel. In Perl, they don't do 
array indexing on strings at all, and use regex matching instead. An 
API that uses some kind of cursor on a string might work fine in 
Python too (for grapheme matching).


The last benchmark I saw showed regexps in Perl being faster than 
regexps in Python; that was some years back, before regexps in Perl 
supported quite as much Unicode as they do now. I'm not sure whether 
anyone has done recent performance benchmarks; Tom's survey indicates 
that the functionality presently differs, so it is not clear that 
benchmarks measuring Unicode operations in regexps between the two 
languages would presently be meaningful.


That said, regexp, or some sort of cursor on a string, might be a 
workable solution.  Will it have adequate performance?  Perhaps, at 
least for some applications.  Will it be as conceptually simple as 
indexing an array of graphemes?  No.  Will it ever reach the efficiency 
of indexing an array of graphemes? No.  Does that matter? Depends on the 
application.




2) As long as str is restricted to holding Unicode code units or
code points, then it cannot support grapheme array indexing
efficiently.

I have not declared that useful Unicode manipulation APIs cannot
be built on top of str, only that efficiency will suffer.


But you have not proven it.


Do you disagree that indexing an array is more efficient than 
manipulating strings with regex or binary trees?  I think not, because 
you are insistent that array indexing of str be preserved as O(1).  I 
agree that I have not proven it; it largely depends on whether or not 
indexing by grapheme cluster is a useful operation in applications.  Yet 
Stephen (I think) has commented that emacs performance goes down as soon 
as multi-byte characters are introduced into an edit buffer.  So I think 
he has proven that efficiency can suffer, in some 
implementations/applications.  Terry's O(k) implementation requires data 
beyond strings, and isn't O(1).


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Glenn Linderman

On 8/31/2011 10:20 AM, Guido van Rossum wrote:

On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman  wrote:

The str type itself can presently be used to process other
character encodings: if they are fixed width < 32-bit elements those
encodings might be considered Unicode encodings, but there is no requirement
that they are, and some operations on str may operate with knowledge of some
Unicode semantics, so there are caveats.

Actually, the str type in Python 3 and the unicode type in Python 2
are constrained everywhere to either 16-bit or 21-bit "characters".
(Except when writing C code, which can do any number of invalid things
so is the equivalent of assuming 1 == 0.) In particular, on a wide
build, there is no way to get a code point >= 2**21, and I don't want
PEP 393 to change this. So at best we can use these types to represent
arrays of 21-bit unsigned ints. But I think it is more useful to think
of them as always representing "some form of Unicode", whether that is
UTF-16 (on narrow builds) or 21-bit code points or perhaps some
vaguely similar superset -- but for those code units/code points that
are representable *and* valid (either code points or code units)
according to (the supported version of) the Unicode standard, the
meaning of those code points/units matches that of the standard.

Note that this is different from the bytes type, where the meaning of
a byte is entirely determined by what it means in the programmer's
head.



Sorry, my Perl background is leaking through.  I didn't double check 
that str constrains the values of each element to the range 0 to 
0x10FFFF, but I see now by testing that it does.  For some of my 
ideas, then, either a subtype of str would have to be able to relax 
that constraint, or str would not be the appropriate base type to use 
(but there are other base types that could be used, so this is not a 
serious issue for the ideas).


I have no problem with thinking of str as representing "some form of 
Unicode".  None of my proposals change that, although they may change 
other things, and may invent new forms of Unicode representations. You 
have stated that it is better to document what str actually does, rather 
than attempt to adhere slavishly to Unicode standard concepts.  The 
Unicode Consortium may well define legal, conforming bytestreams for 
communicating processes, but languages and applications are free to use 
other representations internally.  We can either artificially constrain 
ourselves to minor tweaks of the legal conforming bytestreams, or we can 
invent a representation (whether called str or something else) that is 
useful and efficient in practice.


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Glenn Linderman

On 8/31/2011 5:21 AM, Stephen J. Turnbull wrote:

Glenn Linderman writes:

  >   From comments Guido has made, he is not interested in changing the
  >  efficiency or access methods of the str type to raise the level of
  >  support of Unicode to the composed character, or grapheme cluster
  >  concepts.

IMO, that would be a bad idea,


OK you agree with Guido.


as higher-level Unicode support should
either be a wrapper around full implementations such as ICU (or
platform support in .NET or Java), or written in pure Python at first.
Thus there is a need for an efficient array of code units type.  PEP
393 allows this to go to the level of code points, but evidently that
is inappropriate for Jython and IronPython.

  >  The str type itself can presently be used to process other
  >  character encodings:

Not really.  Remember, on input codecs always decode to Unicode and on
output they always encode from Unicode.  How do you propose to get
other encodings into the array of code units?


Here are two ways, there may be more: custom codecs, direct assignment
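For the custom-codec route, one hedged sketch (the codec name "passthrough" 
and both helper functions are invented for illustration, not part of any 
stdlib codec) is to register a codec that moves byte values into code units 
unchanged:

```python
import codecs

def passthrough_decode(data, errors="strict"):
    # Map each byte directly to a code unit, so an arbitrary
    # (non-Unicode) encoding's units land in a str unchanged.
    return "".join(map(chr, data)), len(data)

def passthrough_encode(text, errors="strict"):
    # Inverse: truncate each code unit back to a single byte.
    return bytes(ord(c) & 0xFF for c in text), len(text)

def _lookup(name):
    if name == "passthrough":
        return codecs.CodecInfo(passthrough_encode, passthrough_decode,
                                name="passthrough")
    return None

codecs.register(_lookup)

# A str of length 4 whose elements are the raw units 0x80, 'a', 'b', 'c'
s = b"\x80abc".decode("passthrough")
```

Whether the resulting str "means" anything Unicode is then entirely up to the 
program, which is exactly the caveat discussed above.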


  >  [A "true Unicode" type] could be based on extensions to the
  >  existing str type, or it could be based on the array type, or it
  >  could based on the bytes type.  It could use an internal format of
  >  32-bit codepoints, PEP 393 variable-size codepoints, or 8- or
  >  16-bit codeunits.

In theory yes, but in practice all of the string methods and libraries
like re operate on str (and often but not always bytes; in particular,
codecs always decode from byte and encode to bytes).

Why bother with anything except arrays of code points at the start?
PEP 393 makes that time-efficient and reasonably space-efficient as a
starting point and allows starting with re or MRAB's regex to get
basic RE functionality or good UTS #18 functionality respectively.
Plus str already has all the usual string operations (.startswith(),
.join(), etc), and we have modules for dealing with the Unicode
Character Database.  Why waste effort reintegrating with all that,
until we have common use cases that need more efficient representation?


String methods could be reimplemented on any appropriate type, of 
course.  Rejecting alternatives too soon might make one miss the best 
design.



There would be some issue in coming up with an appropriate UTF-16 to
code point API for Jython and IronPython, but Terry Reedy has a rather
efficient library for that already.


Yes, Terry's implementation is interesting, and inspiring, and that 
concept could be extended to a variety of interesting techniques: 
codepoint access of code unit representations, and multi-codepoint 
character access on top of either code unit or codepoint representations.
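As a rough sketch of that kind of technique (class and method names are 
invented here; Terry's actual library differs), a bisect-based side table 
over an array of UTF-16 code units gives O(log k) codepoint access, where k 
is the number of surrogate pairs:

```python
from bisect import bisect_left

class CodePointIndex:
    """Index an array of UTF-16 code units by code point (sketch)."""

    def __init__(self, units):
        self.units = list(units)
        leads = [off for off, u in enumerate(self.units)
                 if 0xD800 <= u <= 0xDBFF]
        # code point index at which the n-th surrogate pair starts
        self._pair_cp = [off - n for n, off in enumerate(leads)]

    def __len__(self):
        return len(self.units) - len(self._pair_cp)

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        # shift by the number of pairs that start before code point i
        off = i + bisect_left(self._pair_cp, i)
        u = self.units[off]
        if 0xD800 <= u <= 0xDBFF:
            lo = self.units[off + 1]
            return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)
        return u
```

For example, the units for "A", U+10437, "B" are [0x0041, 0xD801, 0xDC37, 
0x0042]: the index reports a length of 3, and item 1 is 0x10437.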



So this discussion of alternative representations, including use of
high bits to represent properties, is premature optimization
... especially since we don't even have a proto-PEP specifying how
much conformance we want of this new "true Unicode" type in the first
place.

We need to focus on that before optimizing anything.


You may call it premature optimization if you like, or you can ignore 
the concepts and emails altogether.  I call it brainstorming for ideas, 
looking for non-obvious solutions to the problem of representation of 
Unicode.


I found your discussion of streams versus arrays, as separate concepts 
related to Unicode, along with Terry's bisect indexing implementation, 
to be rather inspiring.  Just because Unicode defines streams of codeunits 
of various sizes (UTF-8, UTF-16, UTF-32) to represent characters when 
processes communicate and for storage (which is one way processes 
communicate), that doesn't imply that the internal representation of 
character strings in a programming language must use exactly that 
representation.  While there are efficiencies in using the same 
representation as is used by the communications streams, there are also 
inefficiencies.  I'm unaware of any current Python implementation that 
has chosen to use UTF-8 as the internal representation of character 
strings (though I am aware Perl has made that choice), yet UTF-8 is one of 
the commonly recommended character representations on the Linux platform, 
from what I read.  So in that sense, Python has rejected the idea of 
using the "native" or "OS configured" representation as its internal 
representation.  So why, then, must one choose from a repertoire of 
Unicode-defined stream representations if they don't meet the goal of 
efficient length, indexing, or slicing operations on actual characters?


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Glenn Linderman

On 8/31/2011 10:10 AM, Guido van Rossum wrote:

On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull wrote:
[me]

  >  That sounds like a contradiction -- it wouldn't be a UTF-16 array if
  >  you couldn't tell that it was using UTF-16.

Well, that's why I wrote "intended to be suggestive".  The Unicode
Standard does not specify at all what the internal representation of
characters may be, it only specifies what their external behavior must
be when two processes communicate.  (For "process" as used in the
standard, think "Python modules" here, since we are concerned with the
problems of folks who develop in Python.)  When observing the behavior
of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or
even UTF-32 arrays; only arrays of characters.

Hm, that's not how I would read "process". IMO that is an
intentionally vague term, and we are free to decide how to interpret
it. I don't think it will work very well to define a process as a
Python module; what about Python modules that agree about passing
along array of code units (or streams of UTF-8, for that matter)?

This is why I find the issue of Python, the language (and stdlib), as
a whole "conforming to the Unicode standard" such a troublesome
concept -- I think it is something that an application may claim, but
the language should make much more modest claims, such as "the regular
expression syntax supports features X, Y and Z from the Unicode
recommendation XXX", or "the UTF-8 codec will never emit a sequence of
bytes that is invalid according Unicode specification YYY". (As long
as the Unicode references are also versioned or dated.)

I'm fine with saying "it is hard to write Unicode-conforming
application code for reason ZZZ" and proposing a fix (e.g. PEP 393
fixes a specific complaint about code units being inferior to code
points for most types of processing). I'm not fine with saying "the
string datatype should conform to the Unicode standard".


Thus, according to the rules of handling a UTF-16 stream, it is an
error to observe a lone surrogate or a surrogate pair that isn't a
high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and
C8-C10).  That's what I mean by "can't tell it's UTF-16".

But if you can observe (valid) surrogate pairs it is still UTF-16.


And I
understand those requirements to mean that operations on UTF-16
streams should produce UTF-16 streams, or raise an error.  Without
that closure property for basic operations on str, I think it's a bad
idea to say that the representation of text in a str in a pre-PEP-393
"narrow" build is UTF-16.  For many users and app developers, it
creates expectations that are not fulfilled.

Ok, I dig this, to some extent. However saying it is UCS-2 is equally
bad. I guess this is why Java and .NET just say their string types
contain arrays of "16-bit characters", with essentially no semantics
attached to the word "character" besides "16-bit unsigned integer".

At the same time I think it would be useful if certain string
operations like .lower() worked in such a way that *if* the input were
valid UTF-16, *then* the output would also be, while *if* the input
contained an invalid surrogate, the result would simply be something
that is no worse (in particular, those are all mapped to themselves).
We could even go further and have .lower() and friends look at
graphemes (multi-code-point characters) if the Unicode std has a
useful definition of e.g. lowercasing graphemes that differed from
lowercasing code points.

An analogy is actually found in .lower() on 8-bit strings in Python 2:
it assumes the string contains ASCII, and non-ASCII characters are
mapped to themselves. If your string contains Latin-1 or EBCDIC or
UTF-8 it will not do the right thing. But that doesn't mean strings
cannot contain those encodings, it just means that the .lower() method
is not useful if they do. (Why ASCII? Because that is the system
encoding in Python 2.)


So if Python 3.3+ uses Unicode codepoints as its str representation, the 
analogy to ASCII and Python 2 would imply that it should permit 
out-of-range codepoints, if they can be represented in the underlying 
data values.  Valid codecs would not create such on input, and Valid 
codecs would not accept such on output.  Operations on codepoints 
should, like .lower(), use the identity operation when applied to 
non-codepoints.
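A pure-Python sketch of such garbage-in-garbage-out lowercasing (the helper 
name is invented; this illustrates the principle, not CPython's actual 
implementation):

```python
def lenient_lower(s):
    # Lowercase code point by code point; anything without a defined
    # lowercase mapping -- here, lone surrogates -- maps to itself,
    # so valid input stays valid and invalid input gets no worse.
    out = []
    for ch in s:
        if 0xD800 <= ord(ch) <= 0xDFFF:   # surrogate: identity
            out.append(ch)
        else:
            out.append(ch.lower())
    return "".join(out)
```

Under this rule, lowercasing a string that happens to contain a lone 
surrogate neither raises nor corrupts the surrounding text.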





It's true that common usage is that an array of code units that
usually conforms to UTF-16 may be called "UTF-16" without the closure
properties.  I just disagree with that usage, because there are two
camps that interpret "UTF-16" differently.  One side says, "we have an
array representation in UTF-16 that can handle all Unicode code points
efficiently, and if you think you need more, think again", while the
other says "it's too painful to have to check every result for valid
UTF-16, and we need a UTF-16 type that supports the usual array
operations on *characters* via the usual operators; if you think
otherwise ...

Re: [Python-Dev] Python 3 optimizations continued...

2011-08-31 Thread Cesare Di Mauro
2011/8/31 stefan brunthaler 

> > I think that you must deal with big endianness because some RISC CPUs
> > can't handle data in little endian format at all.
> >
> > In WPython I wrote some macros which handle both endiannesses, but
> > lacking big endian machines I never had the opportunity to verify
> > whether something was wrong.
> >
> I am sorry for the temporal lapse of not getting back to this directly
> yesterday, we were just heading out for lunch and I figured it only
> out then but immediately forgot it on our way back to the lab...
>
> So, as I have already said, I evaluated my optimizations on x86
> (little-endian) and PowerPC 970 (big-endian) and I did not have to
> change any of my instruction decoding during interpretation. (The only
> nasty bug I still remember vividly was that while on gcc for x86 the
> data type char defaults to signed, whereas it defaults to unsigned on
> PowerPC's gcc.) When I have time and access to a PowerPC machine again
> (an ARM might be interesting, too), I will take a look at the
> generated assembly code to figure out why this is working. (I have
> some ideas why it might work without changing the code.)
>
> If I run into any problems, I'll gladly contact you :)
>
> BTW: AFAIR, we emailed last year regarding wpython and IIRC your
> optimizations could primarily be summarized as clever
> superinstructions. I have not implemented anything in that area at all
> (and have in fact not even touched the compiler and its peephole
> optimizer), but if parts my implementation gets in, I am sure that you
> could add some of your work on top of that, too.
>
>  Cheers,
> --stefan
>

You're right. I took a look at our old e-mails, and I found more details
about your work. It's definitely not affected by processor endianness, so you
don't need any check: it just works, because you'll produce the new opcodes
in memory, and consume them in memory as well.

Looking at your examples, I think that WPython wordcodes usage can be useful
only for the most simple ones. That's because superinstructions group
together several actions that need to be split again into simpler ones by a
tracing-JIT/compiler like yours, if you want to keep it simple. You said that
you added about 400 specialized instructions last year with the usual
bytecodes, but wordcodes will require quite more (this can compromise
performance on CPUs with small data caches).

So I think that it'll be better to finish your work, with all tests passed,
before thinking about adding something on top (that, for me, sounds like a
machine code JIT O:-)

Regards,
Cesare


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-31 Thread Cesare Di Mauro
2011/8/31 Guido van Rossum 

> On Tue, Aug 30, 2011 at 10:04 PM, Cesare Di Mauro
>  wrote:
> > It isn't, because motivation to do something new with CPython vanishes,
> at
> > least on some areas (virtual machine / ceval.c), even having some ideas
> to
> > experiment with. That's why in my last talk on EuroPython I decided to
> move
> > on other areas (Python objects).
>
> Cesare, I'm really sorry that you became so disillusioned that you
> abandoned wordcode. I agree that we were too optimistic about Unladen
> Swallow. Also that the existence of PyPy and its PR machine (:-)
> should not stop us from improving CPython.
>

I never stopped thinking about new optimization. A lot can be made on
CPython, even without resorting to something like JIT et all.

>
> I'm wondering if, with your experience in creating WPython, you could
> review Stefan Brunthaler's code and approach (once he's put it up for
> review) and possibly the two of you could even work on a joint
> project?
>
> --
> --Guido van Rossum (python.org/~guido)
>


Yes, I can. I'll wait for Stefan to update his source (to reach at least
Python 3.2), as he intends to do, and for everything to be published, in
order to review the code.

I also agree with you that right now it doesn't need to look as
state-of-the-art. First make it work, then make it nicer. ;)

Regards,
Cesare


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Terry Reedy

On 8/31/2011 1:10 PM, Guido van Rossum wrote:


This is why I find the issue of Python, the language (and stdlib), as
a whole "conforming to the Unicode standard" such a troublesome
concept -- I think it is something that an application may claim, but
the language should make much more modest claims, such as "the regular
expression syntax supports features X, Y and Z from the Unicode
recommendation XXX", or "the UTF-8 codec will never emit a sequence of
bytes that is invalid according Unicode specification YYY". (As long
as the Unicode references are also versioned or dated.)


This will be a great improvement. It was both embarrassing and 
frustrating to have to respond to Tom C.'s (and others') issues with "Our 
unicode type is too vaguely documented to tell whether you are reporting 
a bug or making a feature request."



But if you can observe (valid) surrogate pairs it is still UTF-16.

...

Ok, I dig this, to some extent. However saying it is UCS-2 is equally
bad.


As I said on the tracker, our narrow builds are in-between (while moving 
closer to UTF-16), and both terms are deceptive, at least to some.



At the same time I think it would be useful if certain string
operations like .lower() worked in such a way that *if* the input were
valid UTF-16, *then* the output would also be, while *if* the input
contained an invalid surrogate, the result would simply be something
that is no worse (in particular, those are all mapped to themselves).
We could even go further and have .lower() and friends look at
graphemes (multi-code-point characters) if the Unicode std has a
useful definition of e.g. lowercasing graphemes that differed from
lowercasing code points.

An analogy is actually found in .lower() on 8-bit strings in Python 2:
it assumes the string contains ASCII, and non-ASCII characters are
mapped to themselves. If your string contains Latin-1 or EBCDIC or
UTF-8 it will not do the right thing. But that doesn't mean strings
cannot contain those encodings, it just means that the .lower() method
is not useful if they do. (Why ASCII? Because that is the system
encoding in Python 2.)


Good analogy.


Let's call those things graphemes (Tom C's term, I quite like leaving
"character" ambiguous) -- they are sequences of multiple code points
that represent a single "visual squiggle" (the kind of thing that
you'd want to be swappable in vim with "xp" :-). I agree that APIs are
needed to manipulate (match, generate, validate, mutilate, etc.)
things at the grapheme level. I don't agree that this means a separate
data type is required.


I presume by 'separate data type' you mean a base level builtin class 
like int or str and that you would allow for wrapper classes built on 
top of str, as such are not really 'separate'. For the grapheme level and 
higher, we should certainly start with wrappers and probably with 
alternate versions based on different strategies.



There are ever-larger units of information
encoded in text strings, with ever farther-reaching (and more vague)
requirements on valid sequences. Do you want to have a data type that
can represent (only valid) words in a language? Sentences? Novels?

...

I think that at this point in time the best we can do is claim that
Python (the language standard) uses either 16-bit code units or 21-bit
code points in its string datatype, and that, thanks to PEP 393,
CPython 3.3 and further will always use 21-bit code points (but Jython
and IronPython may forever use their platform's native 16-bit code
unit representing string type). And then we add APIs that can be used
everywhere to look for code points (even if the string contains code
units), graphemes, or larger constructs. I'd like those APIs to be
designed using a garbage-in-garbage-out principle, where if the input
conforms to some Unicode requirement, the output does too, but if the
input doesn't, the output does what makes most sense. Validation is
then limited to codecs, and optional calls.

If you index or slice a string, or create a string from chr() of a
surrogate or from some other value that the Unicode standard considers
an illegal code point, you better know what you are doing. I want
chr(i) to be valid for all values of i in range(2**21),


Actually, it is range(0x110000) == range(1114112), so that UTF-8 uses at 
most 4 bytes per codepoint. 21 bits is 20.1 bits rounded up.



so it can be
used to create a lone surrogate, or (on systems with 16-bit
"characters") a surrogate pair. And also ord(chr(i)) == i for all i in
range(2**21).


for i in range(0x110000):  # 1114112
    if ord(chr(i)) != i:
        print(i)
# prints nothing (on Windows)

I'm not sure about ord() on a 2-character string

containing a surrogate pair on systems where strings contain 21-bit
code points; I think it should be an error there, just as ord() on
other strings of length != 1. But on systems with 16-bit "characters",
ord() of strings of length 2 containing a valid surrogate pair should
work.


And now does, thanks to who
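The pair handling described above amounts to the standard UTF-16 decoding 
arithmetic; a hedged pure-Python sketch (the function name is invented):

```python
def combine_surrogates(pair):
    # Combine a high+low surrogate pair into the code point it encodes:
    # each surrogate contributes 10 bits, offset from plane 1 (0x10000).
    hi, lo = ord(pair[0]), ord(pair[1])
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
```

For instance, the pair U+D801 U+DC37 decodes to U+10437, which is what ord() 
of a two-"character" surrogate-pair string returns on a narrow build.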

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Nick Coghlan
On Thu, Sep 1, 2011 at 8:02 AM, Terry Reedy  wrote:
> On 8/31/2011 1:10 PM, Guido van Rossum wrote:
>> Ok, I dig this, to some extent. However saying it is UCS-2 is equally
>> bad.
>
> As I said on the tracker, our narrow builds are in-between (while moving
> closer to UTF-16), and both terms are deceptive, at least to some.

We should probably just explicitly document that the internal
representation in narrow builds is a UCS-2/UTF-16 hybrid - like
UTF-16, it can handle the full code point space, but, like UCS-2, it
allows code unit sequences (such as lone surrogates) that strict
UTF-16 would reject.

Perhaps we should also finally split strings out to a dedicated
section on the same tier as Sequence types in the library reference.
Yes, they're sequences, but they're also so much more than that (try
as you might, you're unlikely to be successful in ducktyping strings
the way you can sequences, mappings, files, numbers and other
interfaces. Needing a "real string" is even more common than needing a
"real dict", especially after the efforts to make most parts of the
interpreter that previously cared about the latter distinction accept
arbitrary mapping objects).

I've created http://bugs.python.org/issue12874, suggesting that the
"Sequence Types" and "memoryview type" sections could be usefully
rearranged as:

Sequence Types - list, tuple, range
Text Data - str
Binary Data - bytes, bytearray, memoryview

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia


Re: [Python-Dev] Python 3 optimizations continued...

2011-08-31 Thread Nick Coghlan
On Thu, Sep 1, 2011 at 3:28 AM, Guido van Rossum  wrote:
> On Tue, Aug 30, 2011 at 10:04 PM, Cesare Di Mauro
> Cesare, I'm really sorry that you became so disillusioned that you
> abandoned wordcode. I agree that we were too optimistic about Unladen
> Swallow. Also that the existence of PyPy and its PR machine (:-)
> should not stop us from improving CPython.

Yep, and I'll try to do a better job of discouraging creeping
complexity (without adequate payoffs) without the harmful side effect
of discouraging experimentation with CPython performance improvements
in general.

It's massive "rewrite the world" changes, that don't adequately
account for all the ways CPython gets used or the fact that core devs
need to be able to effectively *review* the changes, that are unlikely
to ever get anywhere. More localised changes, or those that are
relatively easy to explain have a much better chance.

So I'll switch my tone to just trying to make sure that portability
and maintainability concerns are given due weight :)

Cheers,
Nick.

P.S. I suspect a big part of my attitude stems from the fact that
we're still trying to untangle some of the consequences of committing
the PEP 3118 new buffer API implementation with inadequate review (it
turns out the implementation didn't reflect the PEP and the PEP had
deficiencies of its own), and I was one of the ones advocating in
favour of that patch. Once bitten, twice shy, etc.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Neil Hodgson
Glenn Linderman:

> That said, regexp, or some sort of cursor on a string, might be a workable
> solution.  Will it have adequate performance?  Perhaps, at least for some
> applications.  Will it be as conceptually simple as indexing an array of
> graphemes?  No.  Will it ever reach the efficiency of indexing an array of
> graphemes? No.  Does that matter? Depends on the application.

   Using an iterator for cluster access is a common technique
currently. For example, with the Pango text layout and drawing
library, you may create a PangoLayoutIter over a text layout object
(which contains a UTF-8 string along with formatting information) and
iterate by clusters by calling pango_layout_iter_next_cluster. Direct
access to clusters by index is not as useful in this domain as access
by pixel positions - for example to examine the portion of a layout
visible in a window.

   
http://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-get-iter
   In this API, 'index' is used to refer to a byte index into UTF-8,
not a character or cluster index.

   Rather than discuss functionality in the abstract, we need some use
cases involving different levels of character and cluster access to
see whether providing indexed access is worthwhile. I'll start with an
example: some text drawing engines draw decomposed characters ("o"
followed by " ̈" -> "ö") differently compared to their composite
equivalents ("ö") and this may be perceived as better or worse. I'd
like to offer an option to replace some decomposed characters with
their composite equivalent before drawing but since other characters
may look worse, I don't want to do a full normalization. The API style
that appears most useful for this example is an iterator over the
input string that yields composed and decomposed character strings
(that is, it will yield both "ö" and "ö"), each character string is
then converted if in a substitution dictionary and written to an
output string. This is similar to an iterator over grapheme clusters
although, since it is only aimed at composing sequences, the iterator
could be simpler than a full grapheme cluster iterator.
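A minimal sketch of such an iterator, using only the stdlib unicodedata 
module (the function names are invented; this groups combining marks only 
and deliberately ignores the full grapheme cluster rules):

```python
import unicodedata

def iter_combining_sequences(text):
    # Yield maximal base+combining-mark substrings. A char with a
    # nonzero combining class attaches to the preceding cluster.
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch):
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

def substitute(text, table):
    # Replace selected (e.g. decomposed) sequences via a lookup table,
    # leaving everything else untouched -- no full normalization.
    return "".join(table.get(c, c) for c in iter_combining_sequences(text))
```

With table = {"o\u0308": "\u00f6"}, only that one decomposed sequence is 
composed; other characters pass through unchanged, matching the selective 
behavior described above.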

   One of the benefits of iterator access to text is that many
different iterators can be built without burdening the implementation
object with extra memory costs as would be likely with techniques that
build indexes into the representation.

   Neil


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Guido van Rossum
On Wed, Aug 31, 2011 at 5:58 PM, Neil Hodgson  wrote:
> [...] some text drawing engines draw decomposed characters ("o"
> followed by " ̈" -> "ö") differently compared to their composite
> equivalents ("ö") and this may be perceived as better or worse. I'd
> like to offer an option to replace some decomposed characters with
> their composite equivalent before drawing but since other characters
> may look worse, I don't want to do a full normalization.

Isn't this an issue properly solved by various normal forms?

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Hagen Fürstenau
>> [...] some text drawing engines draw decomposed characters ("o"
>> followed by " ̈" -> "ö") differently compared to their composite
>> equivalents ("ö") and this may be perceived as better or worse. I'd
>> like to offer an option to replace some decomposed characters with
>> their composite equivalent before drawing but since other characters
>> may look worse, I don't want to do a full normalization.
> 
> Isn't this an issue properly solved by various normal forms?

I think he's rather describing the need for custom "abnormal forms".

- Hagen



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Neil Hodgson
Guido van Rossum:

> On Wed, Aug 31, 2011 at 5:58 PM, Neil Hodgson  wrote:
>> [...] some text drawing engines draw decomposed characters ("o"
>> followed by " ̈" -> "ö") differently compared to their composite
>> equivalents ("ö") and this may be perceived as better or worse. I'd
>> like to offer an option to replace some decomposed characters with
>> their composite equivalent before drawing but since other characters
>> may look worse, I don't want to do a full normalization.
>
> Isn't this an issue properly solved by various normal forms?

   No, since normalization of all cases may actually lead to worse
visuals in some situations. A potential reason for drawing decomposed
characters differently is that more room may be allocated for the
generic condition where a character may be combined with a wide
variety of accents compared with combining it with a specific accent.

   Here is an example on Windows drawing composite and decomposed
forms to show the types of difference often encountered.
http://scintilla.org/Composite.png
   Now, this particular example displays both forms quite reasonably
so would not justify special processing but I have seen on other
platforms and earlier versions of Windows where the umlaut in the
decomposed form is displaced to the right even to the extent of
disappearing under the next character. In the example, the decomposed
'o' is shorter and lighter and the umlauts are round instead of
square.

   Neil


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Guido van Rossum
On Wed, Aug 31, 2011 at 6:29 PM, Neil Hodgson  wrote:
> Guido van Rossum:
>
>> On Wed, Aug 31, 2011 at 5:58 PM, Neil Hodgson  wrote:
>>> [...] some text drawing engines draw decomposed characters ("o"
>>> followed by " ̈" -> "ö") differently compared to their composite
>>> equivalents ("ö") and this may be perceived as better or worse. I'd
>>> like to offer an option to replace some decomposed characters with
>>> their composite equivalent before drawing but since other characters
>>> may look worse, I don't want to do a full normalization.
>>
>> Isn't this an issue properly solved by various normal forms?
>
>   No, since normalization of all cases may actually lead to worse
> visuals in some situations. A potential reason for drawing decomposed
> characters differently is that more room may be allocated for the
> generic condition where a character may be combined with a wide
> variety of accents compared with combining it with a specific accent.

Ok, I thought there was also a form normalized (denormalized?) to
decomposed form. But I'll take your word.
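(For the record, decomposed normal forms do exist -- NFD/NFKD, alongside the 
composed NFC/NFKC; the point above is that applying any of them wholesale is 
not selective. A quick stdlib illustration:)

```python
import unicodedata

decomposed = "o\u0308"   # 'o' followed by COMBINING DIAERESIS
# NFC composes to the single code point U+00F6 ('ö')
composed = unicodedata.normalize("NFC", decomposed)
# NFD goes the other way, back to base + combining mark
back = unicodedata.normalize("NFD", "\u00f6")
```

Both directions operate on the entire string, which is exactly why a full 
normalization pass cannot compose only the sequences that render badly.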

>   Here is an example on Windows drawing composite and decomposed
> forms to show the types of difference often encountered.
> http://scintilla.org/Composite.png
>   Now, this particular example displays both forms quite reasonably
> so would not justify special processing but I have seen on other
> platforms and earlier versions of Windows where the umlaut in the
> decomposed form is displaced to the right even to the extent of
> disappearing under the next character. In the example, the decomposed
> 'o' is shorter and lighter and the umlauts are round instead of
> square.

I'm not sure it's a good idea to try and improve on the font using
such a hack. But I won't deny you have the right. :-)
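[Editor's note: the selective replacement Neil describes, composing only a chosen subset of decomposed sequences rather than applying full NFC, could be sketched as a plain table-driven substitution. The table below is hypothetical; a real one would list whichever sequences render badly on the target platform:]

```python
# Hypothetical whitelist: compose only these decomposed sequences,
# leaving all other combining sequences untouched (unlike full NFC).
COMPOSE = {
    "o\u0308": "\u00f6",  # o + combining diaeresis -> ö
    "u\u0308": "\u00fc",  # u + combining diaeresis -> ü
    "a\u0308": "\u00e4",  # a + combining diaeresis -> ä
}

def selectively_compose(text):
    """Replace whitelisted decomposed sequences with their precomposed
    equivalents before drawing; everything else is passed through."""
    for decomposed, composed in COMPOSE.items():
        text = text.replace(decomposed, composed)
    return text

print(selectively_compose("Zu\u0308rich"))  # prints "Zürich"
```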

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-31 Thread Glenn Linderman

On 8/31/2011 5:58 PM, Neil Hodgson wrote:

Glenn Linderman:


That said, regexp, or some sort of cursor on a string, might be a workable
solution.  Will it have adequate performance?  Perhaps, at least for some
applications.  Will it be as conceptually simple as indexing an array of
graphemes?  No.  Will it ever reach the efficiency of indexing an array of
graphemes? No.  Does that matter? Depends on the application.

Using an iterator for cluster access is a common technique
currently. For example, with the Pango text layout and drawing
library, you may create a PangoLayoutIter over a text layout object
(which contains a UTF-8 string along with formatting information) and
iterate by clusters by calling pango_layout_iter_next_cluster. Direct
access to clusters by index is not as useful in this domain as access
by pixel positions - for example to examine the portion of a layout
visible in a window.


http://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-get-iter
In this API, 'index' is used to refer to a byte index into UTF-8,
not a character or cluster index.


I agree that different applications may have different needs for 
different types of indexes to various starting points in a large 
string.  Where a custom index is required, a standard index may not be 
needed.



One of the benefits of iterator access to text is that many
different iterators can be built without burdening the implementation
object with extra memory costs as would be likely with techniques that
build indexes into the representation.


How many different iterators into the same text would be concurrently 
needed by an application?  And why?  Seems like if it is dealing with 
text at the level of grapheme clusters, it needs that type of iterator.  
Of course, if it does I/O it needs codec access, but that is by nature 
sequential from the starting point to the end point.
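[Editor's note: a minimal sketch of what such a grapheme-cluster iterator might look like in Python, grouping each base character with its trailing combining marks. This is a rough approximation of UAX #29 segmentation; it ignores ZWJ sequences, Hangul jamo, and regional indicators:]

```python
import unicodedata

def iter_clusters(text):
    """Yield simplified grapheme clusters: a base character followed by
    any combining marks (combining class != 0). Not full UAX #29."""
    cluster = ""
    for ch in text:
        # A non-combining character starts a new cluster.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster

# Decomposed "ö" stays one cluster even though it is two code points.
print(list(iter_clusters("o\u0308l")))  # ['ö', 'l'] (first entry is 2 code points)
```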