Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Chris Angelico
On Fri, Jul 14, 2017 at 4:30 PM, Marko Rauhamaa  wrote:
> Unicode was supposed to get us out of the 8-bit locale hole. Now it
> seems the Unicode hole is far deeper and we haven't reached the bottom
> of it yet. I wonder if the hole even has a bottom.
>
> We now have:
>
>  - an encoding: a sequence of bytes
>
>  - a string: a sequence of integers (code points)
>
>  - "a snippet of text": a sequence of characters

Before Unicode, we had exactly the same thing, only with more encodings.

> Assuming "a sequence of characters" is the final word, and Python wants
> to be involved in that business, one must question the usefulness of
> strings, which are neither here nor there.
>
> When people use Unicode, they are expecting to be able to deal in real
> characters. I would expect:
>
>len(text)   to give me the length in characters
>text[-1]to evaluate to the last character
>re.match("a.c", text)   to match a character between a and c
>
> So the question is, should we have a third type for text. Or should the
> semantics of strings be changed to be based on characters?

What is the length of a string? How often do you actually care about
the number of grapheme clusters - and not, for example, about the
pixel width? (To columnate text, for instance, you need to know about
its width in pixels or millimeters, not the number of characters in
the line.) And if you're going to group code points together because
some of them are combining characters, would you also group them
together because there's a zero-width joiner in the middle? The answer
will sometimes be "yes of course" and sometimes "of course not". These
kinds of linguistic considerations shouldn't be codified into the core
of the language.

IMO the Python str type is adequate as a core data type. What we may
need, though, is additional utility functions, eg:

* unicodedata.grapheme_clusters(str) - split str into a sequence of
grapheme clusters
* pango.get_text_extents(str) - measure the pixel dimensions of a line of text
* platform.punish_user() - issue a platform-dependent response (such
as an electric shock, a whack with a 2x4, or a dropped anvil) on
someone who has just misunderstood Unicode again
* socket.punish_user() - as above, but to the user at the opposite end
of a socket
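
For the first of those, a minimal sketch is already possible on top of the
third-party regex module (pip install regex), whose \X pattern matches one
extended grapheme cluster; the function name below just mirrors the
hypothetical unicodedata API above:

import regex  # third-party module; the stdlib re does not support \X

def grapheme_clusters(text):
    # Each \X match is one extended grapheme cluster.
    return regex.findall(r'\X', text)

print(grapheme_clusters('e\u0301a'))  # ['é', 'a']: the combining acute stays attached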

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Read Application python logs

2017-07-14 Thread neel patel
Hi,

I wrote a simple C program and embedded the Python interpreter.

I am using the Python C API to run Python code.

The code below uses the Python C API inside a .c file.


#include <stdio.h>
#include <Python.h>  /* Python 2 C API: PyFile_FromString/PyFile_AsFile */

PyObject *PyFileObject = PyFile_FromString("test.py", (char *)"r");

int ret = PyRun_SimpleFile(PyFile_AsFile(PyFileObject), "test.py");

if (ret != 0)
    printf("Error\n");  /* was print(), which is not C */


The above code works fine: it runs "test.py". But "test.py" contains some
print statements, so how can I read those messages in this .c file, given
that they currently go to the console?
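
One common approach, sketched here in Python for clarity: have the embedded
interpreter swap sys.stdout for a StringIO buffer before the script runs, then
read the buffer back afterwards. (Python 2, to match the C API calls above;
test.py is the script from the question.)

import sys
import StringIO

buf = StringIO.StringIO()
old_stdout = sys.stdout
sys.stdout = buf                  # print statements now land in buf
try:
    execfile('test.py')           # stands in for PyRun_SimpleFile()
finally:
    sys.stdout = old_stdout

print 'captured: %r' % buf.getvalue()

From C, the same redirection can be driven with PyRun_SimpleString() before
PyRun_SimpleFile(), and the captured text fetched afterwards via
PySys_GetObject("stdout") and PyObject_CallMethod(..., "getvalue", NULL).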

Thanks in Advance.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Marko Rauhamaa
Chris Angelico :

> On Fri, Jul 14, 2017 at 4:30 PM, Marko Rauhamaa  wrote:
>> When people use Unicode, they are expecting to be able to deal in real
>> characters. I would expect:
>>
>>len(text)   to give me the length in characters
>>text[-1]to evaluate to the last character
>>re.match("a.c", text)   to match a character between a and c
>>
>> So the question is, should we have a third type for text. Or should the
>> semantics of strings be changed to be based on characters?
>
> What is the length of a string? How often do you actually care about
> the number of grapheme clusters - and not, for example, about the
> pixel width?

A good question. I have in the past argued that the string should be a
special data type for the specialist text processing needs.

However, I happen to have fooled around with a character-graphics based
game in recent days, and even professionally, I use character-based
alignment quite often. Consider, for example, a Python source code
editor where you want to limit the length of the line based on the
number of characters more typically than based on the number of pixels.

Furthermore, you only dismissed my question about

   len(text)

What about

   text[-1]
   re.match("a.c", text)


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Chris Angelico
On Fri, Jul 14, 2017 at 6:15 PM, Marko Rauhamaa  wrote:
> Chris Angelico :
>
>> On Fri, Jul 14, 2017 at 4:30 PM, Marko Rauhamaa  wrote:
>>> When people use Unicode, they are expecting to be able to deal in real
>>> characters. I would expect:
>>>
>>>len(text)   to give me the length in characters
>>>text[-1]to evaluate to the last character
>>>re.match("a.c", text)   to match a character between a and c
>>>
>>> So the question is, should we have a third type for text. Or should the
>>> semantics of strings be changed to be based on characters?
>>
>> What is the length of a string? How often do you actually care about
>> the number of grapheme clusters - and not, for example, about the
>> pixel width?
>
> A good question. I have in the past argued that the string should be a
> special data type for the specialist text processing needs.
>
> However, I happen to have fooled around with a character-graphics based
> game in recent days, and even professionally, I use character-based
> alignment quite often. Consider, for example, a Python source code
> editor where you want to limit the length of the line based on the
> number of characters more typically than based on the number of pixels.
>
> Furthermore, you only dismissed my question about
>
>len(text)
>
> What about
>
>text[-1]
>re.match("a.c", text)

The considerations and concerns in the second half of my paragraph -
the bit you didn't quote - directly address these two.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Marko Rauhamaa
Chris Angelico :

> On Fri, Jul 14, 2017 at 6:15 PM, Marko Rauhamaa  wrote:
>> Furthermore, you only dismissed my question about
>>
>>len(text)
>>
>> What about
>>
>>text[-1]
>>re.match("a.c", text)
>
> The considerations and concerns in the second half of my paragraph -
> the bit you didn't quote - directly address these two.

I guess you refer to:

   These kinds of linguistic considerations shouldn't be codified into
   the core of the language.

Then, why bother with Unicode to begin with? Why not just use bytes?
After all, Python3's strings have the very same pitfalls:

  - you don't know the length of a text in characters

  - chr(n) doesn't return a character

  - you can't easily find the 7th character in a piece of text

  - you can't compare the equality of two pieces of text

  - you can't use a piece of text as a reliable dict key

etc.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Chris Angelico
On Fri, Jul 14, 2017 at 6:53 PM, Marko Rauhamaa  wrote:
> Chris Angelico :
>
>> On Fri, Jul 14, 2017 at 6:15 PM, Marko Rauhamaa  wrote:
>>> Furthermore, you only dismissed my question about
>>>
>>>len(text)
>>>
>>> What about
>>>
>>>text[-1]
>>>re.match("a.c", text)
>>
>> The considerations and concerns in the second half of my paragraph -
>> the bit you didn't quote - directly address these two.
>
> I guess you refer to:
>
>These kinds of linguistic considerations shouldn't be codified into
>the core of the language.

No, I don't. I refer to the second half of the paragraph you quoted
the first half of.

> Then, why bother with Unicode to begin with? Why not just use bytes?
> After all, Python3's strings have the very same pitfalls:
>
>   - you don't know the length of a text in characters
>
>   - chr(n) doesn't return a character
>
>   - you can't easily find the 7th character in a piece of text

First you have to define "character". There are enough different
definitions of "character" (for the purposes of
counting/iteration/subscripting) that at least some of them have to be
separate functions or methods.

>   - you can't compare the equality of two pieces of text
>
>   - you can't use a piece of text as a reliable dict key

(Dict key usage is defined in terms of equality, so these two are the
same concern.)

Yes, you can. For most purposes, textual equality should be defined in
terms of NFC or NFD normalization. Python already gives you that. You
could argue that a string should always be stored NFC (or NFD, take
your pick), and then the equality operator would handle this; but I'm
not sure the benefit is worth it.

And you can't define equality by whether two strings would display
identically, because then you lose semantic information (for instance,
the difference between U+0020 and U+00A0, or between U+2004 and a pair
of U+2006, or between U+004B and U+041A), not to mention the way that
some fonts introduce confusing similarities that other fonts don't.

If you're trying to use strings as identifiers in any way (say, file
names, or document lookup references), using the NFC/NFD normalized
form of the string should be sufficient.
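
A minimal demonstration of normalized comparison with the standard library:

import unicodedata

s1 = '\u00e8'     # 'è' as one precomposed code point
s2 = 'e\u0300'    # 'e' plus a combining grave accent
print(s1 == s2)   # False: different code point sequences
print(unicodedata.normalize('NFC', s1) ==
      unicodedata.normalize('NFC', s2))  # True: same text once normalized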

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Marko Rauhamaa
Chris Angelico :

> On Fri, Jul 14, 2017 at 6:53 PM, Marko Rauhamaa  wrote:
>> Chris Angelico :
>> Then, why bother with Unicode to begin with? Why not just use bytes?
>> After all, Python3's strings have the very same pitfalls:
>>
>>   - you don't know the length of a text in characters
>>   - chr(n) doesn't return a character
>>   - you can't easily find the 7th character in a piece of text
>
> First you have to define "character".

I'm referring to the

Grapheme clusters, a.k.a. real characters

>>   - you can't compare the equality of two pieces of text
>>   - you can't use a piece of text as a reliable dict key
>
> (Dict key usage is defined in terms of equality, so these two are the
> same concern.)

Ideally, yes. However, someone might say, "don't use == to compare
equality; use unicode.textually_equal() instead". That advice might
satisfy the first requirement but not the second.

> Yes, you can. For most purposes, textual equality should be defined in
> terms of NFC or NFD normalization. Python already gives you that. You
> could argue that a string should always be stored NFC (or NFD, take
> your pick), and then the equality operator would handle this; but I'm
> not sure the benefit is worth it.

As I said, Python3's strings are neither here nor there. They don't
quite solve the problem Python2's strings had. They will push the
internationalization problems a bit farther out but fall short of the
mark.

The developer still has to worry a lot. Unicode seemingly solved one
problem only to present the developer with a bagful of new problems.

And if Python3's strings are a half-measure, why not stick to bytes?

> If you're trying to use strings as identifiers in any way (say, file
> names, or document lookup references), using the NFC/NFD normalized
> form of the string should be sufficient.

Show me ten Python3 database applications, and I'll show you ten Python3
database applications that don't normalize their primary keys.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Marko Rauhamaa
Marko Rauhamaa :

> Chris Angelico :
>> If you're trying to use strings as identifiers in any way (say, file
>> names, or document lookup references), using the NFC/NFD normalized
>> form of the string should be sufficient.
>
> Show me ten Python3 database applications, and I'll show you ten Python3
> database applications that don't normalize their primary keys.

Besides, the normal forms don't help you do text processing (no regular
expression matching, no simple way to get a real character).


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Chris Angelico
On Fri, Jul 14, 2017 at 8:59 PM, Marko Rauhamaa  wrote:
> Chris Angelico :
>
>> On Fri, Jul 14, 2017 at 6:53 PM, Marko Rauhamaa  wrote:
>>> Chris Angelico :
>>> Then, why bother with Unicode to begin with? Why not just use bytes?
>>> After all, Python3's strings have the very same pitfalls:
>>>
>>>   - you don't know the length of a text in characters
>>>   - chr(n) doesn't return a character
>>>   - you can't easily find the 7th character in a piece of text
>>
>> First you have to define "character".
>
> I'm referring to the
>
> Grapheme clusters, a.k.a. real characters

Okay. Just as long as you know that that's not the only valid definition.

>> Yes, you can. For most purposes, textual equality should be defined in
>> terms of NFC or NFD normalization. Python already gives you that. You
>> could argue that a string should always be stored NFC (or NFD, take
>> your pick), and then the equality operator would handle this; but I'm
>> not sure the benefit is worth it.
>
> As I said, Python3's strings are neither here nor there. They don't
> quite solve the problem Python2's strings had. They will push the
> internationalization problems a bit farther out but fall short of the
> mark.
>
> The developer still has to worry a lot. Unicode seemingly solved one
> problem only to present the developer with a bagful of new problems.
>
> And if Python3's strings are a half-measure, why not stick to bytes?

Python's float type can't represent all possible non-integer values.
If it's such a half-measure, why not stick to integers and do all your
own fraction handling?

>> If you're trying to use strings as identifiers in any way (say, file
>> names, or document lookup references), using the NFC/NFD normalized
>> form of the string should be sufficient.
>
> Show me ten Python3 database applications, and I'll show you ten Python3
> database applications that don't normalize their primary keys.

I don't have ten open source ones handy, but I can tell you for sure
that I've worked with far more than ten that don't NEED to normalize
their primary keys. Why? Because they are *by definition* normal
already. Mostly because they use integers for keys. Tada!
Normalization is unnecessary.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Chris Angelico
On Fri, Jul 14, 2017 at 10:05 PM, Marko Rauhamaa  wrote:
> Marko Rauhamaa :
>
>> Chris Angelico :
>>> If you're trying to use strings as identifiers in any way (say, file
>>> names, or document lookup references), using the NFC/NFD normalized
>>> form of the string should be sufficient.
>>
>> Show me ten Python3 database applications, and I'll show you ten Python3
>> database applications that don't normalize their primary keys.
>
> Besides the normal forms don't help you do text processing (no regular
> expression matching, no simple way to get a real character).

What do you mean about regular expressions? You can use REs with
normalized strings. And if you have any valid definition of "real
character", you can use it equally on an NFC-normalized or
NFD-normalized string than any other. They're just strings, you know.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Steve D'Aprano
On Fri, 14 Jul 2017 04:30 pm, Marko Rauhamaa wrote:

> Unicode was supposed to get us out of the 8-bit locale hole.

Which it has done. Apart from use for backwards compatibility, there is no good
reason to use the masses of legacy extensions to ASCII or the technically
fragile non-Unicode multibyte encodings from China and Japan.

Backwards compatibility is important, but for new content we should all support
Unicode.


> Now it 
> seems the Unicode hole is far deeper and we haven't reached the bottom
> of it yet. I wonder if the hole even has a bottom.

This is not a Unicode hole. This is a human languages hole, compounded by the
need for backwards compatibility with legacy encodings.


> We now have:
> 
>  - an encoding: a sequence of bytes
>
>  - a string: a sequence of integers (code points)
> 
>  - "a snippet of text": a sequence of characters

I'm afraid that's wrong, and much too simplified. What we have had, ever since
computers started having standards for the storage and representation of text
(i.e. since EBCDIC at the very least, possibly even earlier), is:

(1) A **character set** made up of some collection of:

- alphabetical letters, characters, syllabograms, ideographs or logographs
- digits and other numeric symbols
- punctuation marks
- other textual marks, including diacritics ("accent marks")
- assorted symbols, icons, pictograms or hieroglyphics
- control and formatting codes
- white space and other text separators
- and any other entities that have text-like semantics.

The character set is the collection of entities we would like to represent as
computer data. But of course computers can't store "the letter Aye" A or "the
letter Zhe" Ж so we also need:

(2) A (possibly implicit) mapping between the entities in the character 
set and some contiguous range of abstract numeric values ("code points").

(3) The **encoding**, an explicit mapping between those abstract code points
and some concrete representation suitable for use as storage or transmission
by computers. That usually means a sequence of "code units", where
each code unit is typically one, two or four bytes.

Note that a single character set could have multiple encodings.

In pre-Unicode encodings such as ASCII, the difference between (1) and (2) was
frequently (always?) glossed over. For example, in ASCII:

- the character set was made up of 128 control characters, American English
  letters, digits and punctuation marks;

- there is an implicit mapping, e.g. "character A is code point 65";

- there is also an explicit mapping, e.g. "character A (i.e. code point 65)
  is byte 0x41 (decimal 65)".

So the legacy character set and encoding standards helped cause confusion, by
implying that "characters are bytes" instead of making the difference explicit.

In addition, we have:

(4) Strings, ropes and other data structures suitable for the storage of
**sequences of code points** (characters, codes, symbols etc); strings
being the simplest implementation (a simple array of code units), but
they're not the only one.

We also have:

(5) Human-meaningful chunks of text: characters, graphemes, words, sentences, 
symbols, paragraphs, pages, sections, chapters, snippets or what have you.


There's no direct one-to-one correspondence between (5) and (4). A string can
just as easily contain half a word "aard" as a full word "aardvark".

And let's not forget:

(6) The **glyphs** of each letter, symbol, etc, encompassing the visual shape
and design of those chunks of text, which can depend on the context. For
example, the Greek letter sigma looks different depending on whether it
is at the end of a word or not.


> Assuming "a sequence of characters" is the final word, 

Why would you assume that? Let's start with, what's a character?


> and Python wants 
> to be involved in that business, one must question the usefulness of
> strings, which are neither here nor there.

Sure, you can question anything you like, it's a free country[1], but unless you
have a concrete plan for something better and are willing to implement it, the
chances are very high that nothing will happen.

The vast majority of programming languages provide only a set of low-level
primitives for manipulating strings, with no semantic meaning enforced. If you
want to give *human meaning* to your strings, you need something more than just
the string-handling primitives your computer language provides. This was just
as true in the old days of ASCII as it is today with Unicode: your computer
language is just as happy making a string containing the nonsense word 
"vxtsEpdlu" as the real word "Norwegian".


> When people use Unicode, they are expecting to be able to deal in real
> characters.

Then their expectations are too high and they are misinformed. Unicode is not a
standard for implementing human-meaningful text (although it takes a few steps
towards such a standard).

Unic

Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Marko Rauhamaa
Steve D'Aprano :

> These are only a *few* of the *easy* questions that need to be
> answered before we can even consider your question:
>
>> So the question is, should we have a third type for text. Or should
>> the semantics of strings be changed to be based on characters?

Sure, but if they can't be answered, what good is there in having
strings (as opposed to bytes)? What problem do strings solve? What
operation depends on (or is made simpler) by having strings (instead of
bytes)?

We are not even talking about some exotic languages, but the problem is
right there in the middle of Latin-1. We can't even say what

len("è")

should return. And we may experience:

>>> ord("è")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
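
Both observations reproduce on any current Python 3; a minimal demo:

s1 = '\u00e8'     # precomposed è: one code point
s2 = 'e\u0300'    # decomposed è: two code points, same visible character
print(len(s1), len(s2))  # 1 2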

Of course, UTF-8 in a bytes object doesn't make the situation any
better, but does it make it any worse?

As it stands, we have

   è --[encode>-- Unicode --[reencode>-- UTF-8

Why is one encoding format better than the other?


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Rhodri James

On 14/07/17 14:31, Marko Rauhamaa wrote:
> Of course, UTF-8 in a bytes object doesn't make the situation any
> better, but does it make it any worse?

Speaking as someone who has been up to his elbows in this recently, I
would say emphatically that it does make things worse.  It adds an extra
layer of complexity to all of the questions you were asking, and more.
A single codepoint is a meaningful thing, even if its meaning may be
modified by combining.  A single byte may or may not be meaningful.


--
Rhodri James *-* Kynesim Ltd
--
https://mail.python.org/mailman/listinfo/python-list


PYTHON GDAL

2017-07-14 Thread jorge . conrado



Hi,

I installed the GDAL 2.2.1 using conda. Then I did:

import gdal

and I had:


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/conrado/miniconda2/lib/python2.7/site-packages/gdal.py", line 2, in <module>
    from osgeo.gdal import deprecation_warn
  File "/home/conrado/miniconda2/lib/python2.7/site-packages/osgeo/__init__.py", line 21, in <module>
    _gdal = swig_import_helper()
  File "/home/conrado/miniconda2/lib/python2.7/site-packages/osgeo/__init__.py", line 17, in swig_import_helper
    _mod = imp.load_module('_gdal', fp, pathname, description)
ImportError: libicui18n.so.56: cannot open shared object file: No such
file or directory



then I used the command find:

find . -name 'libicui18n.so.56' -print

and I had:

./usr/local/lib/python3.6/site-packages/PyQt5/Qt/lib/libicui18n.so.56


Please, what can I do to set this lib so that python2.7 recognizes it?


Thanks,


Conrado
--
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Marko Rauhamaa
Rhodri James :

> On 14/07/17 14:31, Marko Rauhamaa wrote:
>> Of course, UTF-8 in a bytes object doesn't make the situation any
>> better, but does it make it any worse?
>
> Speaking as someone who has been up to his elbows in this recently, I
> would say emphatically that it does make things worse. It adds an
> extra layer of complexity to all of the questions you were asking, and
> more. A single codepoint is a meaningful thing, even if its meaning
> may be modified by combining. A single byte may or may not be
> meaningful.

I'd like to understand this better. Maybe you have a couple of examples
to share?


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Michael Torrie
On 07/14/2017 07:31 AM, Marko Rauhamaa wrote:
> Of course, UTF-8 in a bytes object doesn't make the situation any
> better, but does it make it any worse?

> 
> As it stands, we have
> 
>è --[encode>-- Unicode --[reencode>-- UTF-8
> 
> Why is one encoding format better than the other?

This is precisely the logic behind Google using UTF-8 for strings in Go,
rather than having some O(1) abstract type like Python has.  And many
other languages do the same.  The argument is that because of the very
issues that you mention, having O(1) lookup in a string isn't that
important, since looking up a particular index in a unicode string is
rarely the right thing to do, so UTF-8 is just fine as a native,
in-memory type.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Michael Torrie
On 07/14/2017 08:05 AM, Rhodri James wrote:
> On 14/07/17 14:31, Marko Rauhamaa wrote:
>> Of course, UTF-8 in a bytes object doesn't make the situation any
>> better, but does it make it any worse?
> 
> Speaking as someone who has been up to his elbows in this recently, I 
> would say emphatically that it does make things worse.  It adds an extra 
> layer of complexity to all of the questions you were asking, and more. 
> A single codepoint is a meaningful thing, even if its meaning may be 
> modified by combining.  A single byte may or may not be meaningful.

Are you saying that dealing with Unicode in Google Go, which uses UTF-8
in memory, is adding an extra layer of complexity and makes things worse
than they might be in Python?

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Rhodri James

On 14/07/17 15:32, Michael Torrie wrote:
> On 07/14/2017 08:05 AM, Rhodri James wrote:
>> On 14/07/17 14:31, Marko Rauhamaa wrote:
>>> Of course, UTF-8 in a bytes object doesn't make the situation any
>>> better, but does it make it any worse?
>>
>> Speaking as someone who has been up to his elbows in this recently, I
>> would say emphatically that it does make things worse.  It adds an extra
>> layer of complexity to all of the questions you were asking, and more.
>> A single codepoint is a meaningful thing, even if its meaning may be
>> modified by combining.  A single byte may or may not be meaningful.
>
> Are you saying that dealing with Unicode in Google Go, which uses UTF-8
> in memory, is adding an extra layer of complexity and makes things worse
> than they might be in Python?

I'm not familiar with Go.  If the programmer has to be aware that she
is using UTF-8 under the hood, then yes, it does add an extra layer
of complexity.  You have to remember the rules of UTF-8 as well as
everything else.


--
Rhodri James *-* Kynesim Ltd
--
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Rhodri James

On 14/07/17 15:14, Marko Rauhamaa wrote:
> Rhodri James :
>> On 14/07/17 14:31, Marko Rauhamaa wrote:
>>> Of course, UTF-8 in a bytes object doesn't make the situation any
>>> better, but does it make it any worse?
>>
>> Speaking as someone who has been up to his elbows in this recently, I
>> would say emphatically that it does make things worse. It adds an
>> extra layer of complexity to all of the questions you were asking, and
>> more. A single codepoint is a meaningful thing, even if its meaning
>> may be modified by combining. A single byte may or may not be
>> meaningful.
>
> I'd like to understand this better. Maybe you have a couple of examples
> to share?

Sure.

What I've mostly been looking at recently has been the Expat XML parser.
XML chooses to deal with one of your problems by defining that it's
not having anything to do with combining: sequences of codepoints are
all you need to worry about when comparing strings.  U+00E8 (LATIN SMALL
LETTER E WITH GRAVE) is not the same as U+0065 (LATIN SMALL LETTER E)
followed by U+0300 (COMBINING GRAVE ACCENT), for example.

However Expat is written in C, and it reads in UTF-8 as a sequence of
bytes.  There are endless checks all over the code that complete UTF-8
byte sequences have been read in or passed across functional interfaces.
When you are dealing with a bytestream like this, you cannot assume that
you have complete codepoints, and you cannot find codepoint boundaries
without searching along the string.  It's only once you have
reconstructed the codepoint that you can tell what sort of character you
have, and whether or not it is valid in your parsing context.
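
A minimal sketch of that boundary search, in Python rather than Expat's C: in
UTF-8, continuation bytes have the form 0b10xxxxxx, so code point boundaries
are exactly the bytes that don't match that pattern.

def codepoint_starts(data):
    # Lead bytes (and ASCII) are anything that is not 0b10xxxxxx.
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]

buf = 'è∴'.encode('utf-8')    # b'\xc3\xa8\xe2\x88\xb4': a 2-byte and a 3-byte sequence
print(codepoint_starts(buf))  # [0, 2]: you must scan to find these offsets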


--
Rhodri James *-* Kynesim Ltd
--
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Chris Angelico
On Sat, Jul 15, 2017 at 12:32 AM, Michael Torrie  wrote:
> On 07/14/2017 08:05 AM, Rhodri James wrote:
>> On 14/07/17 14:31, Marko Rauhamaa wrote:
>>> Of course, UTF-8 in a bytes object doesn't make the situation any
>>> better, but does it make it any worse?
>>
>> Speaking as someone who has been up to his elbows in this recently, I
>> would say emphatically that it does make things worse.  It adds an extra
>> layer of complexity to all of the questions you were asking, and more.
>> A single codepoint is a meaningful thing, even if its meaning may be
>> modified by combining.  A single byte may or may not be meaningful.
>
> Are you saying that dealing with Unicode in Google Go, which uses UTF-8
> in memory, is adding an extra layer of complexity and makes things worse
> than they might be in Python?

Can you reverse a string in Go? How do you do it?

With Python, you can sometimes get tripped up, eg if you have:

* combining characters
* Arabic letters, which can look very different when reordered
* explicit directionality markers

But the semantics are at least easy to comprehend: you have a strict
reversal of code unit order. So you can reverse a string for parsing
purposes, and then re-reverse the subsections.

If you have a UTF-8 bytestring, a naive reversal will trip you up if
you have *any* non-ASCII values in there. You will have invalid UTF-8.
So *at very least*, your "reverse string" code has to be UTF-8 aware -
it has to keep continuation bytes with the correct start byte. And you
*still* have all the concerns that Python has.

Extra complexity. QED.
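
A minimal demonstration of the difference:

s = 'πA'
print(s[::-1])           # 'Aπ': reversing code points is at least well defined

b = s.encode('utf-8')    # b'\xcf\x80A'
rev = b[::-1]            # b'A\x80\xcf': a continuation byte now leads
try:
    rev.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)             # naive byte reversal produced invalid UTF-8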

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Write this accumulator in a functional style

2017-07-14 Thread Paul Rubin
Rustom Mody  writes:
> Yeah I know append method is supposedly O(1).

It's amortized O(1).
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Write this accumulator in a functional style

2017-07-14 Thread Steve D'Aprano
On Fri, 14 Jul 2017 09:06 am, Ned Batchelder wrote:

> Steve's summary is qualitatively right, but a little off on the quantitative
> details.  Lists don't resize to 2*N, they resize to ~1.125*N:
> 
> new_allocated = (size_t)newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6);
> 
> (https://github.com/python/cpython/blob/master/Objects/listobject.c#L49-L58)

Ah, thanks for the correction. I was going off vague memories of long-ago
discussion (perhaps even as long ago as Python 1.5!) when Tim Peters (I think
it was) described how list overallocation worked.
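
The over-allocation steps are easy to watch from pure Python; the exact sizes
below are CPython-version and platform dependent, so treat this as a sketch:

import sys

lst = []
prev = sys.getsizeof(lst)
for i in range(64):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != prev:     # the allocated capacity just grew
        print(len(lst), size)
        prev = size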




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PYTHON GDAL

2017-07-14 Thread Fabien

On 07/14/2017 03:57 PM, jorge.conr...@cptec.inpe.br wrote:
> Hi,
>
> I installed the GDAL 2.2.1 using conda. Then I did:
>
> import gdal
>
> and I had:
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/conrado/miniconda2/lib/python2.7/site-packages/gdal.py", line 2, in <module>
>     from osgeo.gdal import deprecation_warn
>   File "/home/conrado/miniconda2/lib/python2.7/site-packages/osgeo/__init__.py", line 21, in <module>
>     _gdal = swig_import_helper()
>   File "/home/conrado/miniconda2/lib/python2.7/site-packages/osgeo/__init__.py", line 17, in swig_import_helper
>     _mod = imp.load_module('_gdal', fp, pathname, description)
> ImportError: libicui18n.so.56: cannot open shared object file: No such
> file or directory
>
> then I used the command find:
>
> find . -name 'libicui18n.so.56' -print
>
> and I had:
>
> ./usr/local/lib/python3.6/site-packages/PyQt5/Qt/lib/libicui18n.so.56
>
> Please, what can I do to set this lib so that python2.7 recognizes it?

Since you are using conda I *strongly* recommend to use the conda-forge
channel to install GDAL:

conda install -c conda-forge gdal

https://conda-forge.org/

> Thanks,
>
> Conrado


--
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Steve D'Aprano
On Fri, 14 Jul 2017 11:31 pm, Marko Rauhamaa wrote:

> Steve D'Aprano :
> 
>> These are only a *few* of the *easy* questions that need to be
>> answered before we can even consider your question:
>>
>>> So the question is, should we have a third type for text. Or should
>>> the semantics of strings be changed to be based on characters?
> 
> Sure, but if they can't be answered, what good is there in having
> strings (as opposed to bytes)?

I didn't say they can't be answered. But however you answer them, you're going
to make somebody angry.

I notice you haven't given a definition for "character" yet. It's easy to be
critical and complain that Unicode strings don't handle "characters", but if
you can't suggest any improvements, then you're just bellyaching.

Do you have some concrete improvements in mind?


> What problem do strings solve?

Well, to start with it's a lot nicer to be able to write:


name = input("What is your name?")

instead of:

name = input("5768617420697320796f7572206e616d653f")

don't you think? I think that alone makes strings worth it.

And of course, I don't want to be limited to just US English, or one language at
a time. So we need a universal character set.


> What  
> operation depends on (or is made simpler) by having strings (instead of
> bytes)?

Code is written for people first, and to be executed by a computer only second.
So we want human-readable text to look as much like human-readable text.

Although I suppose computer keyboards would be a lot smaller if they only needed
16 keys marked 0...9ABCDEF instead of what we have now. We could program by
entering bytes:

6e616d65203d20696e70757428225768617420697320796f7572206e616d653f22290a7072696e742822596f7572206e616d652069732025722e222025206e616d6529

although debugging would be a tad more difficult, I expect. But the advantage
is, we'd have one less data type!

I mean, sure, *some* stick-in-the-mud old fashioned programmers would prefer to
write:

name = input("What is your name?")
print("Your name is %r." % name)

but I think your suggestion of eliminating strings and treating everything as
bytes has its advantages. For starters, everything is a one-liner!

Bytes, being a sequence of numbers, shouldn't define text operations like
converting to uppercase, regular expressions, and so forth. Of course the
Python 3 bytes data type does support some limited text operations, but that's
for backward compatibility with pre-Unicode Python, and its limited to ASCII.
If we were designing Python from scratch, I'd argue strongly against adding
text methods to a sequence of numbers.
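
The asymmetry is easy to see; bytes case-mapping is ASCII-only by design:

print('è'.upper())                    # 'È': str knows the Unicode case rules
print('è'.encode('latin-1').upper())  # b'\xe8' unchanged: bytes.upper() only touches ASCII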


> We are not even talking about some exotic languages, but the problem is
> right there in the middle of Latin-1. We can't even say what
>
> len("è")
> 
> should return.

Latin-1 predates Unicode, so this problem has existed for a long time. It's not
something that Unicode has introduced, it is inherent to the problem of dealing
with human language in its full generality.

Do you have a solution for this? How do you get WYSIWYG display of text without
violating the expectation that we should be able to count the length of a
string?

Before you answer, does your answer apply to Arabic and Thai as well as Western
European languages?


> And we may experience: 
> 
> >>> ord("è")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: ord() expected a character, but string of length 2 found

You might, but only as a contrived example. You had to intentionally create a
decomposed string of length two as a string literal, and then call ord(). But
of course you knew that was going to happen -- its not something likely to
happen by accident. In practice, when you receive an arbitrary string, you test
its length before calling ord(). Or you walk the string calling ord() on each
code point.


> Of course, UTF-8 in a bytes object doesn't make the situation any
> better, but does it make it any worse?

Sure it does. You want the human reader to be able to predict the number of
graphemes ("characters") by sight. Okay, here's a string in UTF-8, in bytes:

e288b4c39fcf89e289a0d096e280b0e282ac78e2889e

How do you expect the human reader to predict the number of graphemes from a
UTF-8 hex string?

For the record, that's 44 hex digits or 22 bytes, to encode 9 graphemes. That's
an average of 2.44 bytes per grapheme. Would you expect the average programmer
to be able to predict where the grapheme breaks are?
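
For the record, the bytes decode as follows (here every grapheme happens to be
a single code point, so len() of the decoded text counts graphemes too):

data = bytes.fromhex('e288b4c39fcf89e289a0d096e280b0e282ac78e2889e')
text = data.decode('utf-8')
print(len(data), len(text))  # 22 9
print(text)                  # ∴ßω≠Ж‰€x∞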


> As it stands, we have
> 
>è --[encode>-- Unicode --[reencode>-- UTF-8

I can't even work out what you're trying to say here.



> Why is one encoding format better than the other?

It depends on what you're trying to do.

If you want to minimize storage and transmission costs, and don't care about
random access into the string, then UTF-8 is likely the best encoding, since it
uses as little as one byte per code point, and in practice with real-world text
(at least for Europeans) it is rarely more expensive than the alternatives.

Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Marko Rauhamaa
Steve D'Aprano :

> On Fri, 14 Jul 2017 11:31 pm, Marko Rauhamaa wrote:
>> Of course, UTF-8 in a bytes object doesn't make the situation any
>> better, but does it make it any worse?
>
> Sure it does. You want the human reader to be able to predict the
> number of graphemes ("characters") by sight. Okay, here's a string in
> UTF-8, in bytes:
>
> e288b4c39fcf89e289a0d096e280b0e282ac78e2889e
>
> How do you expect the human reader to predict the number of graphemes
> from a UTF-8 hex string?
>
> For the record, that's 44 hex digits or 22 bytes, to encode 9
> graphemes. That's an average of 2.44 bytes per grapheme. Would you
> expect the average programmer to be able to predict where the grapheme
> breaks are?
>
>> As it stands, we have
>> 
>>è --[encode>-- Unicode --[reencode>-- UTF-8
>
> I can't even work out what you're trying to say here.

I can tell, yet that doesn't prevent you from dismissing what I'm
saying.

>> Why is one encoding format better than the other?
>
> It depends on what you're trying to do.
>
> If you want to minimize storage and transmission costs, and don't care
> about random access into the string, then UTF-8 is likely the best
> encoding, since it uses as little as one byte per code point, and in
> practice with real-world text (at least for Europeans) it is rarely
> more expensive than the alternatives.

Python3's strings don't give me any better random access than UTF-8.

Storage and transmission costs are not an issue. It's only that storage
and transmission are still defined in terms of bytes. Python3's strings
force you to encode/decode between strings and bytes for a
yet-to-be-specified advantage.

> It also has the advantage of being backwards compatible with ASCII, so
> legacy applications that assume all characters are a single byte will
> work if you use UTF-8 and limit yourself to the ASCII-compatible
> subset of Unicode.

UTF-8 is perfectly backward-compatible with ASCII.

> The disadvantage is that each code point can be one, two, three or
> four bytes wide, and naively shuffling bytes around will invariably
> give you invalid UTF-8 and cause data loss. So UTF-8 is not so good as
> the in-memory representation of text strings.

The in-memory representation is not an issue. It's the abstract
semantics that are the issue.

At the abstract level, we have the text in a human language. Neither
strings nor UTF-8 provide that so we have to settle for something
cruder. I have yet to hear why a string does a better job than UTF-8.

> If you have lots of memory, then UTF-32 is the best for in-memory
> representation, because its a fixed-width encoding and parsing it is
> simple. Every code point is just four bytes and you an easily
> implement random access into the string.

The in-memory representation is not an issue. It's the abstract
semantics that are the issue.

> If you want a reasonable compromise, UTF-16 is quite decent. If you're
> willing to limit yourself to the first 2**16 code points of Unicode,
> you can even pretend that its a fixed width encoding like UTF-32.

UTF-16 (used by Windows and Java, for example) is even worse than
strings and UTF-8 because:

è --[encode>-- Unicode --[reencode>-- UTF-16 --[reencode>-- bytes
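
For a concrete illustration of that extra layer: any code point outside the
Basic Multilingual Plane becomes two UTF-16 code units (a surrogate pair), as
this minimal demo shows:

s = '\U0001F600'              # one code point outside the BMP
u16 = s.encode('utf-16-le')
print(len(s), len(u16) // 2)  # 1 code point, but 2 UTF-16 code units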

> If you have to survive transmission through machines that require
> 7-bit clean bytes, then UTF-7 is the best encoding to use.

I don't know why that is coming into this discussion.

So no raison d'être has yet been offered for strings.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Neil Cerutti
On 2017-07-14, Rhodri James  wrote:
> On 14/07/17 15:32, Michael Torrie wrote:
>> Are you saying that dealing with Unicode in Google Go, which
>> uses UTF-8 in memory, is adding an extra layer of complexity
>> and makes things worse than they might be in Python?
>
> I'm not familiar with Go.  If the programmer has to be aware
> that she is using UTF-8 under the hood, then yes, it does
> add an extra layer of complexity.  You have to remember the
> rules of UTF-8 as well as everything else.

Go represents strings as sequences of bytes. It provides separate
APIs that allow you to regard those bytes as either plain old
bytes or as a sequence of runes (not necessarily normalized
codepoints). If your byte strings aren't in UTF-8, then Go Away.

https://blog.golang.org/strings

-- 
Neil Cerutti

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Marko Rauhamaa
Rhodri James :

> On 14/07/17 15:14, Marko Rauhamaa wrote:
>> I'd like to understand this better. Maybe you have a couple of
>> examples to share?
>
> Sure.
>
> What I've mostly been looking at recently has been the Expat XML parser.
> XML chooses to deal with one of your problems by defining that it's not
> having anything to do with combining, sequences of codepoints are all
> you need to worry about when comparing strings.  U+00E8 (LATIN SMALL
> LETTER E WITH GRAVE) is not the same as U+0065 (LATIN SMALL LETTER E)
> followed by U+0300 (COMBINING GRAVE ACCENT) for example.

Very interesting. The relevant W3C spec confirms what you said:

  5. Test the resulting sequences of code points bit-by-bit for identity.

  [...]

  This document therefore recommends, when possible, that all content be
  stored and exchanged in Unicode Normalization Form C (NFC).

  <https://www.w3.org/TR/charmod-norm/>


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Marko Rauhamaa
Michael Torrie :

> On 07/14/2017 07:31 AM, Marko Rauhamaa wrote:
>> Of course, UTF-8 in a bytes object doesn't make the situation any
>> better, but does it make it any worse?
>> 
>> As it stands, we have
>> 
>>è --[encode>-- Unicode --[reencode>-- UTF-8
>> 
>> Why is one encoding format better than the other?
>
> This is precisely the logic behind Google using UTF-8 for strings in
> Go, rather than having some O(1) abstract type like Python has. And
> many other languages do the same. The argument is that because of the
> very issues that you mention, having O(1) lookup in a string isn't
> that important, since looking up a particular index in a unicode
> string is rarely the right thing to do, so UTF-8 is just fine as a
> native, in-memory type.

It pays to come in late.

Windows NT and Java evaded the 8-bit localization nightmare by going
UCS-2.

Python3 managed not to repeat the earlier UCS-2 blunders by going all
the way to UCS-4.

Go saw the futility of UCS-4 as a separate data type and dropped down to
UTF-8.

Unfortunately, Guile is following in Python3's footsteps.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Terry Reedy

On 7/14/2017 10:30 AM, Michael Torrie wrote:
> On 07/14/2017 07:31 AM, Marko Rauhamaa wrote:
>> Of course, UTF-8 in a bytes object doesn't make the situation any
>> better, but does it make it any worse?
>>
>> As it stands, we have
>>
>> è --[encode>-- Unicode --[reencode>-- UTF-8
>>
>> Why is one encoding format better than the other?

All digital data are ultimately bits, usually collected together in
groups of 8, called bytes.  The point of python 3 is that text should
normally be instances of a text class, separate from the raw bytes
class, with a defined internal encoding.  The actual internal encoding
is secondary.  And it changed in 3.3.

Python ints are encoded bytes, so are floats, and everything else.  When
one prints a float, one certainly does not see a representation of the
raw bytes in the float object.  Instead, one sees a representation of
the value it represents.  There is a proposal to change the internal
encoding of int, at least on 64-bit machines, which are now standard.
However, because print(87987282738472387429748) prints
87987282738472387429748 and not the internal bytes, the change in the
internal bytes will not affect the user view of ints.

> This is precisely the logic behind Google using UTF-8 for strings in Go,
> rather than having some O(1) abstract type like Python has.  And many
> other languages do the same.  The argument is that because of the very
> issues that you mention, having O(1) lookup in a string isn't that
> important, since looking up a particular index in a unicode string is
> rarely the right thing to do, so UTF-8 is just fine as a native,
> in-memory type.

Does Go use bytes for text, like most people did in Python 2, or a
separate text string class that hides the internal encoding format and
implementation?  In other words, if you do the equivalent of print(s)
where s is a text string with a mixture of greek, cyrillic, hindi,
chinese, japanese, and korean chars, do you see the characters, or some
representation of the internal bytes?



--
Terry Jan Reedy


--
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Marko Rauhamaa
Terry Reedy :

> On 7/14/2017 10:30 AM, Michael Torrie wrote:
>> On 07/14/2017 07:31 AM, Marko Rauhamaa wrote:
>>> Of course, UTF-8 in a bytes object doesn't make the situation any
>>> better, but does it make it any worse?
>>
>>>
>>> As it stands, we have
>>>
>>> è --[encode>-- Unicode --[reencode>-- UTF-8
>>>
>>> Why is one encoding format better than the other?
>
> All digital data are ultimately bits, usually collected together in
> groups of 8, called bytes.

Naturally.

> The point of python 3 is that text should normally be instances of a
> text class, separate from the raw bytes class, with a defined internal
> encoding.

And I called its usefulness into question.

>> This is precisely the logic behind Google using UTF-8 for strings in Go,
>> rather than having some O(1) abstract type like Python has.  And many
>> other languages do the same.  The argument is that because of the very
>> issues that you mention, having O(1) lookup in a string isn't that
>> important, since looking up a particular index in a unicode string is
>> rarely the right thing to do, so UTF-8 is just fine as a native,
>> in-memory type.
>
> Does Go use bytes for text, like most people did in Python 2,

Yes. Also, C and the GNU textutils do that.

> a separate text string class, that hides the internal encoding format
> and implementation? In other words, if you do the equivalent of
> print(s) where s is a text string with a mixture of greek, cyrillic,
> hindi, chinese, japanese, and korean chars, do you see the characters,
> or some representation of the internal bytes?

Yes, in Python2, Go, C and GNU textutils, when you print a text string
containing a mixture of languages, you see characters.

Why?

Because that's what the terminal emulator chooses to do upon receiving
those bytes.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Terry Reedy

On 7/14/2017 5:51 PM, Marko Rauhamaa wrote:
> Yes, in Python2, Go, C and GNU textutils, when you print a text string
> containing a mixture of languages, you see characters.
>
> Why?
>
> Because that's what the terminal emulator chooses to do upon receiving
> those bytes.

>>> s = u'\u1171\u2222\u3333\u4444\u5555'
>>> s
u'\u1171\u2222\u3333\u4444\u5555'
>>> print(s)
ᅱ∢㌳䑄啕
>>> b = s.encode('utf-8')
>>> b
'\xe1\x85\xb1\xe2\x88\xa2\xe3\x8c\xb3\xe4\x91\x84\xe5\x95\x95'
>>> print(b)
ᅱ∢㌳䑄啕

I prefer the accurate 5 char print of the text string to the print of
the bytes.


--
Terry Jan Reedy


--
https://mail.python.org/mailman/listinfo/python-list


pyserial and end-of-line specification

2017-07-14 Thread F S
I just started using Python and I am writing code to access my serial port
using pyserial. I have no problem with unix-based text coming in the stream
using a LF (0x0A) record separator. I am also using non-blocking IO. However,
I have some sensor devices that use the Windows CRLF (0x0D,0x0A) record
separator and also 0x02 and 0x03 (STX, ETX) framing, so I need to change the
EOL (end of line) specifier in order to get pyserial's readline to handle this.

I read the doc page for pyserial, and it alludes to using TextIOWrapper to
accomplish this, however the example is very unclear and I could not find
better information on the IO page.

I would appreciate any advice on how to delimit the records using "\x0D\x0A"
and "\x03".

Thanks
Fritz
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Steve D'Aprano
On Sat, 15 Jul 2017 07:12 am, Terry Reedy wrote:

> Does Go use bytes for text, like most people did in Python 2, or a separate
> text string class that hides the internal encoding format and
> implementation?  In other words, if you do the equivalent of print(s)
> where s is a text string with a mixture of greek, cyrillic, hindi,
> chinese, japanese, and korean chars, do you see the characters, or some
> representation of the internal bytes?

The answer is, it's complicated.

Go has two string types: "strings", and "runes".

Strings are equivalent to Python 3 byte-strings, except that the language is
biased towards assuming they are UTF-8 instead of Python 3's decision to assume
they are ASCII. In other words, if you display a Python 3 byte-string, it will
display bytes that represent ASCII characters as ASCII, and everything else
escaped as a hex byte:

py> b'\x41\xcf\x80\x5a'
b'A\xcf\x80Z'

Go does the same, except it will display anything which is legal UTF-8 (which
may be 1, 2, 3, or 4 bytes) as a Unicode character (actually code point).
Assuming your environment is capable of displaying that character, otherwise
you'll just see a square, or some other artifact.

So if Python used the same rules as Go, the above byte-string would display as:

b'AπZ'
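
The same bytes can be checked from Python:

b = b'\x41\xcf\x80\x5a'
print(b)                   # b'A\xcf\x80Z': Python 3 escapes non-ASCII bytes
print(b.decode('utf-8'))   # 'AπZ': what a UTF-8-biased display would show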

Most of the time, when processing strings, Go treats them as arbitrary bytes,
although Go comes with libraries that help make it easier to work with them as
UTF-8 byte strings.

Runes, on the other hand, are a strict superset of Unicode. Runes are strings of
32-bit code units, so like UTF-32 except not limited to the Unicode range of
\U00000000 through \U0010FFFF. Runes will accept any 32 bit values up to
0xFFFFFFFF.

I presume that runes which fall within the UTF-32 range will be displayed as the
Unicode character where possible, and those which fall outside of that range as
some sort of hex display.

So Go strings are like Python byte strings, biased towards UTF-8 but with no
guarantees made, and Go runes are a superset of Python text strings.

Does that answer your question sufficiently?

https://blog.golang.org/strings


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Steve D'Aprano
On Sat, 15 Jul 2017 04:10 am, Marko Rauhamaa wrote:

> Steve D'Aprano :
>> On Fri, 14 Jul 2017 11:31 pm, Marko Rauhamaa wrote:
[...]
>>> As it stands, we have
>>> 
>>>è --[encode>-- Unicode --[reencode>-- UTF-8
>>
>> I can't even work out what you're trying to say here.
> 
> I can tell, yet that doesn't prevent you from dismissing what I'm
> saying.

How am I dismissing it? I didn't reply to it except to say I don't understand
it! To me, it looks like gibberish, not even wrong, but rather than say so I
thought I'd give you the opportunity to explain what you meant.

As the person attempting to communicate, any failure to do so is *your*
responsibility, not that of the reader. If you are discussing this in good
faith, rather than as a cheap points-scoring exercise, then please try to
explain what you mean.


>>> Why is one encoding format better than the other?
>>
>> It depends on what you're trying to do.
>>
>> If you want to minimize storage and transmission costs, and don't care
>> about random access into the string, then UTF-8 is likely the best
>> encoding, since it uses as little as one byte per code point, and in
>> practice with real-world text (at least for Europeans) it is rarely
>> more expensive than the alternatives.
> 
> Python3's strings don't give me any better random access than UTF-8.

Say what? Of course they do.

Python 3 strings (since 3.3) are a compact form of UTF-32. Without loss of
generality, we can say that each string is an array of four-byte code units.

(In practice, depending on the string, Python may be able to compact that to
one- or two-byte code units.)
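
A quick way to see the compaction; the byte counts are CPython-version
dependent and shown only as illustration:

import sys

print(sys.getsizeof('a' * 1000))           # ~1049: 1 byte per code point
print(sys.getsizeof('\u03c0' * 1000))      # ~2074: 2 bytes per code point
print(sys.getsizeof('\U0001F600' * 1000))  # ~4076: 4 bytes per code point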

The critical thing is that slicing and indexing is a constant-time operation.
string[i] can just jump straight to offset i code-units into the array. If the
code-units are 4 bytes wide, that's just 4*i bytes.

UTF-8 is not: it is a variable-width encoding, so there's no way to tell how
many bytes it takes to get to string[i]. You have to start at the beginning of
the string and walk the bytes, counting code points, until you reach the i-th
code point.

It may be possible to swap memory for time by building an augmented data
structure that makes this easier. A naive example would be to have a separate
array giving the offsets of each code point. But then its not a string any
more, its a more complex data structure.

Go ignores this problem by simply not offering random access to code points in
strings. Go simply says that strings are bytes, and if string[i] jumps into the
middle of a character (code point), oh well, too bad, so sad.

On the other hand, Go also offers a second solution to the problem. Its
essentially the same solution that Python offers: a dedicated fixed-width,
32-bit (four byte) Unicode text string type which they call "runes".


> Storage and transmission costs are not an issue.

I was giving a generic answer to a generic question. You asked a general
question, "Why is one encoding format better than the other?" and the general
answer to that is *it depends on what you are trying to do*.


> It's only that storage and transmission are still defined in terms of bytes.

Again, I don't see what point you think you are making here. Ultimately, all our
data structures have to be implemented in memory which is addressable in bytes.
*All of them* -- objects, linked lists, floats, BigInts, associative arrays,
red-black trees, the lot.

All of those data structures are presented to the programmer in terms of higher
level abstractions. You seem to think that text strings alone don't need that
higher level abstraction, and that the programmer ought to think about text in
terms of bytes. Why?

You entered this discussion with a reasonable position: the text primitives
offered to programmers fall short of what we'd like, which is to deal with
language in terms of language units: characters specifically. (Let's assume we
can decide what a character actually is.) I agree! If Python's text strings are
supposed to be an abstraction for "strings of characters", it's a leaky
abstraction. It's actually "strings of code points".

Some people might have said:

"Since Python strings fall short of the abstraction we would like, we should
build a better abstraction on top of it, using Unicode primitives, that deals
with characters (once we decide what they are)."

which is where I thought you were going with this. But instead, you've suggested
that the solution to the problem:

"Python strings don't come close enough to matching the programmer's
expectations about characters"

is to move *further away* from the programmer's expectations about characters
and to have them reason about UTF-8 encoded bytes instead.

And then to insult our intelligence even further, after raising the in-memory
representation (UTF-8 versus some other encoding) to prominence, you then
repeatedly said that the in-memory representation doesn't matter!

If it doesn't matter, why do you care whether strings use UTF-8 or UTF-32 or
something else?



> Py

Re: Grapheme clusters, a.k.a. real characters

2017-07-14 Thread Terry Reedy

On 7/14/2017 9:20 PM, Steve D'Aprano wrote:
> On Sat, 15 Jul 2017 07:12 am, Terry Reedy wrote:
>> Does Go use bytes for text, like most people did in Python 2, or a
>> separate text string class that hides the internal encoding format and
>> implementation?  In other words, if you do the equivalent of print(s)
>> where s is a text string with a mixture of greek, cyrillic, hindi,
>> chinese, japanese, and korean chars, do you see the characters, or some
>> representation of the internal bytes?
>
> The answer is, it's complicated.
>
> Go has two string types: "strings", and "runes".
>
> Strings are equivalent to Python 3 byte-strings, except that the language is
> biased towards assuming they are UTF-8 instead of Python 3's decision to assume
> they are ASCII. In other words, if you display a Python 3 byte-string, it will
> display bytes that represent ASCII characters as ASCII, and everything else
> escaped as a hex byte:
>
> py> b'\x41\xcf\x80\x5a'
> b'A\xcf\x80Z'
>
> Go does the same, except it will display anything which is legal UTF-8 (which
> may be 1, 2, 3, or 4 bytes) as a Unicode character (actually code point).
> Assuming your environment is capable of displaying that character, otherwise
> you'll just see a square, or some other artifact.
>
> So if Python used the same rules as Go, the above byte-string would display as:
>
> b'AπZ'
>
> Most of the time, when processing strings, Go treats them as arbitrary bytes,
> although Go comes with libraries that help make it easier to work with them as
> UTF-8 byte strings.
>
> Runes, on the other hand, are a strict superset of Unicode. Runes are strings
> of 32-bit code units, so like UTF-32 except not limited to the Unicode range
> of \U00000000 through \U0010FFFF. Runes will accept any 32 bit values up to
> 0xFFFFFFFF.
>
> I presume that runes which fall within the UTF-32 range will be displayed as
> the Unicode character where possible, and those which fall outside of that
> range as some sort of hex display.
>
> So Go strings are like Python byte strings, biased towards UTF-8 but with no
> guarantees made, and Go runes are a superset of Python text strings.
>
> Does that answer your question sufficiently?
>
> https://blog.golang.org/strings


Yes, thank you.


--
Terry Jan Reedy


--
https://mail.python.org/mailman/listinfo/python-list