date:20170716

Re: Grapheme clusters, a.k.a.real characters

2017-07-16 Thread Rustom Mody

The first book I studied as a CS-student was Structured Computer Organization 
by Tanenbaum

Apart from the detailed description of various machines like PDP-11, IBM-360
etc it suggested the understanding of the computer at 4 levels:
- Microprogramming level
- "Conventional" machine level (nowadays called ISA)
- OS level -- where system calls become new "instructions"
- HLL level of languages (like PL-1 !)

[The next edition would add the digital abstraction level below the
microprogramming level]

For me as for many in my generation this book and this leveled view
was an important component in my understanding of CS

A few years later I studied a course on something called "networks and 
networking"
Again it talked of some 7 (OSI) layers
But it didn't make much sense to someone whose only idea of a network was the
wire that connected the terminal to the (pretending) mainframe

In a subsequent edition of Networking, I found that Tanenbaum had castigated
the 7 OSI layers as useless and unnecessary with the 3 TCP layers being
more realistic

Still further(?) editions, he would introduce 5 layers as a hybrid between the
international but failed OSI standard and the ubiquitous but incomplete TCP
standard

Why am I saying all this?

A layered understanding is the bedrock of our field
Except that sometimes it works
And sometimes it doesn't

The 3 layers here are
- UTF-8 layer
- Unicode codepoint layer
- Linguistically useful (grapheme) layer

Marko's statement that UTF-8 is random access is so obviously wrong that
(my guess) he does not mean it literally but elliptically, as saying:
"This excessive layering is not working"
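To make the gap between the UTF-8 layer and the code point layer concrete, here is a minimal sketch (the helper name nth_code_point is made up for illustration). It shows that code points occupy one to four bytes in UTF-8, so finding the n-th code point means scanning from the start rather than jumping to a byte offset.

```python
# Sketch: byte offsets in UTF-8 do not give random access to code points.
text = "a\u00e1\u20ac\U0001F600"  # code points that need 1, 2, 3 and 4 bytes

widths = [len(ch.encode("utf-8")) for ch in text]
print(widths)


def nth_code_point(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-8 `data` by scanning from the start."""
    index = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a new code point starts
            index += 1
            if index == n:
                start = pos
            elif index == n + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")


print(nth_code_point(text.encode("utf-8"), 2))
```

The scan is linear in the length of the data, which is the practical meaning of "not random access" at this layer.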

OTOH statements like "level 2 is 90% good enough for level 3"
are in the same ludicrous class as "The world is as wide as the Atlantic ocean"
As pointed out above, agglutinating letters is the norm, not the exception, in
the world's languages, up to and including (Latin in) English

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: is @ operator popular now?

2017-07-16 Thread Terry Reedy


On 7/15/2017 7:35 AM, oyster wrote:

> as the title says. has @ been used in projects?


@ was added as an operator for the benefit of numpy, which is a huge
project.  I am pretty sure that it is used there, but you can ask on
some numpy list.
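For readers who have not met the operator: `a @ b` calls `a.__matmul__(b)`. The sketch below uses a toy Matrix class invented for illustration; it is not how numpy implements it, only a demonstration of the hook that numpy arrays provide for real.

```python
class Matrix:
    """Toy matrix type (hypothetical) that supports the @ operator."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def __matmul__(self, other):
        # Plain row-by-column product; no shape checking in this sketch.
        cols = list(zip(*other.rows))
        return Matrix(
            [[sum(a * b for a, b in zip(row, col)) for col in cols]
             for row in self.rows]
        )

    def __eq__(self, other):
        return isinstance(other, Matrix) and self.rows == other.rows


a = Matrix([[1, 2], [3, 4]])
swap = Matrix([[0, 1], [1, 0]])
product = a @ swap  # swaps the two columns of a
print(product.rows)
```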



--
Terry Jan Reedy


Re: Grapheme clusters, a.k.a.real characters

2017-07-16 Thread Marko Rauhamaa

Mikhail V :

> On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
>> Random access to code points is as uninteresting as random access to
>> UTF-8 bytes. I might want random access to the "Grapheme clusters,
>> a.k.a.real characters".
>
> What _real_ characters are you referring to?
> If your data has "á" (U00E1), then it is one real character,
> if you have "a" (U0061) and "ˊ" (U02CA) then it is _two_
> real characters. So in both cases you have access to code points =
> real characters.

It's true that confusion is caused by the ambiguity of the term
"character."

> For metaphysical discussion - in _my_ definition there is no such
> "real" character as "á", since it is the "a" glyph with some dirt, so
> according to my definition, it should be two separate characters, both
> semantically and technically seen.

Here's the problem: when the human user types in "á" (with one, two or
three keyclicks), they don't know how the computer represents it
internally. The Unicode standard allows for two *equivalent* code point
sequences (<https://en.wikipedia.org/wiki/Unicode_equivalence>).
When the computer outputs the sequence, the visible result is the single
letter "á". The human user doesn't know—or care—about the internal
representation.

The user's expectation is that the visible letter "á" should behave like
any other single letter. For example, a text editor should move the
cursor past it with a single click of a left or right arrow key. Also,
if I perform a regular-expression search in the editor and look for

   Alv[aá]rez

I should get a match with either Alvarez or Alvárez.
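A small sketch of the equivalence issue, using only the standard unicodedata and re modules. The helper name matches is invented here. The point is that the pattern only matches the decomposed spelling once the input has been normalized (NFC in this sketch); without that step the combining accent gets in the way.

```python
import re
import unicodedata

composed = "Alv\u00e1rez"     # "á" as the single code point U+00E1
decomposed = "Alva\u0301rez"  # "a" followed by U+0301 (combining acute)

pattern = re.compile("Alv[a\u00e1]rez")


def matches(name):
    """Match after normalizing to NFC, so both spellings look the same."""
    return pattern.fullmatch(unicodedata.normalize("NFC", name)) is not None


print(composed == decomposed)                # the code point sequences differ
print(pattern.fullmatch(decomposed))         # no match without normalization
print(matches(composed), matches(decomposed), matches("Alvarez"))
```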

> And, in my definition, the whole Unicode is a huge junkyard, to start
> with.

I don't think anybody denies that. However, it's the best thing
available and—more importantly—a universally accepted standard.

> But opinions may vary, and in case you prefer or forced to write "á",
> then it can be impractical to store it as two characters, regardless
> of encoding.

Now I'm not following you.

Marko

Connecting Google News

2017-07-16 Thread Javier Bezos

Google News used to fail with the high level functions provided by httplib 
and the like. However, I found this piece of code somewhere:


import httplib  # Python 2; on Python 3 this module is http.client

def gopen():
    http = httplib.HTTPSConnection('news.google.com')
    http.request(
        "GET", "/news?ned=es_MX",
        headers={
            "User-Agent": "Mozilla/5.0 (X11; U; Linux i686; es-MX) "
                          "AppleWebKit/532.8 (KHTML, like Gecko) "
                          "Chrome/4.0.277.0 Safari/532.8",
            "Host": 'news.google.com',
            "Accept": "*/*"})
    return http.getresponse()

A few days ago, Google News was revamped and it doesn't work any more
(2.6/Win7, 2.7/OSX and, with minimal changes, 3.6/Win7), because the page
content is empty. The code itself doesn't raise any errors. What is the
proper way to do it now? I must stick to the standard libraries.


The returned headers are:

--
[('Content-Type', 'application/binary'),
 ('Cache-Control', 'no-cache, no-store, max-age=0, must-revalidate'),
 ('Pragma', 'no-cache'),
 ('Expires', 'Mon, 01 Jan 1990 00:00:00 GMT'),
 ('Date', 'Thu, 13 Jul 2017 16:37:48 GMT'),
 ('Location', 'https://news.google.com/news/?ned=es_mx&hl=es'),
 ('Strict-Transport-Security', 'max-age=10886400'),
 ('P3P',
  'CP="This is not a P3P policy! See '
  'https://support.google.com/accounts/answer/151657?hl=en for more info."'),
 ('Server', 'ESF'),
 ('Content-Length', '0'),
 ('X-XSS-Protection', '1; mode=block'),
 ('X-Frame-Options', 'SAMEORIGIN'),
 ('X-Content-Type-Options', 'nosniff'),
 ('Set-Cookie',
  'NID=107=qwH7N2hB12zVGfFzrAC2CZZNhrnNAVLEmTvDvuSzzw6mSlta9D2RDZVP9t5gEcq_WJjZQjDSWklJ7LElSnAZnHsiF4CXOwvGDs2tjrXfP41LE-6LafdA86GO3sWYnfWs;Domain=.google.com;Path=/;Expires=Fri, '
  '12-Jan-2018 16:37:48 GMT;HttpOnly'),
 ('Alt-Svc', 'quic=":443"; ma=2592000; v="39,38,37,36,35"')]
---

`read()` is empty string ('' or b''). `status` is 302. `reason` is `Found`.

Javier

Re: Connecting Google News

2017-07-16 Thread Peter Otten

Javier Bezos wrote:

> Google News used to fail with the high level functions provided by httplib
> and the like. However, I found this piece of code somewhere:
> 
>  def gopen():
>    http = httplib.HTTPSConnection('news.google.com')
>    http.request("GET","/news?ned=es_MX" ,

When you change that to

 http.request("GET","/news/headlines?ned=es_mx&hl=es" ,

you get a non-empty return. Most of the actual content seems to be buried in 
javascript though.

>      headers =
>        {"User-Agent":"Mozilla/5.0 (X11; U; Linux i686; es-MX)
> AppleWebKit/532.8 (KHTML, like Gecko) Chrome/4.0.277.0 Safari/532.8",
>         "Host":'news.google.com',
>         "Accept": "*/*"})
>    return http.getresponse()
> 
> A few days ago, Google News has been revamped and it doesn't work any more
> (2.6/Win7, 2.7/OSX and, with minimal changes, 3.6/Win7), because the page
> contents is empty. The code itself doesn't raise any errors. Which is the
> proper way to do it now? I must stick to the standard libraries.



Re: Connecting Google News

2017-07-16 Thread Chris Warrick

On 16 July 2017 at 11:26, Javier Bezos  wrote:
> Google News used to fail with the high level functions provided by httplib
> and the like. However, I found this piece of code somewhere:
>
> def gopen():
>   http = httplib.HTTPSConnection('news.google.com')
>   http.request("GET","/news?ned=es_MX" ,
> headers =
>{"User-Agent":"Mozilla/5.0 (X11; U; Linux i686; es-MX)
> AppleWebKit/532.8 (KHTML, like Gecko) Chrome/4.0.277.0 Safari/532.8",
>"Host":'news.google.com',
>"Accept": "*/*"})
>   return http.getresponse()
>
> A few days ago, Google News has been revamped and it doesn't work any more
> (2.6/Win7, 2.7/OSX and, with minimal changes, 3.6/Win7), because the page
> contents is empty. The code itself doesn't raise any errors. Which is the
> proper way to do it now? I must stick to the standard libraries.

Why? The Python standard library doesn’t have anything good for HTTP.
* httplib is fairly low-level, and it does not support something as
basic as redirects;
* urllib.request (urllib2 in Python 2) is slightly better;
* but even the official docs for both redirect to requests:
http://docs.python-requests.org/en/master/ for a high level interface.

(Also, please upgrade your Windows box to run Python 2.7.)

> The returned headers are:
>
> --
> [('Content-Type', 'application/binary'),
>  ('Cache-Control', 'no-cache, no-store, max-age=0, must-revalidate'),
>  ('Pragma', 'no-cache'),
>  ('Expires', 'Mon, 01 Jan 1990 00:00:00 GMT'),
>  ('Date', 'Thu, 13 Jul 2017 16:37:48 GMT'),
>  ('Location', 'https://news.google.com/news/?ned=es_mx&hl=es'),
>  ('Strict-Transport-Security', 'max-age=10886400'),
>  ('P3P',
>   'CP="This is not a P3P policy! See '
>   'https://support.google.com/accounts/answer/151657?hl=en for more info."'),
>  ('Server', 'ESF'),
>  ('Content-Length', '0'),
>  ('X-XSS-Protection', '1; mode=block'),
>  ('X-Frame-Options', 'SAMEORIGIN'),
>  ('X-Content-Type-Options', 'nosniff'),
>  ('Set-Cookie',
>   'NID=107=qwH7N2hB12zVGfFzrAC2CZZNhrnNAVLEmTvDvuSzzw6mSlta9D2RDZVP9t5gEcq_WJjZQjDSWklJ7LElSnAZnHsiF4CXOwvGDs2tjrXfP41LE-6LafdA86GO3sWYnfWs;Domain=.google.com;Path=/;Expires=Fri, '
>   '12-Jan-2018 16:37:48 GMT;HttpOnly'),
>  ('Alt-Svc', 'quic=":443"; ma=2592000; v="39,38,37,36,35"')]
> ---
>
> `read()` is empty string ('' or b''). `status` is 302. `reason` is `Found`.

https://en.wikipedia.org/wiki/HTTP_302

See that Location header? The web server wants to redirect you
somewhere. Your low-level HTTP library does not handle redirects
automatically, so you’d need to take care of that yourself.
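Handling the redirect by hand can be sketched roughly as follows. This is an illustration under assumptions, not a drop-in fix: follow_redirects and fake_fetch are hypothetical names, and the fetch callable stands in for whatever performs the real request (for instance a wrapper around httplib / http.client) so that the sketch runs without network access. The two URLs are the ones from the headers quoted above. Note that urllib.request.urlopen in the standard library follows redirects on its own.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(fetch, url, max_hops=5):
    """Call fetch(url) -> (status, headers, body) until a non-redirect reply.

    `fetch` is a hypothetical callable supplied by the caller.
    """
    for _ in range(max_hops + 1):
        status, headers, body = fetch(url)
        if status not in REDIRECT_CODES:
            return url, status, body
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, headers["Location"])
    raise RuntimeError("too many redirects")


def fake_fetch(url):
    """Stand-in for a real request; mimics the 302 seen in the thread."""
    if url == "https://news.google.com/news?ned=es_MX":
        return 302, {"Location": "https://news.google.com/news/?ned=es_mx&hl=es"}, b""
    return 200, {}, b"<html>placeholder</html>"


result = follow_redirects(fake_fetch, "https://news.google.com/news?ned=es_MX")
print(result)
```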

-- 
Chris Warrick 
PGP: 5EAAEA16

Difference in behavior of GenericMeta between 3.6.0 and 3.6.1

2017-07-16 Thread Oren Ben-Kiki

TL;DR: We need improved documentation of the way meta-classes behave for
generic classes, and possibly reconsider the way "__setattr__" and
"__getattribute__" behave for such classes.

I am using meta-programming pretty heavily in one of my projects.
It took me a while to figure out the dance between meta-classes and generic
classes in Python 3.6.0.

I couldn't find good documentation for any of this (if anyone has a good
link, please share...), but with a liberal use of "print" I managed to
reverse engineer how this works. The behavior isn't intuitive but I can
understand the motivation (basically, "type annotations shall not change
the behavior of the program").

For the uninitiated:

* It turns out that there are two kinds of instances of generic classes:
the "unspecialized" class (basically ignoring type parameters), and
"specialized" classes (created when you write "Foo[Bar]", which know the
type parameters, "Bar" in this case).

* This means the meta-class "__new__" method is called sometimes to create
the unspecialized class, and sometimes to create a specialized one - in the
latter case, it is called with different arguments...

* No object is actually an instance of the specialized class; that is, the
"__class__" of an instance of "Foo[Bar]" is actually the unspecialized
"Foo" (which means you can't get the type parameters by looking at an
instance of a generic class).

So far, so good, sort of. I implemented my meta-classes to detect whether
they are creating a "specialized" or "unspecialized" class and behave
accordingly.

However, these meta-classes stopped working when switching to Python 3.6.1.
The reason is that in Python 3.6.1, a "__setattr__" implementation was
added to "GenericMeta", which redirects the setting of an attribute of a
specialized class instance to set the attribute of the unspecialized class
instance instead.

This causes code such as the following (inside the meta-class) to behave in
a mighty confusing way:

    if is_not_specialized:  # pseudo-condition
        cls._my_attribute = False
    else:  # Is specialized:
        cls._my_attribute = True
        assert cls._my_attribute  # Fails!

As you can imagine, this caused us some wailing and gnashing of teeth,
until we figured out (1) that this was the problem and (2) why it was
happening.
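The effect can be reproduced without the typing module. The sketch below is a toy model, not the code in typing.py: ForwardingMeta, the _origin attribute and the _keep_ prefix are all invented for illustration (the prefix plays the role that "_abc_" plays in the real implementation described below). Writes on the "specialized" class are forwarded to its origin, while reads still find the specialized class's own copy first.

```python
class ForwardingMeta(type):
    """Toy metaclass: writes on a 'specialized' class go to its origin."""

    def __setattr__(cls, name, value):
        origin = cls.__dict__.get("_origin")
        if origin is not None and not name.startswith("_keep_"):
            setattr(origin, name, value)          # forwarded to the origin
        else:
            super().__setattr__(name, value)      # ordinary per-class write


# "Unspecialized" class, and a "specialized" one holding its own copy of
# the attribute, as if the namespace had been copied at creation time.
Base = ForwardingMeta("Base", (), {"_my_attribute": False})
Special = ForwardingMeta(
    "Special", (Base,), {"_origin": Base, "_my_attribute": False}
)

Special._my_attribute = True        # lands on Base, not on Special
print(Base._my_attribute)           # the forwarded value
print(Special._my_attribute)        # the stale own copy

Special._keep_flag = True           # exempt prefix: stays on Special
```

Reading back through Special returns the stale value, which is why the assert in the snippet above fails.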

Looking into the source code in "typing.py", I see that I am not the only
one who had this problem. Specifically, the implementers of the "abc"
module had the exact same problem. Their solution was simple: the
"GenericMeta.__setattr__" code explicitly tests whether the attribute name
starts with "_abc_", in which case it maintains the old behavior.

Obviously, I should not patch the standard library typing.py to preserve
"_my_attribute". My current workaround is to derive from GenericMeta,
define my own "__setattr__", which preserves the old behavior for
"_my_attribute", and use that instead of the standard GenericMeta
everywhere.

My code now works in both 3.6.0 and 3.6.1. However, I think the following
points are worth fixing and/or discussion:

* This is a breaking change, but it isn't listed in
https://www.python.org/downloads/release/python-361/ - it should probably
be listed there.

* In general it would be good to have some documentation on the way that
meta-classes and generic classes interact with each other, as part of the
standard library documentation (apologies if it is there and I missed it...
link?)

* I'm not convinced the new behavior is a better default. I don't recall
seeing a discussion about making this change, possibly I missed it (link?)

* There is a legitimate need for the old behavior (normal per-instance
attributes). For example, it is needed by the "abc" module (as well as my
project). So, some mechanism should be recommended (in the documentation)
for people who need the old behavior.

* Separating between "really per instance" attributes and "forwarded to the
unspecialized instance" attributes based on their prefix seems to violate
"explicit is better than implicit". For example, it would have been
explicit to say "cls.__unspecialized__.attribute" (other explicit
mechanisms are possible).

* Perhaps the whole notion of specialized vs. unspecialized class instances
needs to be made more explicit in the GenericMeta API...

* Finally and IMVHO most importantly, it is *very* confusing to override
"__setattr__" and not override "__getattribute__" to match. This gives rise
to code like "cls._foo = True; assert cls._foo" failing. This feels
wrong. And presumably fixing the implementation so that
"__getattribute__" forwards the same set of attributes to the
"unspecialized" instance wouldn't break any code... Other than code that
is already broken due to the new functionality, that is.

Re: Connecting Google News

2017-07-16 Thread Javier Bezos


Chris,


> (Also, please upgrade your Windows box to run Python 2.7.)


It's not /my/ Windows box. I'm allowed to run my script, that's
all. My Windows box is actually that with 3.6.


>>   http = httplib.HTTPSConnection('news.google.com')
>>   http.request("GET","/news?ned=es_MX" ,

>>   ('Location', 'https://news.google.com/news/?ned=es_mx&hl=es'),

...

> See that Location header? The web server wants to redirect you
> somewhere. Your low-level HTTP library does not handle redirects
> automatically, so you’d need to take care of that yourself.


I didn't notice the slash just before ?ned! I don't know how many
times I've compared the URLs without realizing it was added. Silly
me!

Thank you
Javier



Re: Connecting Google News

2017-07-16 Thread Javier Bezos


Peter,


>   http.request("GET","/news/headlines?ned=es_mx&hl=es" ,


Thank you. It works, too.

Javier


Re: Connecting Google News

2017-07-16 Thread Skip Montanaro

Peter> Most of the actual content seems to be buried in javascript though.

Peeking at it, almost all of the useful content appears to be data. It
doesn't seem like snipping it out and interpreting it as JSON would be
terribly difficult. Perhaps no JS engine required.
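As a rough illustration of that idea, the standard json module can parse a JSON value that starts at a given offset and ignore whatever follows it. The page string below is invented for the sketch; the real Google News markup is not shown in this thread and will differ, and first_json_object is a hypothetical helper name.

```python
import json

# Made-up stand-in for a page that embeds its data in a script tag.
page = (
    '<script>var data = {"headlines": [{"title": "Ejemplo"}]}; '
    'render(data);</script>'
)


def first_json_object(text):
    """Decode the first {...} value found in text, ignoring what follows."""
    start = text.index("{")
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


data = first_json_object(page)
print(data["headlines"][0]["title"])
```

Locating the right starting offset in a real page is the fragile part; raw_decode only helps once the start of the JSON value is known.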

Skip

Re: Difference in behavior of GenericMeta between 3.6.0 and 3.6.1

2017-07-16 Thread Peter Otten

Oren Ben-Kiki wrote:

> TL;DR: We need improved documentation of the way meta-classes behave for
> generic classes, and possibly reconsider the way "__setattr__" and
> "__getattribute__" behave for such classes.

The typing module is marked as "provisional", so you probably have to live 
with the incompatibilities.

As to your other suggestions/questions, I'm not sure where the actual
discussion is taking place -- roughly since the migration to github,
python-dev and bugs.python.org are no longer very useful for outsiders to
learn what's going on.

A random walk over the github site found

https://github.com/python/typing/issues/392

Maybe you can make sense of that? 

Personally, I'm not familiar with the evolving type system and still 
wondering whether I should neglect or reject...


Re: Difference in behavior of GenericMeta between 3.6.0 and 3.6.1

2017-07-16 Thread Oren Ben-Kiki

Yes, it sort-of makes sense... I'll basically re-post my question there.

Thanks for the link!

Oren.


On Sun, Jul 16, 2017 at 4:29 PM, Peter Otten <__pete...@web.de> wrote:

> Oren Ben-Kiki wrote:
>
> > TL;DR: We need improved documentation of the way meta-classes behave for
> > generic classes, and possibly reconsider the way "__setattr__" and
> > "__getattribute__" behave for such classes.
>
> The typing module is marked as "provisional", so you probably have to live
> with the incompatibilities.
>
> As to your other suggestions/questions, I'm not sure where the actual
> discussion is taking place -- roughly since the migration to github,
> python-dev and bugs.python.org are no longer very useful for outsiders to
> learn what's going on.
>
> A random walk over the github site found
>
> https://github.com/python/typing/issues/392
>
> Maybe you can make sense of that?
>
> Personally, I'm not familiar with the evolving type system and still
> wondering whether I should neglect or reject...
>

Re: Grapheme clusters, a.k.a.real characters

2017-07-16 Thread Rick Johnson

On Sunday, July 16, 2017 at 2:55:57 AM UTC-5, Marko Rauhamaa wrote:
> Mikhail V :
> > On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
> > >
> > > Random access to code points is as uninteresting as
> > > random access to UTF-8 bytes. I might want random access
> > > to the "Grapheme clusters, a.k.a.real characters".
> >
> > What _real_ characters are you referring to? If your data
> > has "á" (U00E1), then it is one real character, if you
> > have "a" (U0061) and "ˊ" (U02CA) then it is _two_ real
> > characters. So in both cases you have access to code
> > points = real characters.
> 
> It's true that confusion is caused by the ambiguity of the
> term "character."
> 
> > For metaphysical discussion - in _my_ definition there is
> > no such "real" character as "á", since it is the "a" glyph
> > with some dirt, so according to my definition, it should
> > be two separate characters, both semantically and
> > technically seen.
> 
> Here's the problem: when the human user types in "á" (with
> one, two or three keyclicks), they don't know how the
> computer represents it internally. The Unicode standard
> allows for two *equivalent* code point sequences (<https://en.wikipedia.org/wiki/Unicode_equivalence>). When
> the computer outputs the sequence, the visible result is
> the single letter "á". The human user doesn't know—or
> care—about the internal representation.

*EXACTLY*. But your statement is far too general. Not only
need not the _human_user_ be concerned with these low level
aspects of strings, but the _programmer_ need not be concerned
either. The programmer should only see strings from a
practical standpoint:

"Can i index the chars within them?"

"Can i determine the length of them?"

"Can i slice and dice and combine them?"

"Can i trust that the character positions will maintain
order?"

"Can i, and my target users, display them in a human
readable form using various rendering specifications defined
by graphic designers (aka: font-o-philes)?"

If the answer to all of these questions is *YES*, then you
know all you need to know about strings. Now get to work!!!

> The user's expectation is that the visible letter "á"
> should behave like any other single letter. For example, a
> text editor should move the cursor past it with a single
> click of a left or right arrow key. Also, if I perform a
> regular-expression search in the editor and look for
> 
>    Alv[aá]rez
> 
> I should get a match with either Alvarez or Alvárez.

While what you say is relevant to _text_editors_ and
substring searching tools, you have wandered beyond the topic
we are discussing here, which is practical interfacing
between a programmer and his/her strings. How a text editor
handles strings is irrelevant to a programmer. Unless of
course we are writing custom text editor software
ourselves. In which case we can be the BDFL for a day, or
two. *wink*

> > And, in my definition, the whole Unicode is a huge
> > junkyard, to start with.
> 
> I don't think anybody denies that. However, it's the best
> thing available and—more importantly—a universally accepted
> standard.
> 
> > But opinions may vary, and in case you prefer or forced to
> > write "á", then it can be impractical to store it as two
> > characters, regardless of encoding.
> 
> Now I'm not following you.

Mikhail is referring to the claims made earlier in this
thread that accents are themselves distinct characters.
Which i think is utter hooey. For instance, some folks here
would wish for len("á") to return 2. Does that seem
reasonable?


Re: Grapheme clusters, a.k.a.real characters

2017-07-16 Thread Rustom Mody

On Sunday, July 16, 2017 at 8:10:41 PM UTC+5:30, Rick Johnson wrote:
> [...]
> Mikhail is referring to the claims made earlier in this
> thread that accents are themselves distinct characters.
> Which i think is utter hooey. For instance, some folks here
> would wish for len("á") to return 2. Does that seem
> reasonable?

$ python
Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 12:22:00) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> len("á")   # presumably precomposed: U+00E1
1
>>> len("á")   # presumably decomposed: "a" followed by U+0301
2

Shall we stipulate it to be 1.5? [¿ Maybe 1½ ?]
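The two results come from len() counting code points. A crude way to get closer to "what the user sees" is to skip combining marks when counting, as in the sketch below. This is an approximation only, not the grapheme cluster segmentation the Unicode standard defines, and it will miscount things like emoji sequences; the function name approx_grapheme_len is made up for the example.

```python
import unicodedata

composed = "\u00e1"      # á as one code point
decomposed = "a\u0301"   # a + combining acute accent


def approx_grapheme_len(s):
    """Count code points that are not combining marks (rough approximation)."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


print(len(composed), len(decomposed))
print(approx_grapheme_len(composed), approx_grapheme_len(decomposed))
print(len(unicodedata.normalize("NFC", decomposed)))
```

For this particular letter, NFC normalization also brings the length back to 1, but that only works where a precomposed code point exists.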

Re: Grapheme clusters, a.k.a.real characters

2017-07-16 Thread Rick Johnson

On Sunday, July 16, 2017 at 10:41:02 AM UTC-5, Rustom Mody wrote:
> On Sunday, July 16, 2017 at 8:10:41 PM UTC+5:30, Rick Johnson wrote:
> > On Sunday, July 16, 2017 at 2:55:57 AM UTC-5, Marko Rauhamaa wrote:
> > > Mikhail V :
> > > > On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:

[...]

> > Mikhail is referring to the claims made earlier in this
> > thread that accents are themselves distinct characters.
> > Which i think is utter hooey. For instance, some folks
> > here would wish for len("á") to return 2. Does that seem
> > reasonable?
> 
> $ python
> Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 12:22:00) 
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> len("á")
> 1
> >>> len("á")
> 2
> 
> Shall we stipulate it to be 1.5? [¿ Maybe 1½ ?]

Well, heck. If we are to wade into the fractional weeds as it
relates to "character decorations" (aka: accents), we should
at least be realistic about it. For instance, the bounding
box of that *AHEM* "speck of dirt" (aka: accent) above the
"a" is hardly half the size of the bounding box that
contains the "a" itself. If i were to guess, i would say
something around 0.1-ish of a "real character". So if we are
to accept your implementation, `len("á")` would return ~1.1.

Re: Grapheme clusters, a.k.a.real characters

2017-07-16 Thread Ben Finney

Steven D'Aprano  writes:

> On Sun, 16 Jul 2017 12:33:10 +1000, Ben Finney wrote:
>
> > And yet the ASCII and Unicode standard says code point 0x0A (U+000A
> > LINE FEED) is a character, by definition.
> [...]
> > > Is an acute accent a character?
> > 
> > Yes, according to Unicode. ‘´’ (U+0301 ACUTE ACCENT) is a character.
>
> Do you have references for those claims?

The Unicode Standard <http://www.unicode.org/versions/Unicode10.0.0/>
frequently uses “character” as the unit of semantic value that Unicode
deals in. See the “Contents” table for many references.

In §2.2 under the sub-heading “Characters, Not Glyphs” it defines the
term, and thereafter uses “character” in a way that includes all such
units, even formatting codes.

See §2.11 “Combining Characters” for a definition that includes accent
characters like U+0301:

Combining Characters. Characters intended to be positioned relative
to an associated base character are depicted in the character code
charts above, below, or through a dotted circle.

The standard even uses the term “format characters” to refer to code
points with a functional purpose and no glyph representation, such as
U+000A LINE FEED.
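The properties under discussion can be inspected with Python's unicodedata module. The sketch below simply prints what the character database bundled with the interpreter reports for a few of the code points mentioned in this thread. For what it is worth, that database names U+0301 COMBINING ACUTE ACCENT (plain ACUTE ACCENT is U+00B4), and it files LINE FEED under the general category Cc (control) rather than Cf (format).

```python
import unicodedata

# Code points mentioned in the thread: combining acute, spacing acute,
# modifier letter acute, and line feed.
samples = ("\u0301", "\u00b4", "\u02ca", "\n")

info = {}
for ch in samples:
    info[ch] = (
        unicodedata.name(ch, "<unnamed>"),  # control characters have no name
        unicodedata.category(ch),           # general category, e.g. Mn, Cc
        unicodedata.combining(ch),          # canonical combining class
    )
    print("U+%04X" % ord(ch), *info[ch])
```

Only the first of the four has a non-zero combining class, which is what makes it attach to a preceding base character.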

> Because I'm pretty sure that Unicode is very, very careful to never
> use the word "character" in a formal or normative manner, only as an
> informal term for "the kinds of things that regular folk consider
> letters or characters or similar".

I don't know whether you consider the Core Specification document to be
speaking in “formal or normative manner”. Either way that doesn't affect
my point that Unicode does define “character” and it includes all code
points in that definition.

If you're going to disqualify anything that isn't “formal and normative
manner” from what we're allowed to infer as the Unicode Standard telling
us is a character, then you're going to have to either disregard most of
the Core Specification document, or allow it as formal and/or normative.

> And I don't think regular folks would know what a line feed was if it
> jumped out of their computer and bit them :-)

Are we talking about definitions, or are we talking about what regular
folks would know?

Regular folks know that “fish” has meaning, but I wouldn't want to try
matching that regular-folk knowledge with a definition of what a “fish”
is and is not. Quite frequently, a definition useful for a formal
standard is *not* coterminous with what regular folk will think is in or
out of that definition.

-- 
 \   “I have said to you to speak the truth is a painful thing. To |
  `\  be forced to tell lies is much worse.” —Oscar Wilde, _De |
_o__) Profundis_, 1897 |
Ben Finney


Re: Grapheme clusters, a.k.a.real characters

2017-07-16 Thread Ben Finney

Ben Finney  writes:

> Steven D'Aprano  writes:
>
> > Do you have references for those claims?
>
> The Unicode Standard <http://www.unicode.org/versions/Unicode10.0.0/>
> frequently uses “character” as the unit of semantic value that Unicode
> deals in. See the “Contents” table for many references.

I omitted to say (though it becomes clearer later in my message) that
these references are all in the Core Specification document of the
Unicode Standard, version 10.0.0.

-- 
 \  “There's a certain part of the contented majority who love |
  `\anybody who is worth a billion dollars.” —John Kenneth |
_o__)Galbraith, 1992-05-23 |
Ben Finney


Re: Write this accumuator in a functional style

2017-07-16 Thread Pavol Lisy

On 7/14/17, Steve D'Aprano  wrote:
> On Fri, 14 Jul 2017 09:06 am, Ned Batchelder wrote:
>
>> Steve's summary is qualitatively right, but a little off on the
>> quantitative details.  Lists don't resize to 2*N, they resize to ~1.125*N:
>>
>> new_allocated = (size_t)newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6);
>>
>> (https://github.com/python/cpython/blob/master/Objects/listobject.c#L49-L58)
>
> Ah, thanks for the correction. I was going off vague memories of long-ago
> discussion (perhaps even as long ago as Python 1.5!) when Tim Peters (I
> think it was) described how list overallocation worked.

You could remember it from sets:

return set_table_resize(so, so->used>50000 ? so->used*2 : so->used*4);

(https://github.com/python/cpython/blob/master/Objects/setobject.c#L239)
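
To see what the list formula quoted above implies in practice, here is a minimal Python sketch that replays it for one-at-a-time appends. It only transcribes the quoted C expression; the function names are made up for illustration, and other CPython versions may well use a different formula.

```python
def new_allocated(newsize):
    """Transcription of the C expression quoted from listobject.c."""
    return newsize + (newsize >> 3) + (3 if newsize < 9 else 6)


def capacities(n_appends):
    """Replay n_appends single appends; return each distinct capacity reached."""
    allocated = 0
    seen = [allocated]
    for size in range(1, n_appends + 1):
        if size > allocated:  # no spare slot left, so the list has to grow
            allocated = new_allocated(size)
            seen.append(allocated)
    return seen


caps = capacities(80)
print(caps)  # [0, 4, 8, 16, 25, 35, 46, 58, 72, 88]
```

The ratio between successive capacities drifts down towards 1.125 as the list grows, which is the ~1.125*N figure; small lists grow proportionally faster because of the constant term.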

Grapheme clusters, a.k.a.real characters

2017-07-16 Thread Mikhail V

>> On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
>>> Random access to code points is as uninteresting as random access to
>>> UTF-8 bytes. I might want random access to the "Grapheme clusters,
>>> a.k.a.real characters".
>>
>> What _real_ characters are you referring to?
>> If your data has "á" (U+00E1), then it is one real character,
>> if you have "a" (U+0061) and "ˊ" (U+02CA) then it is _two_
>> real characters. So in both cases you have access to code points =
>> real characters.

>It's true that confusion is caused by the ambiguity of the term
>"character."

Yes, but you said "I might want random access to the 'Grapheme clusters,
a.k.a. real characters'", and I had the impression that you have some concrete
concept of grapheme clusters and some (generally useful) example of an
implementation.
Without concrete examples it is just juggling with terms.

>> But opinions may vary, and in case you prefer or forced to write "á",
>> then it can be impractical to store it as two characters, regardless
>> of encoding.

> Now I'm not following you.

For example, I want to type in Cyrillic " рекá " (with an acute accent to denote
the stress on the last vowel, say for a pronunciation tutorial).
The most frequent solution would be to just type á instead of a.
And it is indeed the most practical: if I use a modifier acute accent
character instead, then it will be hard to select/paste such text and it
will not render accurately.

The obvious consequences: á is not from the Cyrillic code range,
so e.g. it will break hyphenation rules, and it will look consistent only
if the Cyrillic font's "a" looks exactly the same as the Latin "a".
Not to mention that it is not always possible to find a glyph with the
'right kind of dirt around'.
For such cases, a technically better solution would be to use a separate
accent character to denote the stroke. In case of font issues it would
at least render as, say, an apostrophe.
Still, in practice, just typing "á" works better because editors and
even some professional DTP software cannot handle context-based glyph
rendering well.
In other words, I think the internal representation should use a
separate modifier character, even though it seems impractical from many
points of view. And it _is_ impractical if one has such
things as "á" as a frequent character in normal writing (though that
should not be the case for an adequate modern writing system).
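
The trade-off above can be checked with the standard unicodedata module. This is only a sketch: it uses the combining accent U+0301, which is my assumption about what a "separate accent character" would be in practice (the example earlier in the thread used the spacing modifier U+02CA), and the variable names are made up.

```python
import unicodedata

latin_composed = "\u00e1"            # á as one precomposed code point
latin_combining = "a\u0301"          # Latin a + COMBINING ACUTE ACCENT
cyrillic_combining = "\u0430\u0301"  # Cyrillic а + COMBINING ACUTE ACCENT

# NFC composes a base letter and a mark only when a precomposed
# code point exists for that pair.
latin_nfc = unicodedata.normalize("NFC", latin_combining)
cyrillic_nfc = unicodedata.normalize("NFC", cyrillic_combining)

print(len(latin_combining), len(latin_nfc))        # Latin pair collapses to one
print(len(cyrillic_combining), len(cyrillic_nfc))  # Cyrillic pair stays as two
print(latin_nfc == latin_composed)

# U+02CA is a spacing modifier letter, not a combining mark, so it does not
# attach to the preceding letter; U+0301 is the combining one.
print(unicodedata.name("\u02ca"), unicodedata.combining("\u02ca"))
print(unicodedata.name("\u0301"), unicodedata.combining("\u0301"))
```

So for the Cyrillic stressed vowel there is no single code point to fall back on: either the text carries a Latin á, or it carries two code points.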

Mikhail

Re: Grapheme clusters, a.k.a.real characters

2017-07-16 Thread Steve D'Aprano

On Mon, 17 Jul 2017 01:40 am, Rustom Mody wrote:

> On Sunday, July 16, 2017 at 8:10:41 PM UTC+5:30, Rick Johnson wrote:
[...] 
> $ python
> Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 12:22:00)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> len("á")
> 1
> >>> len("á")
> 2
> 
> Shall we stipulate it to be 1.5? [¿ Maybe 1½ ?]

Please don't feed the trolls. If you have to respond to Ranting Rick, at least
write something sensible that people following this thread might learn from,
instead of encouraging his nonsense.

I don't believe for a second you seriously would like len(some_string) to
return '1½', but just in case anyone is taking that proposal seriously, that
would break backwards compatibility. len() must return an int, not a float, a
complex number, or a string.

If you want to know the length of a string *in bytes*, you have to encode it to
bytes first, using some specific encoding, then call len() on those bytes.

If you want to know the length of a string *in code points*, then just call
len() on the string.

If you want to know the height or width of a string in pixels in some specific
font, see your GUI toolkit.

If you want to know the length of a string in "characters" (graphemes), well,
Python doesn't have a built-in function to do that, or a standard library
solution. Yet.
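
A minimal sketch of the first two cases, plus a deliberately crude stand-in for the last one. The helper approx_grapheme_count is hypothetical and is NOT real grapheme segmentation: it merely skips combining marks and ignores the actual cluster rules (Hangul jamo, emoji sequences and so on).

```python
import unicodedata


def approx_grapheme_count(text):
    """Count code points that are not combining marks.

    Only a rough stand-in for grapheme clusters, not the real algorithm.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


for label, s in [("precomposed", "\u00e1"), ("decomposed", "a\u0301")]:
    print(label,
          "code points:", len(s),
          "UTF-8 bytes:", len(s.encode("utf-8")),
          "UTF-16 bytes:", len(s.encode("utf-16-le")),
          "approx graphemes:", approx_grapheme_count(s))
```

If third-party code is acceptable, the regex module on PyPI has a \X pattern that matches extended grapheme clusters, which would be a better basis than the helper above.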

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


Re: Grapheme clusters, a.k.a.real characters

2017-07-16 Thread Rick Johnson

On Sunday, July 16, 2017 at 8:28:57 PM UTC-5, Steve D'Aprano wrote:
> On Mon, 17 Jul 2017 01:40 am, Rustom Mody wrote:
> 
> > On Sunday, July 16, 2017 at 8:10:41 PM UTC+5:30, Rick Johnson wrote:
> [...] 
> > $ python
> > Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 12:22:00)
> > [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>> len("á")
> > 1
> > >>> len("á")
> > 2
> > 
> > Shall we stipulate it to be 1.5? [¿ Maybe 1½ ?]
> 
> If you have to respond to Ranting Rick, at least write
> something sensible that people following this thread might
> learn from, instead of encouraging his nonsense.

Oh Steven. You couldn't win with that ridiculous Toupee
Fallacy (which i plucked from your highly specular crown,
BTW [1]) so now you resort to your old trusty and rusty
tactic of the ad hominem. @_@ Why are we not surprised?

BTW, i noticed your first name is missing the trailing "n"
character. What gives? I can only assume that in a
submissive gesture towards your buddy Chris, you replaced
the "n" with a zero-width space . No? Hmm. Or
perhaps you forgot your password, again, and had no other
choice but to create a new account?

Well don't be sad. :-'(

In fact, cheer up. :-)

Things could be worse, ya know. O;-)

PS: Now sod off to that gated community otherwise known as
Python-ideas, where you can hide behind your moderator's
coat-tails, and spend your time bikeshedding with the other
snobbish ilk who infest that group. These are free and open
forums here, and your fascist manipulations (no matter how
PC they may be) are not welcome here. Who the *HELL* do you
think you are, lecturing other people about who they may or
may _not_ communicate with?

[1] Yeah, i saw that "bad toupee" a mile away!

Grapheme clusters, a.k.a.real characters

2017-07-16 Thread Mikhail V

 ChrisA wrote:

>On Sun, Jul 16, 2017 at 2:33 PM, Rustom Mody  wrote:
>> Right now in an adjacent mailing list (debian) I see someone signed off with a
>>
>> grüß
>>
>> I guess the third character is a u with some ‘dirt’
>> Whats the fourth?

>It's a "sharp S".

or "Eszett". It is a merger of two symbols that were used in old German
texts, an "f"-like glyph and an "s" glyph, i.e. a sort of ligature. Or,
simply, ß is a symbol that looks quite similar to "B".

I would just write: gruss
since it is simpler to type and has a cleaner look.
"ß" is sort of deprecated and is often substituted with "ss". If I am not
mistaken, this substitution is officially allowed in many regions (what
liberality!).
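
For what it's worth, Python's own string methods already treat ß as equivalent to "ss" in some operations. A small sketch using only built-in str methods (the variable name is made up):

```python
word = "grüß"

# Uppercasing expands ß to two letters, so the length changes.
print(word.upper(), len(word), len(word.upper()))

# casefold() is meant for caseless comparison and also maps ß to "ss".
print(word.casefold())
print(word.casefold() == "gruss", word.casefold() == "grüss")
```

Note that only the ß is folded; the ü stays as it is, so "gruss" and "grüß" still compare unequal.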

>>Heck even in the English that I learnt in school we had
>>ægis, homœopath etc

Similar to the above: historical symbols. These are (or should be)
deprecated due to legibility issues, roughly speaking. OTOH they are good
for freaking people out.
Like: "I was in Ægypt", and the reader goes: "aaagypt"?

 ChrisA wrote:
>Tell me, is "å" an a with some 'dirt', or is it a separate character?

From the way you are asking, it seems that you are planning some tricky
business again... I hope not to argue about terminology again. å simply
makes the text flow inconsistent; such things are parasitic for
readability regardless of whether someone proclaims it a separate character
or not. In a reader-oriented medium it should be used only as a last resort.

It looks like an "a" with a circle above, so yes, an "a" with a good deal of dirt.

>Is "i" an ı with some dirt, or a separate letter? Oh wait, you
>probably think that "i" is a letter, and "ı" is the same letter but
>with some dirt missing.

 "i" is a letter, you can't just remove the dot. So there can be just
dirt, and there is 'dirt' which is in fact a natural part of the letter,
like a serif, for example.
But I am not expecting you to accept these statements;
I am just telling what follows from my long experience with the topic.
Though you can try to replace "i" with "ı" globally in a text and chances
are you will notice something. Then you can also try it with å.

>What about "p"? Is that just "d" written the
>wrong way up?

Sort of. The early designers did not find a better solution than taking
the rotated version of one glyph. Are you curious about all the other letters?
Then you should probably start by trying to design a legible typeface.
But ideally you should try to design a typeface from scratch, say some
20 glyphs, not just a Latin-based variation, but truly from scratch. Then
some questions should become more transparent; words are too weak for
conveying this kind of thing.

>At what point does something merit being called a
>different letter?

For it to be truly different, the structural difference must be significant,
i.e. much more significant than the difference
between "ı" and "i". Yes, in Turkish both are used.
And what can I say, it is unfortunate for the users: suboptimal for
legibility plus non-ASCII typing.
But it could be much worse; look at Vietnamese writing.

Mikhail

Re: Grapheme clusters, a.k.a.real characters

2017-07-16 Thread Chris Angelico

On Mon, Jul 17, 2017 at 12:28 PM, Mikhail V  wrote:
>  ChrisA wrote:
>>Tell me, is "å" an a with some 'dirt', or is it a separate character?
>
> From the way you are asking, it seems that you are planning some tricky
> business again... Hope not to argue on terminology again, å simply
> makes the text flow inconsistent, such things are parasitic for
> readability regardless if someone proclaims it a separate character or
> not. In a reader-oriented medium should be used only as a last resort.
>
> Looks like "a" with a circle above, so yes, an "a" with a good deal of dirt.

Norwegian people might take issue with that. It's not "a with circle
above", it's the distinct letter å (pronounced as per the sound the
letter represents, approximately "aw").

>>Is "i" an ı with some dirt, or a separate letter? Oh wait, you
>>probably think that "i" is a letter, and "ı" is the same letter but
>>with some dirt missing.
>
>  "i" is a letter, you can't just remove the dot. So there can be just
> dirt and there is
> 'dirt' which is in fact the natural part of the letter. Like a serif
> for example.
> but I am not expecting your acceptance of these statements,
> I am just telling what follows from my long experience with the topic.
> Though you can try to replace "i" with "ı" globally in a text and there
> are chances you will notice something. Then you can try also with å.

Yep! Nobody would take any notice of the fact that you just put dots
on all those letters. It's not like it's going to make any difference
to anything. We're not dealing with matters of life and death here.

Oh wait.

https://www.theinquirer.net/inquirer/news/1017243/cellphone-localisation-glitch

I'll leave you with that thought.
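
The dotted/dotless distinction is also easy to trip over in Python itself. A small sketch using only built-in str methods (variable names are made up); as far as I know these methods take no locale argument, so Turkish casing rules are not applied:

```python
dotted_small, dotless_small = "i", "\u0131"  # i, ı
dotted_cap, dotless_cap = "\u0130", "I"      # İ, I

# Both lowercase letters uppercase to the same dotless capital,
# so the Turkish distinction is lost on the way up.
print(dotted_small.upper() == dotless_cap)
print(dotless_small.upper() == dotless_cap)

# Lowercasing the dotted capital gives "i" plus a combining dot above:
# two code points, not the plain "i".
lowered = dotted_cap.lower()
print(len(lowered), [hex(ord(c)) for c in lowered])
```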

ChrisA

Re: Grapheme clusters, a.k.a.real characters

2017-07-16 Thread Rustom Mody

On Monday, July 17, 2017 at 6:58:57 AM UTC+5:30, Steve D'Aprano wrote:
> On Mon, 17 Jul 2017 01:40 am, Rustom Mody wrote:
> 
> > On Sunday, July 16, 2017 at 8:10:41 PM UTC+5:30, Rick Johnson wrote:
> [...] 
> > $ python
> > Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 12:22:00)
> > [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>> len("á")
> > 1
> > >>> len("á")
> > 2
> > 
> > Shall we stipulate it to be 1.5? [¿ Maybe 1½ ?]
> 
> Please don't feed the trolls. 

It's usually called a 'joke', Steven! Did the word fall out of your dictionary
in the last upgrade?
Rick was no more trolling than Marko or you or Chris or Mikhail or anyone else.
If anyone's trolling it's me… len("á") == 1½ is so obviously nonsense on so
many levels that I did not think
"And now ladies (are there any?) and gentlemen I am going to tell a joke!"
would be necessary.

On a more serious note, every other post on this thread (as on many discussing
Unicode more broadly) is so ridiculously Euro- (or Anglo-) centric that I would
not know where to begin.
Witness your own…

> If you have to respond to Ranting Rick, at least
> write something sensible that people following this thread might learn from,
> instead of encouraging his nonsense.
> 
> I don't believe for a second you seriously would like len(some_string) to
> return '1½', but just in case anyone is taking that proposal seriously, that
> would break backwards compatibility. len() must return an int, not a float, a
> complex number, or a string.
> 
> If you want to know the length of a string *in bytes*, you have to encode it
> to bytes first, using some specific encoding, then call len() on those bytes.
> 
> If you want to know the length of a string *in code points*, then just call
> len() on the string.
> 
> If you want to know the height or width of a string in pixels in some specific
> font, see your GUI toolkit.
> 
> If you want to know the length of a string in "characters" (graphemes), well,
> Python doesn't have a built-in function to do that, or a standard library
> solution. Yet.

You've given 4 ifs.
A speaker of language L would assume that the atomic units of language L
would be supported.  Your 4th if suggests that's OK. Is it?

Hint1: Ask your grandmother whether unicode's notion of character makes sense. 
Ask 10 gmas from 10 language-L's
Hint2: When in doubt gma usually is right

PS: Claims of Euro- (or some other) centrism usually imply a corresponding
call for "rights", "equality", etc.
No such politically correct call is being made or implied (by me).
There never was equality in the world; there never will be.

Re: Grapheme clusters, a.k.a.real characters

2017-07-16 Thread Chris Angelico

On Mon, Jul 17, 2017 at 2:10 PM, Rustom Mody  wrote:
> Hint1: Ask your grandmother whether unicode's notion of character makes sense.
> Ask 10 gmas from 10 language-L's
> Hint2: When in doubt gma usually is right

Often, but definitely not always. For instance, your grandmother
probably wouldn't think of "newline" as a character. Quite possibly
wouldn't count space, either. On the other hand, I'm pretty sure my
grandmothers would have counted Sherlock Holmes as a character.

ChrisA

Re: Connecting Google News

2017-07-16 Thread dieter

Javier Bezos  writes:
> Google News used to fail with the high level functions provided by
> httplib and the like. However, I found this piece of code somewhere:
> ...
> A few days ago, Google News has been revamped and it doesn't work any
> more (2.6/Win7, 2.7/OSX and, with minimal changes, 3.6/Win7), because
> the page contents is empty. The code itself doesn't raise any
> errors. Which is the proper way to do it now? I must stick to the
> standard libraries.
>
> The returned headers are:
>
> --
> [('Content-Type', 'application/binary'),
> ...
>  ('Location', 'https://news.google.com/news/?ned=es_mx&hl=es'),
> ...
>
> `status` is 302.

`status == 302` means a redirect; the "Location" header gives the new
URL (the one to request next).
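
A minimal standard-library sketch of that rule, assuming the headers arrive as a list of (name, value) pairs like the one quoted. The helper name next_url and the starting URL are placeholders of mine, not taken from the original code, which was elided.

```python
from urllib.parse import urljoin

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def next_url(current_url, status, headers):
    """Return the redirect target, or None if the response is not a redirect."""
    if status not in REDIRECT_STATUSES:
        return None
    for name, value in headers:
        if name.lower() == "location":
            # Location may be relative, so resolve it against the current URL.
            return urljoin(current_url, value)
    return None


headers = [("Content-Type", "application/binary"),
           ("Location", "https://news.google.com/news/?ned=es_mx&hl=es")]

# "https://news.google.com/" is a placeholder for whatever was first requested.
target = next_url("https://news.google.com/", 302, headers)
print(target)
```

The caller would then issue a fresh request to the returned URL and repeat until the status is no longer a redirect (with a cap on the number of hops). Note that urllib.request.urlopen follows redirects on its own by default; the manual step is mainly needed with the lower-level http.client / httplib interface.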


[RELEASE] Python 3.6.2 is now available

2017-07-16 Thread Ned Deily

On behalf of the Python development community and the Python 3.6 release
team, I am happy to announce the availability of Python 3.6.2, the
second maintenance release of Python 3.6.  3.6.0 was released on 2016-12-22
to great interest and we are now providing the second set of bugfixes and
documentation updates for it; the first maintenance release, 3.6.1, was
released on 2017-03-31.  Detailed information about the changes made in
3.6.2 can be found in the change log here:

https://docs.python.org/3.6/whatsnew/changelog.html#python-3-6-2

Please see "What’s New In Python 3.6" for more information about the
new features in Python 3.6:

https://docs.python.org/3.6/whatsnew/3.6.html

You can download Python 3.6.2 here:

https://www.python.org/downloads/release/python-362/

The next maintenance release of Python 3.6 is expected to follow in
about 3 months, around the end of 2017-09.  More information about the
3.6 release schedule can be found here:

https://www.python.org/dev/peps/pep-0494/

Enjoy!

P.S. If you need to download the documentation set for 3.6.2
immediately, you can always find the release version here:
https://docs.python.org/release/3.6.2/download.html

The most current updated versions will appear here:
https://docs.python.org/3.6/

--
  Ned Deily
  n...@python.org -- []

