Re: Best practice for caching hash

2022-03-16 Thread Cameron Simpson
On 16Mar2022 16:58, Chris Angelico  wrote:
>On Wed, 16 Mar 2022 at 14:00, Cameron Simpson  wrote:
>> On 16Mar2022 10:57, Chris Angelico  wrote:
>> A significant difference is that tuples have no keys, unlike a dict.
>>
>> A hash does not have to hash all the internal state, ony the relevant
>> state, and not even all of that. The objective of the hash is twofold to
>> my mind:
>> - that "equal" objects (in the `==` sense) have the same hash, so that
>>   they hash to the same backet in dicts and can therefore be found
>> - that hash values are widely distributed, to maximise the evenness of
>>   the object distribution in buckets
>>
>> For dicts to work, the former has to hold.
>
>Python only demands the first one.

I think we're in violent agreement here.

>And Python's own integers hash to
>themselves, which isn't what I'd call "widely distributed", but which
>works well in practice.

This may be because the "raw" hash (in this case the int value) is 
itself further hashed (giving more evenness) and then moduloed into the 
finite number of slots in the dict.

>> The latter has more flexibility. A dict has keys. If the dicts are quite
>> varied (sets of tags for example), it may be effective to just hash the
>> keys. But if the dict keys are similar (labelled CSV-rows-as-dicts, or
>> likewise with database rows) this will go badly because the hashes will
>> all (or maybe mostly) collide.
>
>The general recommendation is to use all the same data for hashing as
>you use for equality checking. What's the point of just hashing the
>keys, if the values also contribute to equality? I'd just keep it
>simple, and hash both.

Well, hash functions want to be fast. The case where you look up a key 
in a dict (which needs both the hash and a comparison) is not the only 
situation. When a dict's bucket array is resized, the hash functions of 
all stored objects require recomputation, but the equality test is _not_ 
required. Is the hash is very cheap, that is a large benefit here.

I just took a look at the CPython hashtable.c, you can see the rehashing 
process here:


https://github.com/python/cpython/blob/7c776521418676c074a483266339d31c950f516e/Python/hashtable.c#L280

Just an aside, that's got a bunch of interesting opimitsations in it, 
people has put serious work into making this fast and it remains very 
readable!

Suppose I have a nested tree of (hashable) frozen dicts of whatever 
flavour. An equality test requires a deep comparison all the way down 
including the values. You could make a much cheaper hash function which 
hashed just keys, yea, even only unto the upper layer or 2, and still 
meet the primary criterion: all equal items have the same hash value.

My recommendation is not inherently to hash everything involved in the 
equality test. If that's cheap, then sure. But it isn't necessary and 
(given a suitable use domain) it may even be very detrimental to hash 
everything involved in equality, as in the example above.

Cheers,
Cameron Simpson 
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Best practice for caching hash

2022-03-16 Thread Chris Angelico
On Wed, 16 Mar 2022 at 18:48, Cameron Simpson  wrote:
>
> On 16Mar2022 16:58, Chris Angelico  wrote:
> >On Wed, 16 Mar 2022 at 14:00, Cameron Simpson  wrote:
> >> On 16Mar2022 10:57, Chris Angelico  wrote:
> >> A significant difference is that tuples have no keys, unlike a dict.
> >>
> >> A hash does not have to hash all the internal state, ony the relevant
> >> state, and not even all of that. The objective of the hash is twofold to
> >> my mind:
> >> - that "equal" objects (in the `==` sense) have the same hash, so that
> >>   they hash to the same backet in dicts and can therefore be found
> >> - that hash values are widely distributed, to maximise the evenness of
> >>   the object distribution in buckets
> >>
> >> For dicts to work, the former has to hold.
> >
> >Python only demands the first one.
>
> I think we're in violent agreement here.
>
> >And Python's own integers hash to
> >themselves, which isn't what I'd call "widely distributed", but which
> >works well in practice.
>
> This may be because the "raw" hash (in this case the int value) is
> itself further hashed (giving more evenness) and then moduloed into the
> finite number of slots in the dict.

Not in a CPython dict, no. (There is some collision management that
uses upper bits, but the initial selection of bucket simply uses the
raw hash.) See comments at the top of dictobject.c.

"""
This isn't necessarily bad!  To the contrary, in a table of size 2**i, taking
the low-order i bits as the initial table index is extremely fast, and there
are no collisions at all for dicts indexed by a contiguous range of ints. So
this gives better-than-random behavior in common cases, and that's very
desirable.
"""

> >> The latter has more flexibility. A dict has keys. If the dicts are quite
> >> varied (sets of tags for example), it may be effective to just hash the
> >> keys. But if the dict keys are similar (labelled CSV-rows-as-dicts, or
> >> likewise with database rows) this will go badly because the hashes will
> >> all (or maybe mostly) collide.
> >
> >The general recommendation is to use all the same data for hashing as
> >you use for equality checking. What's the point of just hashing the
> >keys, if the values also contribute to equality? I'd just keep it
> >simple, and hash both.
>
> Well, hash functions want to be fast. The case where you look up a key
> in a dict (which needs both the hash and a comparison) is not the only
> situation. When a dict's bucket array is resized, the hash functions of
> all stored objects require recomputation, but the equality test is _not_
> required. Is the hash is very cheap, that is a large benefit here.

Fast is less important than accurate. Hashing functions are usually
*fast enough*, and since the vast majority of the work of hashing is
usually going to be strings, every other data type can just
recalculate its hash based on its children. Caching the hash of a
string is very useful; caching the hash of a tuple, not so much; again
quoting from the CPython source code:

/* Tests have shown that it's not worth to cache the hash value, see
   https://bugs.python.org/issue9685 */

> I just took a look at the CPython hashtable.c, you can see the rehashing
> process here:
>
> 
> https://github.com/python/cpython/blob/7c776521418676c074a483266339d31c950f516e/Python/hashtable.c#L280

Not sure where that file is used; it's purely internal to CPython and
seems to be used for tracemalloc. It's not the dictionary type -
that's in Objects/dictobject.c.

In any case, the rehash operation is basically for after a resize on
the hashtable, and it doesn't rely on the objects caching their hashes
- it retains them in the hashtable.

> Just an aside, that's got a bunch of interesting opimitsations in it,
> people has put serious work into making this fast and it remains very
> readable!
>
> Suppose I have a nested tree of (hashable) frozen dicts of whatever
> flavour. An equality test requires a deep comparison all the way down
> including the values. You could make a much cheaper hash function which
> hashed just keys, yea, even only unto the upper layer or 2, and still
> meet the primary criterion: all equal items have the same hash value.

Yes, but "def __hash__(self): return 42" also meets the primary
criterion. I don't know what use-cases frozendicts have, but I would
suspect that if they are used at all, they'll often be used in cases
where their keys are identical (for instance, the __dict__ of an
immutable object type, where the keys are extremely stable across
objects of the same type).

> My recommendation is not inherently to hash everything involved in the
> equality test. If that's cheap, then sure. But it isn't necessary and
> (given a suitable use domain) it may even be very detrimental to hash
> everything involved in equality, as in the example above.
>

My recommendation is to hash everything involved in the equality test,
unless there's demonstrable benefit from not doing so.

ChrisA
-- 
https://mail.pyth

Re: Best practice for caching hash

2022-03-16 Thread Cameron Simpson
On 16Mar2022 19:09, Chris Angelico  wrote:
>On Wed, 16 Mar 2022 at 18:48, Cameron Simpson  wrote:
>> This may be because the "raw" hash (in this case the int value) is
>> itself further hashed (giving more evenness) and then moduloed into the
>> finite number of slots in the dict.
>
>Not in a CPython dict, no. (There is some collision management that
>uses upper bits, but the initial selection of bucket simply uses the
>raw hash.) See comments at the top of dictobject.c.

Thank you, I stand corrected. That's a very interesting read.

Cheers,
Cameron Simpson 
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Best practice for caching hash

2022-03-16 Thread Marco Sulla
On Wed, 16 Mar 2022 at 00:42, Cameron Simpson  wrote:
>
> Is it sensible to compute the hash only from the immutable parts?
> Bearing in mind that usually you need an equality function as well and
> it may have the same stability issues.

[...]

> In that case I would be inclined to never raise TypeError at all. I'd
> compute the hash entirely from the keys of the dict and compute equality
> in the normal fashion: identical keys and then equal corresponding
> values. That removes the requirement that values be immutable and/or
> hashable.

Well, I followed PEP 416, so I allowed immutable types for values, as
tuple does. Also tuple is hashable only if all its values are
hashable.

The equality function is the same as dict, with a little modification.
I do not check for hash in equality. I could add that, if both the
hash are calculated and different from -1 and they differ, False is
returned.

> >In this case I currently cache the value -1. The subsequent calls to
> >__hash__() will check if the value is -1. If so, a TypeError is
> >immediately raised.
>
> This will also make these values behave badly in dicts/sets, as they all
> hash to the same bucket.

Not sure to understand. If the hash is -1, it's not hashable, so it
can't be a member of a dict or set.

> You could, you know, cache the original exception.

I thought about it :) What prevented me is that is another PySsize_t
to store in memory
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Reportlab / platypus bug?

2022-03-16 Thread Les
Greg Ewing  ezt írta (időpont: 2022. márc.
16., Sze, 1:01):

> On 16/03/22 2:20 am, Les wrote:
> > I tried to subscribe (twice), but never got the confirmation
> > email. Checked in the spam folder too, but it is nowhere to be found.
>
> Is there any chance your email provider has some kind of quarantine
> system separate from your spam folder?
>
Probably not, I was trying to register with gmail.

It is interesting, I did not get the confirmation email (tried twice). Next
day, I got it instantly.

>
>
-- 
https://mail.python.org/mailman/listinfo/python-list


[RELEASE] Python 3.10.3, 3.9.11, 3.8.13, and 3.7.13 are now available with security content

2022-03-16 Thread Łukasz Langa
Welcome again to the exciting world of releasing new Python versions!

Last time around I was complaining about cursed releases 
.
 This time around I could complain about security content galore and how one of 
them 
 
ruined my ingenious idea to release Python on Pi Day and call it Py Day 
.
 Well, you can’t have everything in life. Or at least not everything at once.

And yet it seems this time around we’ve got a lot of security fixes all at 
once. Just look at this list:

15 (sic!) CVEs: libexpat upgraded from 2.4.1 to 2.4.7 (BPO-46794 
, BPO-46932 
, BPO-46811 
, BPO-46784 
, BPO-46400 
)
CVE-2022-0778: OpenSSL upgraded from 1.1.1l to 1.1.1n in macOS and Windows 
installers (BPO-47024 )
CVE-2016-3189, CVE-2019-12900: bzip2 upgraded from 1.0.6 to 1.0.8 in Windows 
installers (BPO-44549 )
CVE-2022-26488: Windows installer now ensures the correct path is being 
repaired when “Add to PATH” is used (BPO-46948 
)
CVE-2021-28363: bundled pip upgraded from 21.2.4 to 22.0.4 (BPO-46985 
)
authorization bypass fixed in urllib.request (BPO-46756 
)
REDoS avoided in importlib.metadata (BPO-46474 
)
SQLite upgraded from 3.36.0 to 3.37.2 in macOS and Windows installers 
(BPO-45925 )
 
Python
 3.10.3

Get it here: https://www.python.org/downloads/release/python-3103/ 

Python 3.10.3 is the third maintenance release of the newest version of the 
Python programming language, which contains many new features and 
optimizations. We recommend it over the other releases listed below.

This is a large bugfix release with 220 commits since 3.10.2. Just look at the 
change log !

The next maintenance release of Python 3.10 is planned for early June.

 
Python
 3.9.11

Get it here: https://www.python.org/downloads/release/python-3911/  

This is the penultimate planned full bugfix release of Python 3.9. In mid-May, 
we’ll be releasing the last one, after which the 3.9 series will enter its 
security-only fixes period. More details in PEP 596 
.

Python 3.9 is the first Python version since 2.7 to have a regular bugfix 
release larger than “.10”. It’s also still a significant release at 163 commits 
since 3.9.10. That’s in fact 30+ commits more than between 3.9.9 and 3.9.10. 
The change log  
isn’t as long as the 3.10.3 one but it’s still pretty extensive!

As a reminder, on macOS, the default installer is now the new universal2 
variant. It’s compatible with Mac OS X 10.9 and newer, including macOS 11 Big 
Sur and macOS 12 Monterey. Python installed with this variant will work 
natively on Apple Silicon processors.

 
Python
 3.8.13

Get it here: https://www.python.org/downloads/release/python-3813/ 

Changes  here 
are almost exclusively security-only as the life cycle of Python versions 
prescribes. You might have noticed there is a small number of regular bug fixes 
nonetheless. This is because without those we wouldn’t be able to continue 
running the full test suite for the 3.8 branch. This in turn could hide 
regressions in future security fixes.

 
Python
 3.7.13

Get it here: https://www.python.org/downloads/release/python-3713/ 

Just like 3.8, Python 3.7 is in its security-only fixes period. In turn, the 
changes in 3.7.13 
 look almost 
identical to the ones in 3.8.13.

Python 3.7 will cont

Re: Best practice for caching hash

2022-03-16 Thread Marco Sulla
On Wed, 16 Mar 2022 at 00:59, Chris Angelico  wrote:
>
> (Though it's a little confusing; a frozendict has to have nothing but
> immutable objects, yet it permits them to be unhashable?

It can have mutable objects. For example, a key k can have a list v as
value. You can modify v, but you can't assign to the key k another
value w. It's the same with the tuples, as you said. An index i can
contain a list l. Since it's a tuple, you can't set another object at
the index i, but you can modify the list l.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Best practice for caching hash

2022-03-16 Thread Greg Ewing

On 16/03/22 6:58 pm, Chris Angelico wrote:
And Python's own integers hash to themselves, 

> which isn't what I'd call "widely distributed", but which
> works well in practice.

Not exactly, they seem to be limited to 60 bits:

>>> hex(hash(0xfff))
'0xfff'
>>> hex(hash(0x1fff))
'0x0'

And up to that limit they're as widely distributed as you
can get -- each integer hashes to a unique value!

But keep in mind that the hash value itself is not directly
used to locate a dict slot -- there is extra scrambling that
goes on in the dict code.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


basic question on loop & type casting / list/str/array

2022-03-16 Thread Kiran Kumar
Hi.

Pls check on below poython 3.9.x code & suggest how can i keep the string 
intactst in 2nd loop... ? these are aws ec2 ids

>>> INSTANCE_ID = ['i-0dccf1ede229ce1','i-0285506fee62051']

>>> for i in INSTANCE_ID:
...   print (i)
...
i-0dccf1ede229ce1
i-0285506fee62051
>>>
>>>
>>>
>>> for i in INSTANCE_ID:
...   for j in i:
... print (j)
...
i
-
0
d
c
c
f
1
e
d
e
2
2
9
c
e
1
i
-
0
2
8
5
5
0
6
f
e
e
6
2
0
5
1
>>>


Thanks 
Kiran
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: basic question on loop & type casting / list/str/array

2022-03-16 Thread Python

Kiran Kumar wrote:

Hi.

Pls check on below poython 3.9.x code & suggest how can i keep the string 
intactst in 2nd loop... ? these are aws ec2 ids


Don't loop through it then.


--
https://mail.python.org/mailman/listinfo/python-list


Re: Best practice for caching hash

2022-03-16 Thread Marco Sulla
On Wed, 16 Mar 2022 at 09:11, Chris Angelico  wrote:
> Caching the hash of a
> string is very useful; caching the hash of a tuple, not so much; again
> quoting from the CPython source code:
>
> /* Tests have shown that it's not worth to cache the hash value, see
>https://bugs.python.org/issue9685 */

This is really interesting. Unluckily I can't use the pyperformance
benchmarks. I should use the code that uses frozendict, but I suppose
it's really hard...
Anyway this discourages me to continue to store unashable value, since
I should also store the error message. Storing only the hash when the
object is hashable is much cheaper, and maybe the extra field is not
so much a problem, since dict consumes more space than a tuple:

>>> sys.getsizeof({})
64
>>> sys.getsizeof(())
40
>>> sys.getsizeof({1:1})
232
>>> sys.getsizeof((1,))
48

> I don't know what use-cases frozendicts have, but I would
> suspect that if they are used at all, they'll often be used in cases
> where their keys are identical (for instance, the __dict__ of an
> immutable object type, where the keys are extremely stable across
> objects of the same type).

Well, I tried to implement them as dicts with shared keys, but I
abandoned it when Inada optimized dict(another_dict), where
another_dict is a compact dict. Since it's compact, you have "only" to
memcopy the entries (oversimplification).

I tried to do the same trick for the sparse dict structure, but
memcopy the keys and the values was not enough. I had to incref all
value *two* times and this slowed down the creation a lot. So I
decided to move to compact structure.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Best practice for caching hash

2022-03-16 Thread Cameron Simpson
On 16Mar2022 14:00, Marco Sulla  wrote:
>On Wed, 16 Mar 2022 at 00:42, Cameron Simpson  wrote:
>> >In this case I currently cache the value -1. The subsequent calls to
>> >__hash__() will check if the value is -1. If so, a TypeError is
>> >immediately raised.
>>
>> This will also make these values behave badly in dicts/sets, as they all
>> hash to the same bucket.
>
>Not sure to understand. If the hash is -1, it's not hashable, so it
>can't be a member of a dict or set.

Sorry, I misread you and took -1 to be a valid hash code, not a signal 
to raise TypeError.

Cheers,
Cameron Simpson 
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Suggestion for Linux Distro (from PSA: Linux vulnerability)

2022-03-16 Thread 황병희
Dear Loris,

"Loris Bennett"  writes:

> (...thanks...)
> The sysadmins I know who are interested in long-term stability and
> avoiding unnecessary OS updates use Debian rather than Ubuntu,

+1; Reasonable!

Sincerely, Linux fan Byung-Hee

-- 
^고맙습니다 _地平天成_ 감사합니다_^))//
-- 
https://mail.python.org/mailman/listinfo/python-list