Re: Red Black Tree implementation?

2013-05-06 Thread duncan smith

On 03/05/13 03:00, Dan Stromberg wrote:


On Wed, May 1, 2013 at 7:06 PM, duncan smith <buzzard@invalid.invalid> wrote:

I have an implementation that you can try out. It's not based on any
other implementation, so my bugs will be independent of any bugs in
the code you're currently using. It looks more like a set - add,
remove, discard. Not tried on Python 3 or run through pylint. I just
tried adding a million items to a tree, and it takes about 25%
longer to add items at the end compared to those at the beginning.
Timing removals uncovered a bug. So if you want the code I'll fix
the bug and send it (to your gmail e-mail address?). Cheers.

Duncan
--
http://mail.python.org/mailman/listinfo/python-list


What license?

Thanks!



Here's the text I usually prepend.


##Copyright (c) 2013 duncan g. smith
##
##Permission is hereby granted, free of charge, to any person obtaining a
##copy of this software and associated documentation files (the "Software"),
##to deal in the Software without restriction, including without limitation
##the rights to use, copy, modify, merge, publish, distribute, sublicense,
##and/or sell copies of the Software, and to permit persons to whom the
##Software is furnished to do so, subject to the following conditions:
##
##The above copyright notice and this permission notice shall be included
##in all copies or substantial portions of the Software.
##
##THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
##OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
##FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
##THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
##OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
##ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
##OTHER DEALINGS IN THE SOFTWARE.


Basically, "do what you want with it but don't blame me if it goes tits 
up". I'm happy to consider tidying it up a bit and using a more 
recognized form of licence. Just had a bank holiday here, so bug not yet 
squashed. But it is the sort of bug that might account for what you've 
seen (if a similar bug exists in the code you've been using). The tree 
doesn't always get properly rebalanced on node removals. I'll attack the 
problem later tomorrow (technically, later today). Cheers.


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Red Black Tree implementation?

2013-05-07 Thread duncan smith

On 07/05/13 02:20, Dan Stromberg wrote:



[snip]


I'm starting to think Red Black Trees are pretty complex.




A while ago I looked at a few different types of self-balancing binary 
tree. Most look much easier to implement.


BTW, the licence might be MIT - I just copied it from someone else's code.

Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Red Black Tree implementation?

2013-05-08 Thread duncan smith

On 07/05/13 02:20, Dan Stromberg wrote:


On Mon, May 6, 2013 at 5:55 PM, duncan smith <buzzard@invalid.invalid> wrote:




[snip]



I'd prefer Apache or MIT or BSD 3-clause, but I could probably work with
this.
http://joinup.ec.europa.eu/community/eupl/news/licence-proliferation-way-out

I'm eager to see the code, and would love it if you sorted out the
deletion rebalance issue.

I just plunked some time into
https://github.com/headius/redblack/blob/master/red_black_tree.py , only
to find that it didn't appear to be doing deletions correctly - the tree
would become unprintable after deleting one element.  It's possible I
introduced the bug, but right now I don't particularly suspect so,
having not changed the __del__ method.

I'm starting to think Red Black Trees are pretty complex.




Mine is fixed now (sent to your gmail address). Restoring the tree 
properties after deletion is awkward to get right, and doesn't affect 
the performance noticeably for smallish trees if you get it wrong.


I realised my code was buggy when I tried adding, then removing a 
million items and ran into the recursion limit. It now passes a test 
where I check the tree properties after each addition / deletion.


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Red Black Tree implementation?

2013-05-08 Thread duncan smith

On 09/05/13 02:40, Dan Stromberg wrote:

OK, I've got one copy of trees.py with md5
211f80c0fe7fb9cb42feb9645b4b3ffe.  You seem to be saying I should have
two though, but I don't know that I do...




I've just re-sent it.

Duncan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Red Black Tree implementation?

2013-05-09 Thread duncan smith

On 09/05/13 02:40, Dan Stromberg wrote:

OK, I've got one copy of trees.py with md5
211f80c0fe7fb9cb42feb9645b4b3ffe.  You seem to be saying I should have
two though, but I don't know that I do...



[snip]

Yes, 211f80c0fe7fb9cb42feb9645b4b3ffe is the correct checksum for the 
latest version. The previous version had an issue when adding 
non-distinct items (items that compare equal to items already in the 
tree). Cheers.


Duncan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Red Black Tree implementation?

2013-05-11 Thread duncan smith

On 12/05/13 00:24, Dan Stromberg wrote:


I'm afraid I'm having some trouble with the module.  I've checked it
into my SVN at
http://stromberg.dnsalias.org/svn/red-black-tree-mod/trunk/duncan

I have two versions of your tests in there now - "t" is minimally
changed, and test-red_black_tree_mod is pretty restructured to
facilitate adding more tests later.  I get the same problem with either
version of the tests.

The problem I'm seeing is that the tree, when built from items, isn't
looking quite right.  I inserted a print(tree) into the for loop, and
I'm getting the following, where I expected the tree to grow by one
element on each iteration:

$ python t
6 False None None
6 False 3 None
6 False 3 15
6 False 3 15
6 False 3 11
6 False 3 11
6 False 3 11
11 False 6 15
11 False 6 15
11 False 6 15
11 False 6 15
11 False 6 15
11 False 6 15
11 False 6 15
11 False 6 15
11 False 6 15
11 False 6 15
11 False 6 15
11 False 6 15
11 False 6 15

Thoughts?

BTW, printing an empty tree seems to say "sentinel".  'not sure if that
was intended.

Thanks!



The leaf node has parent equal to None. All tree nodes have two 
children. One or both children may be sentinels, and a sentinel is 
signified by having both left and right (children) equal to None. So an 
empty tree is a sentinel node that is also root. So the string 
"sentinel" is expected (although possibly not the most sensible option).


For non-sentinel nodes the string is generated by,

return '%s %s %s' % (self.data, self.left.data, self.right.data)

for the BinaryTree class, and by

return '%s %s %s %s' % (self.data, self.is_red, self.left.data, 
self.right.data)


for the RedBlackTree class.


So what is being printed above is (in each case) the value contained in 
the root node, followed by its colour (True if red), and the values 
contained in the root node's left and right children.


The root node remains root, although its value and its children (and 
their values) might change due to tree rotations.


It looks OK to me. The empty tree would print "sentinel". After adding 
the value 6 there is one tree node with sentinels as children (values 
equal to None). Adding 3 results in 3 being the value of the root's left 
child. Its right child is still a sentinel. Adding 15 results in that 
value being assigned to the right child. Adding 9 results in no change 
to the values in the root or its children. Adding 11 results in a tree 
rotation and 11 becomes the value in the right child of the root. At a 
later point a tree rotation results in the value of the root node being 
changed.


I haven't implemented a way of representing the structure of the whole 
red black tree. I would probably write some code to generate a dot file 
and use that to generate a png. But you could add something like,


print tree.height, tree.size, list(tree)

and get output like,

0 1 [6]
1 2 [3, 6]
1 3 [3, 6, 15]
2 4 [3, 6, 9, 15]
3 5 [3, 6, 9, 11, 15]
4 6 [3, 6, 9, 11, 12, 15]
4 7 [3, 6, 9, 11, 12, 15, 16]
5 8 [3, 6, 9, 11, 12, 14, 15, 16]
5 9 [3, 6, 9, 11, 12, 14, 15, 16, 17]
5 10 [3, 6, 7, 9, 11, 12, 14, 15, 16, 17]
5 11 [3, 6, 7, 9, 11, 12, 14, 15, 16, 17, 18]
5 12 [3, 5, 6, 7, 9, 11, 12, 14, 15, 16, 17, 18]
5 13 [3, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18]
6 14 [3, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18]
6 15 [0, 3, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18]
6 16 [0, 2, 3, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18]
6 17 [0, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18]
6 18 [-1, 0, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18]
6 19 [-1, 0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]


It doesn't give you the structure, but it does show that it seems to be 
growing reasonably. Cheers.
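
For the structure itself, a dot file generator along the lines mentioned
above might look something like this (a sketch only; it assumes the tree
exposes a root attribute and that nodes have data, is_red, left and right
attributes, with sentinels signified by left equal to None):

def to_dot(tree):
    # one node statement per non-sentinel node, plus edges to
    # non-sentinel children
    lines = ['digraph rbtree {']
    stack = [tree.root]
    while stack:
        node = stack.pop()
        if node.left is None:   # sentinel, nothing to draw
            continue
        colour = 'red' if node.is_red else 'black'
        lines.append('  "%s" [color=%s];' % (node.data, colour))
        for child in (node.left, node.right):
            if child.left is not None:
                lines.append('  "%s" -> "%s";' % (node.data, child.data))
        stack.extend([node.left, node.right])
    lines.append('}')
    return '\n'.join(lines)

# write the result to tree.dot, then e.g. "dot -Tpng tree.dot -o tree.png"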


Duncan


--
http://mail.python.org/mailman/listinfo/python-list


Re: Red Black Tree implementation?

2013-05-11 Thread duncan smith

On 12/05/13 02:29, Dan Stromberg wrote:


On Sat, May 11, 2013 at 4:24 PM, Dan Stromberg <drsali...@gmail.com> wrote:


I'm afraid I'm having some trouble with the module.  I've checked it
into my SVN at
http://stromberg.dnsalias.org/svn/red-black-tree-mod/trunk/duncan

I have two versions of your tests in there now - "t" is minimally
changed, and test-red_black_tree_mod is pretty restructured to
facilitate adding more tests later.  I get the same problem with
either version of the tests.

The problem I'm seeing is that the tree, when built from items,
isn't looking quite right.  I inserted a print(tree) into the for
loop, and I'm getting the following, where I expected the tree to
grow by one element on each iteration:

$ python t
6 False None None
6 False 3 None
6 False 3 15
6 False 3 15

I figured out that this was printing a single node and some of its
attributes, not an entire tree.  I changed it to print an entire tree
using self.in_order().


Yes, I've just posted regarding that.



I've also changed around the comparisons a bit, to use a __cmp__ method
but still provide __eq__, __neq__ and a new __lt__.



I have implemented a lot (maybe all?) of the set methods in a subclass. 
I should probably root that out and have a think about what should be in 
the RedBlackTree class and what subclasses might look like.




I'm up against a new problem now that it'd be nice if you could look at:
In BinaryTree.find(), it sometimes compares the item being searched for
against None.  In 2.x, this gives strange results, but may be benign in
this code.  In 3.x, this raises an exception.  I've added a comment
about this in the SVN repo I mentioned above.

You can see the traceback yourself with python3 test-red_black_tree_mod .

What should BinaryTree.find() do if it finds a data.node that is None?



A call to "find(data)" should find and return either a node containing 
"data"; or the sentinel node where "data" should be added. It should not 
get as far as the left or right child of a sentinel node (which would 
equal None). I'll look at this tomorrow. I did have the truth value of a 
node depending on it's data value (None implying False). Then I 
considered the possibility of actually wanting None as a value in the 
tree and changed it, so I could have introduced a bug here.



Thanks!

PS: Is it about time we moved this discussion off python-list?



Maybe. You have my official e-mail address. Cheers.

Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Red Black Tree implementation?

2013-05-12 Thread duncan smith

On 12/05/13 03:02, duncan smith wrote:

On 12/05/13 02:29, Dan Stromberg wrote:


On Sat, May 11, 2013 at 4:24 PM, Dan Stromberg <drsali...@gmail.com> wrote:


[snip]



What should BinaryTree.find() do if it finds a data.node that is None?



A call to "find(data)" should find and return either a node containing
"data"; or the sentinel node where "data" should be added. It should not
get as far as the left or right child of a sentinel node (which would
equal None). I'll look at this tomorrow. I did have the truth value of a
node depending on its data value (None implying False). Then I
considered the possibility of actually wanting None as a value in the
tree and changed it, so I could have introduced a bug here.



It's a Python3 thing. The initial sentinel node was evaluating to True. 
__nonzero__ needs to be changed to __bool__.
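
The fix is essentially an alias, so that truth testing works under both
versions; something along these lines (a sketch - the real attribute used
for the test may differ):

class Node(object):

    def __init__(self):
        self.left = None
        self.right = None

    def __bool__(self):
        # a sentinel (both children None) is falsey, a proper node is truthy
        return self.left is not None

    __nonzero__ = __bool__   # Python 2 uses __nonzero__, Python 3 uses __bool__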



Thanks!

PS: Is it about time we moved this discussion off python-list?



Let's do that from now.

Duncan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Ordered dictionaries compared

2013-05-23 Thread duncan smith

On 23/05/13 04:31, Dan Stromberg wrote:


What kind of ordered dictionaries?  Sorted by key.

I've redone the previous comparison, this time with a better red-black
tree implementation courtesy of Duncan G. Smith.

The comparison is at
http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/just-trees/

The Red-Black tree gave a much better showing this time, but it gave
just one 2nd place on one workload-interpreter - still kinda
lackluster.  It took 1st place 0 times.




A quick test of my Red Black Tree and Treap (Python 2.7).


>>> def test_trees(data, randomize=True):
    cpy = data[:] # for deletion
    if randomize:
        random.shuffle(data)
        random.shuffle(cpy)
    t = binary_trees.RedBlackTree()
    start = time.time()
    for datum in data:
        t.insert(datum)
    print 'Red Black Tree insertion %s' % (time.time() - start)
    start = time.time()
    for datum in data:
        t.find(datum)
    print 'Red Black Tree find %s' % (time.time() - start)
    start = time.time()
    for datum in cpy:
        t.delete(datum)
    print 'Red Black Tree deletion %s' % (time.time() - start)
    t = binary_trees.Treap()
    start = time.time()
    for datum in data:
        t.insert(datum)
    print
    print 'Treap insertion %s' % (time.time() - start)
    start = time.time()
    for datum in data:
        t.find(datum)
    print 'Treap find %s' % (time.time() - start)
    start = time.time()
    for datum in cpy:
        t.delete(datum)
    print 'Treap deletion %s' % (time.time() - start)


>>> test_trees(range(10))
Red Black Tree insertion 5.42807197571
Red Black Tree find 1.58799219131
Red Black Tree deletion 3.87580800056

Treap insertion 6.79647684097
Treap find 2.11693120003
Treap deletion 4.61243915558
>>>
>>> test_trees(range(10), False)
Red Black Tree insertion 6.29647898674
Red Black Tree find 1.157143116
Red Black Tree deletion 2.74785804749

Treap insertion 3.87288999557
Treap find 1.48776102066
Treap deletion 1.88962197304
>>>


RBT is quicker than Treap for insertion with randomized data, but slower 
with ordered data. Randomized data will tend to minimize the number of 
tree rotations needed to keep the RBT balanced, whilst the Treap will be 
performing rotations to maintain the heap property in an already 
reasonably well balanced tree. With ordered data the RBT will have to 
work harder to keep the tree balanced, whilst the Treap will be able to 
maintain the heap property with fewer rotations.


No surprise that find() is generally quicker for RBTs, they tend to be 
better balanced.


Deletion is a bit more confusing. I suppose deletion from a better 
balanced tree will tend to be quicker, but deletion from a treap 
constructed from ordered data is (for some reason) quickest of all.


All these operations require a call to find(), and that is generally 
going to be quicker for RBTs. Treaps tend to require fewer subsequent 
rotations, but they have variable worth (in terms of rebalancing).


Looks like RBTs are better than treaps if they are being populated with 
randomly ordered data, but not if they are being populated with ordered 
data. RBTs are better for use cases that are heavy on finds.


Both types of tree appear to be better balanced (on the basis of the 
find results) if populated from ordered data. Treaps appear to perform 
better on insertion, find and deletion when populated from ordered data.


Duncan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Ordered dictionaries compared

2013-05-23 Thread duncan smith

On 23/05/13 18:44, Dan Stromberg wrote:


On Thu, May 23, 2013 at 9:41 AM, duncan smith <buzzard@invalid.invalid> wrote:


RBT is quicker than Treap for insertion with randomized data, but
slower with ordered data. Randomized data will tend to minimize the
number of tree rotations needed to keep the RBT balanced, whilst the
Treap will be performing rotations to maintain the heap property in
an already reasonably well balanced tree. With ordered data the RBT
will have to work harder to keep the tree balanced, whilst the Treap
will be able to maintain the heap property with fewer rotations.

No surprise that find() is generally quicker for RBTs, they tend to
be better balanced.

Deletion is a bit more confusing. I suppose deletion from a better
balanced tree will tend to be quicker, but deletion from a treap
constructed from ordered data is (for some reason) quickest of all.

All these operations require a call to find(), and that is generally
going to be quicker for RBTs. Treaps tend to require fewer
subsequent rotations, but they have variable worth (in terms of
rebalancing).

Looks like RBTs are better than treaps if they are being populated
with randomly ordered data, but not if they are being populated with
ordered data. RBTs are better for use cases that are heavy on finds.

Both types of tree appear to be better balanced (on the basis of the
find results) if populated from ordered data. Treaps appear to
perform better on insertion, find and deletion when populated from
ordered data.

Strange.  I was comparing randomized data (95% get, 50-50 get and set,
95% set) when I found that treaps were quite a bit faster than red black
trees.

The code I used is here:
http://stromberg.dnsalias.org/svn/python-tree-and-heap-comparison/trunk/

See also
https://en.wikipedia.org/wiki/Binary_search_tree#Performance_comparisons
, which found that treaps were faster on average than red black trees.




Dan,
Faster on average, but it depends what you're averaging over. As 
far as insertion and deletions are concerned my results agree with those 
in the paper, except they have treaps performing slightly faster than 
RBTs for insertion with randomly ordered data.


Deletion in your code is slightly different to that in mine. It might 
make a difference. Also, your code doesn't use sentinels (pros and 
cons). It could be down to implementation details.


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Simple algorithm question - how to reorder a sequence economically

2013-05-24 Thread duncan smith

On 24/05/13 10:11, Chris Angelico wrote:

On Fri, May 24, 2013 at 6:47 PM, Fábio Santos  wrote:


On 24 May 2013 09:41, "Chris Angelico"  wrote:


On Fri, May 24, 2013 at 6:14 PM, Peter Brooks
 wrote:

What is the easiest way to reorder a sequence pseudo-randomly?

That is, for a sequence 1,2,3,4 to produce an arbitrary ordering (eg
2,1,4,3) that is different each time.


...


It works, it produces a unique list for any given index provided, but
it's not the cleanest or most efficient. But I know someone'll improve
on it... or tell me I'm an idiot for not taking a more obvious
approach :)

ChrisA


I think that is pretty much itertools.permutations from the standard
library. The OP should check it out.


That works if all the permutations are wanted at once. Is there a way,
short of iterating over it N times, to request permutation #N? Or
maybe I'm misreading the OP and that's not a requirement.

ChrisA



A long time ago I wrote some code to do that.


import gmpy

def LexPermFromIndex(items, index):
    n = len(items)
    inds = range(n)
    perm = []
    for i in range(1, n+1):
        r, index = divmod(index, gmpy.fac(n-i))
        r = int(r)
        perm.append(inds[r])
        inds = inds[:r] + inds[r+1:]

    return [items[i] for i in perm]


>>> LexPermFromIndex([1,2,3,4], 0)
[1, 2, 3, 4]
>>> LexPermFromIndex([1,2,3,4], 1)
[1, 2, 4, 3]
>>> LexPermFromIndex([1,2,3,4], 10)
[2, 4, 1, 3]
>>>


I can't remember exactly why I wrote it. But I also have something for 
generating a permutation's index and similar functions for combinations.
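
The inverse (from a permutation back to its lexicographic index) works the
same way in reverse; a sketch of the idea, not the original code, using
math.factorial in place of gmpy:

from math import factorial

def LexIndexFromPerm(items, perm):
    # pool holds the items not yet placed, in lexicographic order
    pool = list(items)
    index = 0
    n = len(pool)
    for i, value in enumerate(perm):
        r = pool.index(value)               # rank of this choice among those left
        index += r * factorial(n - i - 1)   # skip the blocks of earlier choices
        pool.pop(r)
    return index

For example, LexIndexFromPerm([1,2,3,4], [2, 4, 1, 3]) gives 10, matching
the example above.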


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


embed revision control in Python application?

2012-06-22 Thread duncan smith

Hello,
  I have an application that would benefit from collaborative 
working. Over time users construct a "data environment" which is a 
number of files in JSON format contained in a few directories (in the 
future I'll probably place these in a zip so the environment is 
contained within a single file). At the moment there is one individual 
constructing the data environment, and me occasionally applying 
corrections after being e-mailed the files. But in the future there 
might be several individuals in various locations.


As a minimum requirement I need to embed some sort of version control, 
so that changes committed by one individual will be seen in the local 
environments of the others. Some of the work involves editing graphs 
which have restrictions on their structure. In this case it would be 
useful for edits to be committed / seen in real time. The users will not 
be particularly technical, so the version control will have to happen 
relatively quietly in the background.


My immediate thoughts are to (somehow) embed Mercurial or Subversion. It 
would certainly be useful to be able to revert to a previous version of 
the data environment if an individual does something silly. But I'm not 
actually convinced that this is the whole solution for collaborative 
working. Any advice regarding the embedding of a version control system 
or alternative approaches would be appreciated. I haven't tried anything 
like this before. The desktop application is written in Python (2.6) 
with a wxPython (2.8) GUI. Given the nature of the application / data 
the machines involved might be locally networked but without web access 
(if this makes a difference). TIA.


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: embed revision control in Python application?

2012-06-22 Thread duncan smith

On 22/06/12 17:42, Emile van Sebille wrote:

On 6/22/2012 8:58 AM duncan smith said...

Hello,
I have an application that would benefit from collaborative working.
Over time users construct a "data environment" which is a number of
files in JSON format contained in a few directories


You don't say what your target platform is, but on linux I've done some
testing with python-fuse that allows interception on file access to take
whatever actions you like, in your case archive prior upon write.

Might be worth a look.

Emile



I develop on Linux, but most users would be running some flavour of 
Windows. Initially I'd like to get something up and running that would 
allow me to collaborate from an Ubuntu box at home with someone using a 
Windows machine (not sure which version) in an office at the University 
of Manchester. The most likely end users would (probably) be running 
Windows machines on a local network with no internet access.


I expect it would generally be possible to have an always-on server, but 
I'm also thinking about peer to peer style communication (information 
might not always be completely available, but it's better than being 
totally unaware of changes being made by others).


I don't have much experience getting applications to communicate across 
a network, particularly in a reasonably secure fashion. Someone I know 
also suggested RabbitMQ. Any pointers that help me to reduce the options 
to a manageable number of candidates will be appreciated. A shallow 
learning curve would also be good (given that ATM this is an idea I want 
to try out rather than paid work). I am looking at fuse at the moment. 
Thanks.


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: embed revision control in Python application?

2012-06-22 Thread duncan smith

On 22/06/12 21:34, Emile van Sebille wrote:

On 6/22/2012 11:19 AM duncan smith said...

On 22/06/12 17:42, Emile van Sebille wrote:

On 6/22/2012 8:58 AM duncan smith said...

Hello,
I have an application that would benefit from collaborative working.
Over time users construct a "data environment" which is a number of
files in JSON format contained in a few directories


So, will the users modify their local environment and you'd push the
revisions to a known location for redistribution?


Yes. My rough idea is that each time a user opens the application it 
will connect to a server and download the current data environment (or 
preferably just the changes made since the application was last 
connected). Thus the user can start with an up to date environment. As 
the application is used to make changes to the data environment any 
changes are uploaded to the server for immediate redistribution to other 
connected application instances.


Part of the application involves the construction of directed acyclic 
graphs. If I add an edge to a graph I want anyone else editing the same 
graph to be able to see the edge in something approaching real time so 
that cycles are avoided. (Being able to lock the file so that only one 
user can edit it concurrently might be another solution to this specific 
issue.)


How might peer-to-peer

work? How would you know which peers get the change, or would all peers
get the change?



All peers. I'm not sure about the peer to peer thing though. It would be 
better if the user could be guaranteed that the environment they see is 
current, rather than having changes residing on someone else's machine 
that happens to be switched off. I suppose the alternative must be that 
the information is sat on a server somewhere.



I've been working with rpyc (in as much spare time as I can manage) on
some similar sounding issues and am now settling on a central system
which provides convenient administration and potential relaying or
pulling. See http://rpyc.sourceforge.net/

I just tested my in-process development status and find 64 remote
machines up and 5 non-responsive which in my case are likely machines
that are not yet configured properly. As this has been on the back
burner the past two months I'm pleased with how it's fared in the face
of neglect.

At least with rpyc (which does have a low learning curve) you'll be
fully in python.



Yes, it looks very interesting. Cheers.

Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: embed revision control in Python application?

2012-06-23 Thread duncan smith

On 23/06/12 06:45, rusi wrote:

On Jun 22, 8:58 pm, duncan smith
wrote:

Hello,
I have an application that would benefit from collaborative
working. Over time users construct a "data environment" which is a
number of files in JSON format contained in a few directories (in the
future I'll probably place these in a zip so the environment is
contained within a single file). At the moment there is one individual
constructing the data environment, and me occasionally applying
corrections after being e-mailed the files. But in the future there
might be several individuals in various locations.

As a minimum requirement I need to embed some sort of version control,
so that changes committed by one individual will be seen in the local
environments of the others. Some of the work involves editing graphs
which have restrictions on their structure. In this case it would be
useful for edits to be committed / seen in real time. The users will not
be particularly technical, so the version control will have to happen
relatively quietly in the background.

My immediate thoughts are to (somehow) embed Mercurial or Subversion. It
would certainly be useful to be able to revert to a previous version of
the data environment if an individual does something silly. But I'm not
actually convinced that this is the whole solution for collaborative
working. Any advice regarding the embedding of a version control system
or alternative approaches would be appreciated. I haven't tried anything
like this before. The desktop application is written in Python (2.6)
with a wxPython (2.8) GUI. Given the nature of the application / data
the machines involved might be locally networked but without web access
(if this makes a difference). TIA.

Duncan


If you are looking at mercurial and subversion you may want to look at
git also.

 From http://en.wikipedia.org/wiki/Git_%28software%29#Implementation
(quoting Linus Torvalds)
---
In many ways you can just see git as a filesystem — it's content-
addressable, and it has a notion of versioning, but I really really
designed it coming at the problem from the viewpoint of a filesystem
person (hey, kernels is what I do), and I actually have absolutely
zero interest in creating a traditional SCM system.

More details https://git.wiki.kernel.org/index.php/Git#Design
-
Of course its good to say upfront that git is mostly C+shell ie its
not python
There is gitpython http://packages.python.org/GitPython/0.1/tutorial.html
but I know nothing about it


Thanks. I'm trying to figure out whether I'm better of with a version 
control system, a virtual filesystem (e.g. 
http://code.google.com/p/pyfilesystem/), remote procedure calls or some 
combination of these.


What I really need is a flexible framework that I can experiment with, 
as it's not clear what the best strategy for collaborative working might 
be. e.g. It might be best to restrict working on certain elements of the 
data environment to a single individual. Cheers.


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: creating an artificial "last element" in sort list

2012-09-28 Thread duncan smith

On 29/09/12 00:51, dave wrote:

more clearer, this is a more realistic use case:

['awefawef', 'awefawfsf', 'awefsdf', 'zz', 'zz', 
'zz']

and the quantity of ''zz'' would be dynamic.



Maybe,

class Greatest:

    def __lt__(self, other):
        return False

    def __eq__(self, other):
        return type(other) == type(self)

etc.
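
The "etc." would cover the remaining comparisons so that things also work
when the Greatest instance is on the right hand side; filling it in (a
sketch) and substituting instances for the placeholder strings:

class Greatest(object):

    def __lt__(self, other):
        return False

    def __gt__(self, other):
        return not self == other

    def __le__(self, other):
        return self == other

    def __ge__(self, other):
        return True

    def __eq__(self, other):
        return type(other) == type(self)


data = ['awefawef', 'zz', 'awefawfsf', 'zz', 'awefsdf']
data = [Greatest() if v == 'zz' else v for v in data]
data.sort()   # the Greatest() instances now sort after every string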

Duncan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Strange object identity problem

2012-11-12 Thread duncan smith

On 12/11/12 13:40, F.R. wrote:

On 11/12/2012 02:27 PM, Robert Franke wrote:

Hi Frederic,

[...]


bas = {}
for year in range(2010, 2013):
    ba = st.runs('BA', '%d-01-01' % year, '%d-12-31' % year)
    ba.run()
    print year, id(ba)
    bas[year] = ba

2010 150289932
2011 150835852
2012 149727788


for y in sorted(bas.keys()):
    b = bas[year]

Shouldn't that be b = bas[y]?

Yes, it should, indeed! What's more, I should have closed and restarted
IDLE. There must have
been a name clash somewhere in the name space. The problem no longer
exists. Sorry
about that. And thanks to all who paused to reflect on this non-problem.
- Frederic.





 print y, id (b)

2010 149727788
2011 149727788
2012 149727788


[...]

Cheers,

Robert





The problem was that year was bound to the integer 2013 from the first 
loop. When you subsequently looped over the keys you printed each key 
followed by id(bas[2013]). Restarting IDLE only helped because you 
presumably didn't repeat the error.


Duncan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Exponential arrival distribution in Python

2012-11-29 Thread duncan smith

On 28/11/12 21:34, Ricky wrote:


Hi all,

I am doing a project on traffic simulation. I want to introduce exponential 
arrival distribution to precede this task. Therefore I want write a code in 
python for exponential arrival distribution. I am very new for programming and 
if anybody can help me on this that would be great.

Cheers,
Ricky



Maybe you mean something like,

>>> from random import expovariate
>>> expovariate(1)
0.09133428954823608
>>> expovariate(1)
2.5388809393383407
>>>

Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: String manipulation in python..NEED HELP!!!!

2012-12-11 Thread duncan smith

On 10/12/12 22:38, qbai...@ihets.org wrote:

I need help with a program i am doing. it is a cryptography program. i am given 
a regular alphabet and a key. i need to use the user input and use the regular 
alphabet and use the corresponding letter in the key and that becomes the new 
letter. i have the basic code but need help with how to mainpulate the string 
to do the encryption/decryption. please help

here is my code so far:


""" crypto.py
 Implements a simple substitution cypher
"""

alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
key =   "XPMGTDHLYONZBWEARKJUFSCIQV"

def main():
    keepGoing = True
    while keepGoing:
        response = menu()
        if response == "1":
            plain = raw_input("text to be encoded: ")
            print encode(plain)
        elif response == "2":
            coded = raw_input("code to be decyphered: ")
            print decode(coded)
        elif response == "0":
            print "Thanks for doing secret spy stuff with me."
            keepGoing = False
        else:
            print "I don't know what you want to do..."




i really need help on how to encrypt it im not sure how to go about doing that 
please help.



>>> alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
>>> key = "XPMGTDHLYONZBWEARKJUFSCIQV"
>>> mapping = {}
>>> for i, ch in enumerate(alpha):
        mapping[ch] = key[i]


>>> ''.join(mapping[ch] for ch in "ACE")
'XMT'
>>> ''.join(mapping[ch] for ch in "WORD")
'CEKG'
>>>
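
Decoding is just the inverted mapping:

>>> decode_mapping = dict((v, k) for k, v in mapping.items())
>>> ''.join(decode_mapping[ch] for ch in 'XMT')
'ACE'
>>> ''.join(decode_mapping[ch] for ch in 'CEKG')
'WORD'
>>>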


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Probabilistic unit tests?

2013-01-11 Thread duncan smith

On 11/01/13 01:59, Nick Mellor wrote:

Hi,

I've got a unit test that will usually succeed but sometimes fails. An 
occasional failure is expected and fine. It's failing all the time I want to 
test for.

What I want to test is "on average, there are the same number of males and females 
in a sample, give or take 2%."

Here's the unit test code:
import unittest
from collections import Counter

sex_count = Counter()
for contact in range(self.binary_check_sample_size):
    p = get_record_as_dict()
    sex_count[p['Sex']] += 1
self.assertAlmostEqual(sex_count['male'],
                       sex_count['female'],
                       delta=sample_size * 2.0 / 100.0)

My question is: how would you run an identical test 5 times and pass the group 
*as a whole* if only one or two iterations passed the test? Something like:

for n in range(5):
    # self.assertAlmostEqual(...)
    # if test passed: break
else:
    self.fail()

(except that would create 5+1 tests as written!)

Thanks for any thoughts,

Best wishes,

Nick



The appropriateness of "give or take 2%" will depend on sample size. 
e.g. If the proportion of males should be 0.5 and your sample size is 
small enough this will fail most of the time regardless of whether the 
proportion is 0.5.


What you could do is perform a statistical test. Generally this involves 
generating a p-value and rejecting the null hypothesis if the p-value is 
below some chosen threshold (Type I error rate), often taken to be 0.05. 
Here the null hypothesis would be that the underlying proportion of 
males is 0.5.


A statistical test will incorrectly reject a true null in a proportion 
of cases equal to the chosen Type I error rate. A test will also fail to 
reject false nulls a certain proportion of the time (the Type II error 
rate). The Type II error rate can be reduced by using larger samples. I 
prefer to generate several samples and test whether the proportion of 
failures is about equal to the error rate.


The above implies that p-values follow a [0,1] uniform density function 
if the null hypothesis is true. So alternatively you could generate many 
samples / p-values and test the p-values for uniformity. That is what I 
generally do:



p_values = []
for _ in range(numtests):
    values = data generated from code to be tested
    p_values.append(stat_test(values))
test p_values for uniformity


The result is still a test that will fail a given proportion of the 
time. You just have to live with that. Run your test suite several times 
and check that no one test is "failing" too regularly (more often than 
the chosen Type I error rate for the test of uniformity). My experience 
is that any issues generally result in the test of uniformity being 
consistently rejected (which is why a do that rather than just 
performing a single test on a single generated data set).


In your case you're testing a Binomial proportion and as long as you're 
generating enough data (you need to take into account any test 
assumptions / approximations) the observed proportions will be 
approximately normally distributed. Samples of e.g. 100 would be fine. 
P-values can be generated from the appropriate normal 
(http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval), 
and uniformity can be tested using e.g. the Kolmogorov-Smirnov or 
Anderson-Darling test 
(http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm).
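
A rough, self-contained sketch of that approach (my own illustration, not
production code; the function names are made up, and a library routine such
as scipy.stats.kstest would normally do the last step):

import math
import random

def binomial_pvalue(successes, n, p0=0.5):
    # two-sided p-value from the normal approximation to a binomial proportion
    phat = successes / float(n)
    se = math.sqrt(p0 * (1.0 - p0) / n)
    z = (phat - p0) / se
    return math.erfc(abs(z) / math.sqrt(2.0))

def ks_uniform_statistic(pvalues):
    # one-sample Kolmogorov-Smirnov statistic against Uniform(0, 1)
    pvalues = sorted(pvalues)
    n = len(pvalues)
    return max(max((i + 1.0) / n - p, p - float(i) / n)
               for i, p in enumerate(pvalues))

# 200 generated samples, each of 100 records with sex assigned by a fair coin
pvals = []
for _ in range(200):
    males = sum(random.random() < 0.5 for _ in range(100))
    pvals.append(binomial_pvalue(males, 100))

# reject uniformity (at roughly the 5% level) if this exceeds about 1.36/sqrt(200)
print(ks_uniform_statistic(pvals))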


I'd have thought that something like this also exists somewhere. How do 
people usually test e.g. functions that generate random variates, or 
other cases where deterministic tests don't cut it?


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Probabilistic unit tests?

2013-01-12 Thread duncan smith

On 12/01/13 08:07, alex23 wrote:

On 11 Jan, 13:34, Steven D'Aprano  wrote:

Well, that's not really a task for unit testing. Unit tests, like most
tests, are well suited to deterministic tests, but not really to
probabilistic testing. As far as I know, there aren't really any good
frameworks for probabilistic testing, so you're stuck with inventing your
own. (Possibly on top of unittest.)


One approach I've had success with is providing a seed to the RNG, so
that the random results are deterministic.



My ex-boss once instructed me to do the same thing to test functions for 
generating random variates. I used a statistical approach instead.


There are often several ways of generating data that follow a particular 
distribution. If you use a given seed so that you get a deterministic 
sequence of uniform random variates you will get deterministic outputs 
for a specific implementation. But if you change the implementation the 
tests are likely to fail. e.g. To generate a negative exponential 
variate -ln(U)/lambda or -ln(1-U)/lambda will do the job correctly, but 
tests for one implementation would fail with the other. So each time you 
changed the implementation you'd need to change the tests.
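
For example (a small illustration of the point, not code from any test suite):

import math
import random

def exp_variate_a(lam, rng):
    return -math.log(rng.random()) / lam        # one correct implementation

def exp_variate_b(lam, rng):
    return -math.log(1.0 - rng.random()) / lam  # an equally correct alternative

# same seed, same distribution, different numbers - so a test that hard-codes
# expected outputs only pins down one particular implementation
print(exp_variate_a(1.0, random.Random(0)))
print(exp_variate_b(1.0, random.Random(0)))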


I think my boss had in mind that I would write the code, seed the RNG, 
call the function a few times, then use the generated values in the 
test. That would not even have tested the original implementation. I 
would have had a test that would only have tested whether the 
implementation had changed. I would argue, worse than no test at all. If 
I'd gone to the trouble of manually calculating the expected outputs so 
that I got valid tests for the original implementation, then I would 
have had a test that would effectively just serve as a reminder to go 
through the whole manual calculation process again for any changed 
implementation.


A reasonably general statistical approach is possible. Any hypothesis 
about generated data that lends itself to statistical testing can be 
used to generate a sequence of p-values (one for each set of generated 
values) that can be checked (statistically) for uniformity. This 
effectively tests the distribution of the test statistic, so is better 
than simply testing whether tests on generated data pass, say, 95% of 
the time (for a chosen 5% Type I error rate). Cheers.


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Iterate from 2nd element of a huge list

2012-01-31 Thread duncan smith

On 01/02/12 01:39, Paulo da Silva wrote:

Hi!

What is the best way to iterate thru a huge list having the 1st element
a different process? I.e.:

process1(mylist[0])
for el in mylist[1:]:
    process2(el)

This way mylist is almost duplicated, isn't it?

Thanks.


Maybe (untested),

it = iter(mylist)
process1(it.next())
for el in it:
    process2(el)


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Distribution

2012-03-20 Thread duncan smith

On 20/03/12 04:31, prince.pangeni wrote:

Hi all,
I am doing a simulation project using Python. In my project, I want
to use some short of distribution to generate requests to a server.
The request should have two distributions. One for request arrival
rate (should be poisson) and another for request mix (i.e. out of the
total requests defined in request arrival rate, how many requests are
of which type).
Example: Suppose the request rate is - 90 req/sec (generated using
poisson distribution) at time t and we have 3 types of requests (i.e.
r1, r2, r3). The request mix distribution output should be similar to:
{r1 : 50 , r2 : 30 , r3 : 10} (i.e. out of 90 requests - 50 are of r1
type, 30 are of r2 type and 10 are of r3 type).
As I an new to python distribution module, I am not getting how to
code this situation. Please help me out for the same.

   Thanks in advance

Prince


Robert has given you a very good answer. The easiest way is to generate 
interarrival times using an exponential distribution, then for each 
event select the type from a categorical probability mass function. 
Perhaps the easiest and most efficient approach for the latter using 
your 'mix distribution' above is to create a list containing 5 instances 
of r1, 3 of r2 and 1 of r3. Then select the type by generating a random 
index into the list. It is not an ideal solution generally, but good 
when the parameters do not change and the required list is small.
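
A minimal sketch of that, using the numbers above:

import random

request_types = ['r1'] * 5 + ['r2'] * 3 + ['r3']   # the 50:30:10 mix
rate = 90.0                                        # mean requests per second

t = 0.0
for _ in range(10):
    t += random.expovariate(rate)                  # exponential interarrival time
    # selecting a type is just a random index into the small list
    rtype = request_types[random.randrange(len(request_types))]
    print('%.5f %s' % (t, rtype))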


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Newbie, homework help, please.

2012-04-21 Thread duncan smith

On 21/04/12 23:48, BartC wrote:

"someone"  wrote in message
news:9533449.630.1335042672358.JavaMail.geo-discussion-forums@ynmf4...

On Saturday, April 21, 2012 3:44:49 PM UTC-5, BartC wrote:



Hi, Bart: Thank you, your post is working now, maybe, I did something
wrong, unfortunately, you are right, my setup for getting the file to
pull
up correctly now is an issue, At first, I got a Vertical line with it
working, then I tried to tinker with it, and it fratched, lol
def border(text):
    maxwidth=0
    for s in text:
        if len(s)>maxwidth: maxwidth=len(s)
    vertinchlines=6 # assume 6 lines/inch
    hozinchchars=10 # assume 10 chars/inch
    hozmargin=" "*hozinchchars
    newtext=[]
    for i in range(vertinchlines):
        newtext.append("")
    newtext.append(hozmargin+"*"*(maxwidth+4)+hozmargin)
    newtext.append(hozmargin+"* "+" "*maxwidth+" *"+hozmargin)
    for s in text:
        newtext.append(hozmargin+"* "+s+" "*(maxwidth-len(s))+" *"+hozmargin)
    newtext.append(hozmargin+"* "+" "*maxwidth+" *"+hozmargin)
    newtext.append(hozmargin+"*"*(maxwidth+4)+hozmargin)
    for i in range(vertinchlines):
        newtext.append("")
    return newtext

x=textfile;indat=open(x,'r');SHI=indat.read()
textTuple = border(SHI)
for lines in textTuple:
    print ("%s\n" %textTuple)


The issue is trying to get the SHI to work right, but omg, this is the
closes I have gotten, you are awsome, thank you very much, i guess i will
just keep messing with it till i get it


I had to use this code to make this work right from a file (in additon
to the border() function):

textfile="kkk4" # (don't use this; this was my test input)

x=textfile;indat=open(x,'r');

SHI=indat.readlines()
indat.close()

for i in range(len(SHI)): # remove trailing '\n' from each line
    s=SHI[i]
    SHI[i]=(s[0:len(s)-1])

textTuple = border(SHI)

for lines in textTuple:
    print (lines)

Your indat.read() seemed to read all the lines as one long string. I used
indat.readlines() instead. However each line has a newline char '\n' at the
end. I put in a loop to get rid of that (I'm sure there's a one-line fix to
do that, but as I said don't know Python).



[snip]

line = line.rstrip()

would do the job. And as file objects are iterable,

indat = open(x,'r')
SHI = [line.rstrip() for line in indat]
indat.close()

textTuple = border(SHI)

etc.

Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: usenet reading

2012-05-26 Thread duncan smith

On 25/05/12 23:38, Jon Clements wrote:

Hi All,

Normally use Google Groups but it's becoming absolutely frustrating - not only 
has the interface changed to be frankly impractical, the posts are somewhat 
random of what appears, is posted and whatnot. (Ironically posted from GG)

Is there a server out there where I can get my news groups?


If you don't mind paying a small fee there are several companies 
providing usenet access such as http://www.newsdemon.com. (I didn't have 
much joy trying to track down a reliable free service, so now I pay a 
few pounds a year.)


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


sqlite INSERT performance

2012-05-30 Thread duncan smith

Hello,
  I have been attempting to speed up some code by using an sqlite 
database, but I'm not getting the performance gains I expected.


The use case:

I have text files containing data which may or may not include a header 
in the first line. Each line (other than the header) is a record, so all 
lines (when split on the relevant separator) should contain the same 
number of values. I need to generate new files in a very specific 
format; space separated, with header removed and integer codes 
substituted for the values in the parent file. e.g. If the first column 
(i.e. [line.strip('\r\n').split()[0] for line in file]) contained 15 
distinct strings, then I would substitute the values in the parent file 
with integers from 0 to 14 in the new file. The new file would contain a 
non-empty subset of the 'columns' in the original file, and might be 
conditioned on particular values of other columns.


My first effort read the parent file and generated a similar file 
containing integer codes. New files were generated by iterating over the 
lines of the file containing integer codes, splitting them, doing the 
required selection and conditioning via list comprehensions, joining the 
resultant lists, and writing to a new file. My test file has 67 columns 
and over a million records, and just creating the file of integers took 
a few minutes. (I also need to check for empty lines and skip them, and 
check for records of incorrect length.)


I have partially implemented an alternative approach where I write the 
data to an sqlite database. The idea is that I will add extra columns 
for the integer codes and insert the integer codes only when required 
for a new file. But I've been immediately hit with the cost of inserting 
the data into the database. It takes around 80 seconds (compared to the 
35 seconds needed to parse the original file and skip empty lines and 
check the record lengths). I have tried iterating over the records 
(lists of strings generated by csv.reader) and inserting each in turn. I 
have also tried executemany() passing the csv.reader as the second 
argument. I have also tried executing "PRAGMA synchronous=OFF". It still 
takes around 80 seconds.
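
For reference, the executemany version is essentially this (a sketch; the
file, table and column names are invented, and it assumes no header row and
no blank lines):

import csv
import sqlite3

NCOLS = 67

conn = sqlite3.connect('data.db')
conn.execute('PRAGMA synchronous=OFF')
conn.execute('CREATE TABLE data (%s)' %
             ', '.join('c%d TEXT' % i for i in range(NCOLS)))

with open('parent.txt', 'rb') as f:
    reader = csv.reader(f, delimiter=' ')
    sql = 'INSERT INTO data VALUES (%s)' % ', '.join('?' * NCOLS)
    conn.executemany(sql, reader)   # the inserts run in a single transaction
conn.commit()
conn.close()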


I'm a bit rusty with SQL, so I'd appreciate any advice on how to speed 
this up. I seem to remember (using MySQL years ago) that there was a way 
of dumping data in a text file to a table very quickly. If I could do 
this and do my data integrity checks afterwards, then that would be 
great. (Dumping data efficiently to a text file from an sqlite table 
would also be handy for generating my new files.) Alternatively, if I 
could substantially speed up the inserts then that would be great. Any 
advice appreciated. TIA.


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: sqlite INSERT performance

2012-05-31 Thread duncan smith

On 31/05/12 06:15, John Nagle wrote:

On 5/30/2012 6:57 PM, duncan smith wrote:

Hello,
I have been attempting to speed up some code by using an sqlite
database, but I'm not getting the performance gains I expected.


SQLite is a "lite" database. It's good for data that's read a
lot and not changed much. It's good for small data files. It's
so-so for large database loads. It's terrible for a heavy load of
simultaneous updates from multiple processes.



Once the table is created the data will not be changed at all. 
Corresponding integer codes will have to be generated for columns. (I 
want to do this lazily because some columns might never be needed for 
output files, and processing all columns was relatively expensive for my 
initial solution.) After that it's a series of 'SELECT a, b, ... FROM 
table WHERE f="g" ORDER by a, b, ...' style queries dumped to space 
separated text files.



However, wrapping the inserts into a transaction with BEGIN
and COMMIT may help.



Unfortunately there's no discernible difference.


If you have 67 columns in a table, you may be approaching the
problem incorrectly.



Quite possibly. I have defined start and end points. The data are 
contained in text files. I need to do the mapping to integer codes and 
generate output files for subsets of variables conditional on the levels 
of other variables. (I was doing the subsequent sorting separately, but 
if I'm using SQL I guess I might as well include that in the query.) The 
output files are inputs for other (C++) code that I have no control over.


Any approach that doesn't consume large amounts of memory will do. Cheers.

Duncan

--
http://mail.python.org/mailman/listinfo/python-list


Re: sqlite INSERT performance

2012-05-31 Thread duncan smith

On 31/05/12 17:06, Jon Clements wrote:

On Thursday, 31 May 2012 16:25:10 UTC+1, duncan smith  wrote:

On 31/05/12 06:15, John Nagle wrote:

On 5/30/2012 6:57 PM, duncan smith wrote:

Hello,
I have been attempting to speed up some code by using an sqlite
database, but I'm not getting the performance gains I expected.


SQLite is a "lite" database. It's good for data that's read a
lot and not changed much. It's good for small data files. It's
so-so for large database loads. It's terrible for a heavy load of
simultaneous updates from multiple processes.



Once the table is created the data will not be changed at all.
Corresponding integer codes will have to be generated for columns. (I
want to do this lazily because some columns might never be needed for
output files, and processing all columns was relatively expensive for my
initial solution.) After that it's a series of 'SELECT a, b, ... FROM
table WHERE f="g" ORDER by a, b, ...' style queries dumped to space
separated text files.


However, wrapping the inserts into a transaction with BEGIN
and COMMIT may help.



Unfortunately there's no discernible difference.


If you have 67 columns in a table, you may be approaching the
problem incorrectly.



Quite possibly. I have defined start and end points. The data are
contained in text files. I need to do the mapping to integer codes and
generate output files for subsets of variables conditional on the levels
of other variables. (I was doing the subsequent sorting separately, but
if I'm using SQL I guess I might as well include that in the query.) The
output files are inputs for other (C++) code that I have no control over.

Any approach that doesn't consume large amounts of memory will do. Cheers.

Duncan


It might be worth checking out https://sdm.lbl.gov/fastbit/ which has Python 
bindings (nb: the library itself takes a while to compile), but I'm not 100% 
sure it would meet all your requirements.

Jon


Interesting. It might actually be more useful for the types of query 
performed by the C++ part of the application. I tried something based on 
bitwise operations on integers but couldn't quite match the performance 
of the recursive algorithm implemented (by others) in C++. Cheers.


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Wondering in the Python Forrest

2011-07-30 Thread duncan smith

Ken Watford wrote:

On Sat, Jul 30, 2011 at 7:13 AM, ray  wrote:

I found that structured data could be presented in Python using a module in
wxPython.

Where am I?  I do not know the relationships between the Pythons.  I
feel that I am missing something.  I started with Python as it has so
much functionality and a huge user group of very helpful individuals
and sites.  Now that I have delved deeper, it seems I need different
Pythons.


I think the name has caused you some confusion. wxPython is not a
different Python, it's a package for Python that displays GUI
components using the C++ "wxWidgets" library.

While there are other Pythons out there, for scientific work you
should have everything you need in the one you've got. You may have to
install an additional package now and then, but that's it.


And pythonxy (http://www.pythonxy.com/) is basically that very same 
Python bundled together with additional scientific packages to save the 
hassle of installing them separately.


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: General Purpose Pipeline library?

2017-11-20 Thread duncan smith
On 20/11/17 15:48, Jason wrote:
> a pipeline can be described as a sequence of functions that are applied to an 
> input with each subsequent function getting the output of the preceding 
> function:
> 
> out = f6(f5(f4(f3(f2(f1(in))))))
> 
> However this isn't very readable and does not support conditionals.
> 
> Tensorflow has tensor-focused pipepines:
> fc1 = layers.fully_connected(x, 256, activation_fn=tf.nn.relu, scope='fc1')
> fc2 = layers.fully_connected(fc1, 256, activation_fn=tf.nn.relu, scope='fc2')
> out = layers.fully_connected(fc2, 10, activation_fn=None, scope='out')
> 
> I have some code which allows me to mimic this, but with an implied parameter.
> 
> def executePipeline(steps, collection_funcs = [map, filter, reduce]):
>     results = None
>     for step in steps:
>         func = step[0]
>         params = step[1]
>         if func in collection_funcs:
>             print func, params[0]
>             results = func(functools.partial(params[0], *params[1:]), results)
>         else:
>             print func
>             if results is None:
>                 results = func(*params)
>             else:
>                 results = func(*(params+(results,)))
>     return results
> 
> executePipeline( [
>   (read_rows, (in_file,)),
>   (map, (lower_row, field)),
>   (stash_rows, ('stashed_file', )),
>   (map, (lemmatize_row, field)),
>   (vectorize_rows, (field, min_count,)),
>   (evaluate_rows, (weights, None)),
>   (recombine_rows, ('stashed_file', )),
>   (write_rows, (out_file,))
>   ]
> )
> 
> Which gets me close, but I can't control where rows gets passed in. In the 
> above code, it is always the last parameter.
> 
> I feel like I'm reinventing a wheel here.  I was wondering if there's already 
> something that exists?
> 

Maybe Kamaelia?

http://www.kamaelia.org/Home.html

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Problem with assignment. Python error or mine?

2017-12-21 Thread duncan smith
On 21/12/17 19:06, John Ladasky wrote:
> On Thursday, December 21, 2017 at 7:37:39 AM UTC-8, MRAB wrote:
> 
>> Python never makes a copy unless you ask it to.
>>
>> What x1=X does is make the name x1 refer to the same object that X 
>> refers to. No copying.
> 
> Well, except with very simple, mutable data types like scalars... compare 
> this:
> 
> >>> x=5
> >>> y=x
> >>> x,y
> (5, 5)
> >>> x+=1
> >>> x,y
> (6, 5)
> 
> To this:
> 
> >>> a=[1,2,3]
> >>> b=a
> >>> a,b
> ([1, 2, 3], [1, 2, 3])
> >>> a[1]=9
> >>> a,b
> ([1, 9, 3], [1, 9, 3])
> 

Except ints aren't mutable and there's still no copying.

For

x += 1

(where x is e.g. an int) read

x = x + 1
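
A quick way to see that no copy is made and that += just rebinds the name:

>>> x = 5
>>> y = x
>>> x is y        # one object, two names
True
>>> x += 1        # x now refers to a different int; y is untouched
>>> x, y, x is y
(6, 5, False)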

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Speeding up the implementation of Stochastic Gradient Ascent in Python

2018-01-18 Thread duncan smith
On 17/01/18 14:29, leutrim.kal...@gmail.com wrote:
> 
> Hello everyone, 
> 
> I am implementing a time-dependent Recommender System which applies BPR 
> (Bayesian Personalized Ranking), where Stochastic Gradient Ascent is used to 
> learn the parameters of the model. Such that, one iteration involves sampling 
> randomly the quadruple (i.e. userID, positive_item, negative_item, 
> epoch_index) for n times, where n is the total number of positive feedbacks 
> (i.e. the number of ratings given to all items). But, as my implementation 
> takes too much time for learning the parameters (since it requires 100 
> iterations to learn the parameters), I was wondering if there is a way to 
> improve my code and speed up the learning process of the parameters.
> 
> Please find the code of my implementation (the update function named as 
> updateFactors is the one that learns the parameters, and such that I guess is 
> the one that should be improved in order to speed up the process of learning 
> the parameter values) in the following link: 
> https://codereview.stackexchange.com/questions/183707/speeding-up-the-implementation-of-stochastic-gradient-ascent-in-python
> 





dim0 = []
dim1 = []
...
dim19 = []


dim0.append((asin, self.theta_item_per_bin[bin][itemID][0]))
dim1.append((asin,self.theta_item_per_bin[bin][itemID][1]))
...
dim19.append((asin,self.theta_item_per_bin[bin][itemID][19]))


for d in range(self.K2):
    if d == 0:
        max_value = max(dim0, key=operator.itemgetter(1))
        asin_ = max_value[0]
        value_ = max_value[1]
        print 'dim:',d,', itemID: ', asin_, ', value = ', value_

    if d == 1:
        max_value = max(dim1, key=operator.itemgetter(1))
        asin_ = max_value[0]
        value_ = max_value[1]
        print 'dim:',d,', itemID: ', asin_, ', value = ', value_



    if d == 19:
        max_value = max(dim19, key=operator.itemgetter(1))
        asin_ = max_value[0]
        value_ = max_value[1]
        print 'dim:',d,', itemID: ', asin_, ', value = ', value_


How about something like,


dims = [[] for _ in range(20)]

for i in range(20):
    dims[i].append((asin, self.theta_item_per_bin[bin][itemID][i]))

for j in range(self.K2):
    max_value = max(dims[j], key=operator.itemgetter(1))
    asin_ = max_value[0]
    value_ = max_value[1]
    print 'dim:', j,', itemID: ', asin_, ', value = ', value_


I haven't looked at it in detail, and I'm not sure why you have exactly
20 lists or what happens if self.K2 is not equal to 20. But you can make
the code simpler and shorter, and marginally more efficient by not
testing all those if statements.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: plot / graph connecting re ordered lists

2018-01-23 Thread duncan smith
On 23/01/18 23:42, Vincent Davis wrote:
> On Tue, Jan 23, 2018 at 4:15 PM Dennis Lee Bieber 
> wrote:
> 
>> On Tue, 23 Jan 2018 13:51:55 -0700, Vincent Davis
>>  declaimed the following:
>>
>>> Looking for suggestions. I have an ordered list of names; these names will
>>> be reordered. I am looking to make a plot, graph, with the two origins of
>>
>> IE: you have two lists with the same items in different orders...
>>
>>> the names in separate columns and a line connecting them to visually
>>> represent how much they have moved in the reordering.
>>> Surely there is some great example code for this on the net an am not
>>> finding a clean example.
>>>
>>
>> Determine positions:
>>
>> pos = []
>> for p, name in enumerate(first_list):
>> np = second_list.index(name)
>> pos.append( (name, p, np) )
>>
>> for (name, p, np) in pos:
>> draw_line((1,p) , (2, np))
>> label( (1, p), name)
>>
>> Exact details of graphics package and scaling left as an exercise
> 
> 
> Actually, it's recommendations for a graphing package and an example using it
> for such a graph that I am most interested in. I know how to relate the
> names on the 2 lists.
> 
> 
>> --
>> Wulfraed Dennis Lee Bieber AF6VN
>> wlfr...@ix.netcom.com    HTTP://wlfraed.home.netcom.com/
>>
>> --
>> https://mail.python.org/mailman/listinfo/python-list
>>

Maybe http://graphviz.org/ and http://matthiaseisen.com/articles/graphviz/
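
If you'd rather use a general plotting library, a rough matplotlib sketch of
the two-column "slope" chart might look something like this (function and
label names are made up, and it's untested against your data):

import matplotlib.pyplot as plt

def slope_chart(first_list, second_list):
    fig, ax = plt.subplots()
    for p, name in enumerate(first_list):
        np_ = second_list.index(name)
        # line connecting the old position to the new position
        ax.plot([0, 1], [p, np_], marker='o')
        ax.text(-0.05, p, name, ha='right', va='center')
        ax.text(1.05, np_, name, ha='left', va='center')
    ax.set_xticks([0, 1])
    ax.set_xticklabels(['before', 'after'])
    ax.invert_yaxis()    # rank 0 at the top
    plt.show()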

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How do you debug in Python? Coming from a Matlab and R user. I'm already aware of pdb.

2021-01-27 Thread duncan smith
On 27/01/2021 22:41, C W wrote:
> Great tutorial Irv, very simple with if-else example, gets the point
> across.
> 
> My main takeaway from the discussion so far is that: you can't troubleshoot
> Python without some kind of breakpoint or debugger.
> 

[snip]

Really?

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


mutating a deque whilst iterating over it

2021-02-11 Thread duncan smith
Hello,
  It seems that I can mutate a deque while iterating over it if I
assign to an index, but not if I append to it. Is this the intended
behaviour? It seems a bit inconsistent. Cheers.

Duncan

>>> from collections import deque
>>> d = deque(range(8))
>>> it = iter(d)
>>> next(it)
0
>>> d[1] = 78
>>> next(it)
78
>>> d.append(8)
>>> next(it)
Traceback (most recent call last):
  File "", line 1, in 
next(it)
RuntimeError: deque mutated during iteration
>>>
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: mutating a deque whilst iterating over it

2021-02-13 Thread duncan smith
On 12/02/2021 03:04, Terry Reedy wrote:
> On 2/11/2021 3:22 PM, duncan smith wrote:
> 
>>    It seems that I can mutate a deque while iterating over it if I
>> assign to an index, but not if I append to it. Is this the intended
>> behaviour?
> 
> Does the deque doc say anything about mutation while iterating? (Knowing
> the author of deque, I would consider everything about it intentional
> without *good* reason to think otherwise.
> 
>>>>> from collections import deque
>>>>> d = deque(range(8))
>>>>> it = iter(d)
>>>>> next(it)
>> 0
>>>>> d[1] = 78
> 
> This does not change the structure of the deque, so next does not
> notice.  It could be considered not be a mutation.  It could be detected
> by changing deque.__setitem__, but why bother and slow down all
> __setitem__ calls.
> 
>>>>> next(it)
>> 78
>>>>> d.append(8)
> 
> This changes the structure, which can apparently mess-up iteration.
> 
>>>>> next(it)
>> Traceback (most recent call last):
>>    File "", line 1, in 
>>  next(it)
>> RuntimeError: deque mutated during iteration
>>>>>
> 
> 

What I was really wondering was whether the behaviour is as it is
because of the implementation or because it's how deques should ideally
behave. i.e. In my implementation do I stick strictly to the same API,
or allow it to differ? In some places I'm jumping through hoops to stick
to the API, and (when it comes to iteration) I'm jumping through
different hoops for different types of container (e.g. lists versus
deques). BTW, the reason I am implementing these at all is that my
containers are on-disk. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: mutating a deque whilst iterating over it

2021-02-14 Thread duncan smith
On 13/02/2021 19:12, Dan Stromberg wrote:
> On Sat, Feb 13, 2021 at 10:25 AM duncan smith 
> wrote:
> 
>> On 12/02/2021 03:04, Terry Reedy wrote:
>>> On 2/11/2021 3:22 PM, duncan smith wrote:
>>>
>>>>It seems that I can mutate a deque while iterating over it if I
>>>> assign to an index, but not if I append to it. Is this the intended
>>>> behaviour?
>>
>> What I was really wondering was whether the behaviour is as it is
>> because of the implementation or because it's how deques should ideally
>> behave. i.e. In my implementation do I stick strictly to the same API,
>> or allow it to differ? In some places I'm jumping through hoops to stick
>> to the API, and (when it comes to iteration) I'm jumping through
>> different hoops for different types of container (e.g. lists versus
>> deques). BTW, the reason I am implementing these at all is that my
>> containers are on-disk. Cheers.
>>
> collections.deque appears to take the approach of "allow everything we
> can based on our implementation, and trust the client not to overuse
> features".
> 
> In fact, in Python, "over abstraction" is often seen as a bad thing, mostly
> because it slows things down so much.
> 
> If you want something more abstracted, you might have a look at:
> https://stromberg.dnsalias.org/~strombrg/linked-list/
> https://pypi.org/project/linked_list_mod/
> 
> (Both URL's are about the same module)
> 
> Note that linked_list_mod is quite a bit slower than a collections.deque.
> Deques use a clever-but-hidden trick to gain a lot of speed - it's really a
> linked list of built in lists, which gives better locality of reference
> that makes CPU caches very happy.  linked_list_mod goes for abstraction and
> simplicity.
> 

I'm basically nicking the trick. I have a basic, single file, on-disk
deque class that doesn't stick strictly to the Python deque API, then a
second class that implements a deque as a Python deque containing
instances of my basic on-disk deque class (with fixed size apart from
the first and last instances). Of course, this could change when I get
to testing / profiling. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-08 Thread duncan smith

On 08/04/2022 08:21, Antoon Pardon wrote:



Op 8/04/2022 om 08:24 schreef Peter J. Holzer:

On 2022-04-07 17:16:41 +0200, Antoon Pardon wrote:

Op 7/04/2022 om 16:08 schreef Joel Goldstick:
On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon   
wrote:
I am working with a list of data from which I have to weed out 
duplicates.

At the moment I keep for each entry a container with the other entries
that are still possible duplicates.

[...]
Sorry I wasn't clear. The data contains information about persons. But not
all records need to be complete. So a person can occur multiple times in
the list, while the records are all different because they are missing
different bits.

So all records with the same firstname can be duplicates. But if I have
a record in which the firstname is missing, it can at that point be
a duplicate of all other records.

There are two problems. The first one is how do you establish identity.
The second is how do you ween out identical objects. In your first mail
you only asked about the second, but that's easy.

The first is really hard. Not only may information be missing, no single
single piece of information is unique or immutable. Two people may have
the same name (I know about several other "Peter Holzer"s), a single
person might change their name (when I was younger I went by my middle
name - how would you know that "Peter Holzer" and "Hansi Holzer" are the
same person?), they will move (= change their address), change jobs,
etc. Unless you have a unique immutable identifier that's enforced by
some authority (like a social security number[1]), I don't think there
is a chance to do that reliably in a program (although with enough data,
a heuristic may be good enough).


Yes I know all that. That is why I keep a bucket of possible duplicates
per "identifying" field that is examined and use some heuristics at the
end of all the comparing instead of starting to weed out the duplicates
at the moment something differs.

The problem is, that when an identifying field is judged to be unusable,
the bucket to be associated with it should conceptually contain all other
records (which in this case are the indexes into the population list).
But that will eat a lot of memory. So I want some object that behaves as
if it is a (immutable) list of all these indexes without actually containing
them. A range object almost works, with the only problem it is not
comparable with a list.



Is there any reason why you can't use ints? Just set the relevant bits.
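
E.g. (a rough sketch of the idea, with made-up indexes):

bucket = 0
bucket |= 1 << 5                  # add record index 5 to the bucket
bucket |= 1 << 123456             # indexes can be arbitrarily large
print(bool(bucket & (1 << 5)))    # membership test -> True
bucket &= ~(1 << 5)               # remove index 5 again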

Duncan
--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-09 Thread duncan smith

On 08/04/2022 22:08, Antoon Pardon wrote:


Op 8/04/2022 om 16:28 schreef duncan smith:

On 08/04/2022 08:21, Antoon Pardon wrote:


Yes I know all that. That is why I keep a bucket of possible duplicates
per "identifying" field that is examined and use some heuristics at the
end of all the comparing instead of starting to weed out the duplicates
at the moment something differs.

The problem is, that when an identifying field is judged to be unusable,
the bucket to be associated with it should conceptually contain all other
records (which in this case are the indexes into the population list).
But that will eat a lot of memory. So I want some object that behaves as
if it is a (immutable) list of all these indexes without actually containing
them. A range object almost works, with the only problem it is not
comparable with a list.



Is there any reason why you can't use ints? Just set the relevant bits.


Well my first thought is that a bitset makes it less obvious to calculate
the size of the set or to iterate over its elements. But it is an idea
worth exploring.





def popcount(n):
    """
    Returns the number of set bits in n
    """
    cnt = 0
    while n:
        n &= n - 1
        cnt += 1
    return cnt

and not tested,

def iterinds(n):
    """
    Returns a generator of the indices of the set bits of n
    """
    i = 0
    while n:
        if n & 1:
            yield i
        n = n >> 1
        i += 1

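For what it's worth, a quick check of both in the interpreter:

>>> n = (1 << 0) | (1 << 1) | (1 << 3)    # bits 0, 1 and 3 set
>>> popcount(n)
3
>>> list(iterinds(n))
[0, 1, 3]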

Duncan


--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-10 Thread duncan smith

On 10/04/2022 21:20, Antoon Pardon wrote:



Op 9/04/2022 om 02:01 schreef duncan smith:

On 08/04/2022 22:08, Antoon Pardon wrote:


Well my first thought is that a bitset makes it less obvious to calculate
the size of the set or to iterate over its elements. But it is an idea
worth exploring.





def popcount(n):
    """
    Returns the number of set bits in n
    """
    cnt = 0
    while n:
        n &= n - 1
        cnt += 1
    return cnt

and not tested,

def iterinds(n):
    """
    Returns a generator of the indices of the set bits of n
    """
    i = 0
    while n:
        if n & 1:
            yield i
        n = n >> 1
        i += 1


Sure but these seem rather naive implementations with a time complexity of
O(n) where n is the maximum number of possible elements. Using these would
turn my O(n) algorithm in a O(n^2) algorithm.



I thought your main concern was memory. Of course, dependent on various 
factors, you might be able to do much better than the above. But I don't 
know what your O(n) algorithm is, how using a bitset would make it 
O(n^2), or if the O(n^2) algorithm would actually be slower for typical 
n. The overall thing sounds broadly like some of the blocking and 
clustering methods I've come across in record linkage.


Duncan
--
https://mail.python.org/mailman/listinfo/python-list


Re: How to pick out the same titles.

2016-10-16 Thread duncan smith
On 16/10/16 16:16, Seymore4Head wrote:
> How to pick out the same titles.
> 
> I have a  long text file that has movie titles in it and I would like
> to find dupes.
> 
> The thing is that sometimes I have one called "The Killing Fields" and
> it also could be listed as "Killing Fields"  Sometimes the title will
> have the date a year off.
> 
> What I would like to do it output to another file that show those two
> as a match.
> 
> I don't know the best way to tackle this.  I would think you would
> have to pair the titles with the most consecutive letters in a row.
> 
> Anyone want this as a practice exercise?  I don't really use
> programming enough to remember how.
> 

Tokenize, generate (token) set similarity scores and cluster on
similarity score.


>>> import tokenization
>>> bigrams1 = tokenization.n_grams("The Killing Fields".lower(), 2, pad=True)
>>> bigrams1
['_t', 'th', 'he', 'e ', ' k', 'ki', 'il', 'll', 'li', 'in', 'ng', 'g ',
' f', 'fi', 'ie', 'el', 'ld', 'ds', 's_']
>>> bigrams2 = tokenization.n_grams("Killing Fields".lower(), 2, pad=True)
>>> import pseudo
>>> pseudo.Jaccard(bigrams1, bigrams2)
0.7


You could probably just generate token sets, then iterate through all
title pairs and manually review those with similarity scores above a
suitable threshold. The code I used above is very simple (and pasted below).


def n_grams(s, n, pad=False):
    # n >= 1
    # returns a list of n-grams
    # or an empty list if n > len(s)
    if pad:
        s = '_' * (n-1) + s + '_' * (n-1)
    return [s[i:i+n] for i in range(len(s)-n+1)]

def Jaccard(tokens1, tokens2):
    # returns exact Jaccard
    # similarity measure for
    # two token sets
    tokens1 = set(tokens1)
    tokens2 = set(tokens2)
    return len(tokens1&tokens2) / len(tokens1|tokens2)

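A sketch of the pairwise review step using those two functions (the 0.6
threshold is arbitrary - tune it against your data):

import itertools

def candidate_dupes(titles, threshold=0.6):
    # pre-compute the padded bigram set for each title
    token_sets = {t: set(n_grams(t.lower(), 2, pad=True)) for t in titles}
    pairs = []
    for t1, t2 in itertools.combinations(titles, 2):
        score = Jaccard(token_sets[t1], token_sets[t2])
        if score >= threshold:
            pairs.append((score, t1, t2))
    return sorted(pairs, reverse=True)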

Duncan


-- 
https://mail.python.org/mailman/listinfo/python-list


retain dimensions for numpy slice

2016-10-24 Thread duncan smith
Hello,
  I have several arrays that I need to combine elementwise in
various fashions. They are basically probability tables and there is a
mapping of axes to variables. I have code for transposing and reshaping
that aligns the variables / axes so the usual broadcasting rules achieve
the desired objective. But for a specific application I want to avoid
the transposing and reshaping. So I've specified arrays that contain the
full dimensionality (dimensions equal to the total number of variables).
e.g.

Arrays with shape,

[1,3,3] and [2,3,1]

to represent probability tables with variables

[B,C] and [A,B].

One operation that I need that is not elementwise is summing over axes,
but I can use numpy.sum with keepdims=True to retain the appropriate shape.

The problem I have is with slicing. This drops dimensions. Does anyone
know of a solution to this so that I can e.g. take an array with shape
[2,3,1] and generate a slice with shape [2,1,1]? I'm hoping to avoid
having to manually reshape it. Thanks.
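
For concreteness, the behaviour I'm trying to avoid:

>>> import numpy
>>> a = numpy.zeros((2, 3, 1))
>>> a[:, 0, :].shape                       # integer indexing drops the axis
(2, 1)
>>> a.sum(axis=1, keepdims=True).shape     # whereas summing can keep it
(2, 1, 1)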

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: retain dimensions for numpy slice

2016-10-24 Thread duncan smith
On 24/10/16 19:05, Peter Otten wrote:
> duncan smith wrote:
> 
>> Hello,
>>   I have several arrays that I need to combine elementwise in
>> various fashions. They are basically probability tables and there is a
>> mapping of axes to variables. I have code for transposing and reshaping
>> that aligns the variables / axes so the usual broadcasting rules achieve
>> the desired objective. But for a specific application I want to avoid
>> the transposing and reshaping. So I've specified arrays that contain the
>> full dimensionality (dimensions equal to the total number of variables).
>> e.g.
>>
>> Arrays with shape,
>>
>> [1,3,3] and [2,3,1]
>>
>> to represent probability tables with variables
>>
>> [B,C] and [A,B].
>>
>> One operation that I need that is not elementwise is summing over axes,
>> but I can use numpy.sum with keepdims=True to retain the appropriate
>> shape.
>>
>> The problem I have is with slicing. This drops dimensions. Does anyone
>> know of a solution to this so that I can e.g. take an array with shape
>> [2,3,1] and generate a slice with shape [2,1,1]? I'm hoping to avoid
>> having to manually reshape it. Thanks.
> 
> Can you clarify your requirement or give an example of what you want?
> 
> Given an array 
> 
>>>> a.shape
> (2, 3, 1)
> 
> you can get a slice with shape (2,1,1) with (for example)
> 
>>>> a[:,:1,:].shape
> (2, 1, 1)
> 
> or even
> 
>>>> newshape = (2, 1, 1)
>>>> a[tuple(slice(d) for d in newshape)].shape
> (2, 1, 1)
> 
> but that's probably not what you are asking for...
> 

Thanks. I think that's exactly what I wanted.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: need some kind of "coherence index" for a group of strings

2016-11-04 Thread duncan smith
On 03/11/16 16:18, Fillmore wrote:
> 
> Hi there, apologies for the generic question. Here is my problem: let's
> say that I have a list of lists of strings.
> 
> list1:#strings are sort of similar to one another
> 
>   my_nice_string_blabla
>   my_nice_string_blqbli
>   my_nice_string_bl0bla
>   my_nice_string_aru
> 
> 
> list2:#strings are mostly different from one another
> 
>   my_nice_string_blabla
>   some_other_string
>   yet_another_unrelated string
>   wow_totally_different_from_others_too
> 
> 
> I would like an algorithm that can look at the strings and determine
> that strings in list1 are sort of similar to one another, while the
> strings in list2 are all different.
> Ideally, it would be nice to have some kind of 'coherence index' that I
> can exploit to separate lists given a certain threshold.
> 
> I was about to concoct something using Levenshtein distance, but then I
> figured that it would be expensive to compute and I may be reinventing
> the wheel.
> 
> Thanks in advance to python masters that may have suggestions...
> 
> 
> 

https://pypi.python.org/pypi/jellyfish/

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


OSError: [Errno 12] Cannot allocate memory

2016-11-30 Thread duncan smith
Hello,
  I have had an issue with some code for a while now, and I have not
been able to solve it. I use the subprocess module to invoke dot
(Graphviz) to generate a file. But if I do this repeatedly I end up with
an error. The following traceback is from a larger application, but it
appears to be repeated calls to 'to_image' that is the issue.


Traceback (most recent call last):
  File "", line 1, in 
z = link_exp.sim1((djt, tables), variables, 1000, 400, 600,
[0,1,2,3,4,5,6], [6,7,8,9,10], ind_gens=[link_exp.males_gen()],
ind_gens_names=['Forename'], seed='duncan')
  File "link_exp.py", line 469, in sim1
RL_F2 = EM_setup(data)
  File "link_exp.py", line 712, in full_EM
last_g = prop.djt.g
  File "Nin.py", line 848, in draw_model
dot_g.to_image(filename, prog='dot', format=format)
  File "dot.py", line 597, in to_image
to_image(str(self), filename, prog, format)
  File "dot.py", line 921, in to_image
_execute('%s -T%s -o %s' % (prog, format, filename))
  File "dot.py", line 887, in _execute
close_fds=True)
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1235, in _execute_child
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory


The relevant (AFAICT) code is,


def to_image(text, filename, prog='dot', format='dot'):
    # prog can be a series of commands
    # like 'unflatten -l 3 | dot'
    handle, temp_path = tempfile.mkstemp()
    f = open(temp_path, 'w')
    try:
        f.write(text)
        f.close()
        progs = prog.split('|')
        progs[0] = progs[0] + ' %s ' % temp_path
        prog = '|'.join(progs)
        _execute('%s -T%s -o %s' % (prog, format, filename))
    finally:
        f.close()
        os.remove(temp_path)
        os.close(handle)

def _execute(command):
    # shell=True security hazard?
    p = subprocess.Popen(command, shell=True, stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT,
                         close_fds=True)
    output = p.stdout.read()
    p.stdin.close()
    p.stdout.close()
    #p.communicate()
    if output:
        print output


Any help solving this would be appreciated. Searching around suggests
this is something to do with file handles, but my various attempts to
solve it have failed. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: OSError: [Errno 12] Cannot allocate memory

2016-11-30 Thread duncan smith

[snip]

Sorry, should have said Python 2.7.12 on Ubuntu 16.04.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: OSError: [Errno 12] Cannot allocate memory

2016-11-30 Thread duncan smith
On 30/11/16 17:57, Chris Angelico wrote:
> On Thu, Dec 1, 2016 at 4:34 AM, duncan smith  wrote:
>>
>> def _execute(command):
>>     # shell=True security hazard?
>>     p = subprocess.Popen(command, shell=True, stdin=subprocess.PIPE,
>>                          stdout=subprocess.PIPE,
>>                          stderr=subprocess.STDOUT,
>>                          close_fds=True)
>>     output = p.stdout.read()
>>     p.stdin.close()
>>     p.stdout.close()
>>     #p.communicate()
>>     if output:
>>         print output
> 
> Do you ever wait() these processes? If not, you might be leaving a
> whole lot of zombies behind, which will eventually exhaust your
> process table.
> 
> ChrisA
> 

No. I've just called this several thousand times (via calls from a
higher level function) and had no apparent problem. Top reports no
zombie tasks, and memory consumption and the number of sleeping tasks
seem to be reasonably stable. I'll try running the code that generated
the error to see if I can coerce it into failing again. OK, no error
this time. Great, an intermittent bug that's hard to reproduce ;-). At
the end of the day I just want to invoke dot to produce an image file
(perhaps many times). Is there perhaps a simpler and more reliable way
to do this? Or do I just add the p.wait()? (The commented out
p.communicate() is there from a previous, failed attempt to fix this -
as, I think, are the shell=True and close_fds=True.) Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: OSError: [Errno 12] Cannot allocate memory

2016-11-30 Thread duncan smith
On 30/11/16 17:53, Chris Kaynor wrote:
> On Wed, Nov 30, 2016 at 9:34 AM, duncan smith  wrote:
>> Hello,
>>   I have had an issue with some code for a while now, and I have not
>> been able to solve it. I use the subprocess module to invoke dot
>> (Graphviz) to generate a file. But if I do this repeatedly I end up with
>> an error. The following traceback is from a larger application, but it
>> appears to be repeated calls to 'to_image' that is the issue.
> 
> I don't see any glaring problems that would obviously cause this,
> however have you checked to see if the processes are actually exiting
> (it looks like you are on Linux, so the top command)?
> 
>>
>>
>> Traceback (most recent call last):
>>   File "", line 1, in 
>> z = link_exp.sim1((djt, tables), variables, 1000, 400, 600,
>> [0,1,2,3,4,5,6], [6,7,8,9,10], ind_gens=[link_exp.males_gen()],
>> ind_gens_names=['Forename'], seed='duncan')
>>   File "link_exp.py", line 469, in sim1
>> RL_F2 = EM_setup(data)
>>   File "link_exp.py", line 712, in full_EM
>> last_g = prop.djt.g
>>   File "Nin.py", line 848, in draw_model
>> dot_g.to_image(filename, prog='dot', format=format)
>>   File "dot.py", line 597, in to_image
>> to_image(str(self), filename, prog, format)
>>   File "dot.py", line 921, in to_image
>> _execute('%s -T%s -o %s' % (prog, format, filename))
>>   File "dot.py", line 887, in _execute
>> close_fds=True)
>>   File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
>> errread, errwrite)
>>   File "/usr/lib/python2.7/subprocess.py", line 1235, in _execute_child
>> self.pid = os.fork()
>> OSError: [Errno 12] Cannot allocate memory
>>
>>
>> The relevant (AFAICT) code is,
>>
>>
>> def to_image(text, filename, prog='dot', format='dot'):
>>     # prog can be a series of commands
>>     # like 'unflatten -l 3 | dot'
>>     handle, temp_path = tempfile.mkstemp()
>>     f = open(temp_path, 'w')
>>     try:
>>         f.write(text)
>>         f.close()
>>         progs = prog.split('|')
>>         progs[0] = progs[0] + ' %s ' % temp_path
>>         prog = '|'.join(progs)
>>         _execute('%s -T%s -o %s' % (prog, format, filename))
>>     finally:
>>         f.close()
>>         os.remove(temp_path)
>>         os.close(handle)
>>
>> def _execute(command):
>>     # shell=True security hazard?
>>     p = subprocess.Popen(command, shell=True, stdin=subprocess.PIPE,
>>                          stdout=subprocess.PIPE,
>>                          stderr=subprocess.STDOUT,
>>                          close_fds=True)
>>     output = p.stdout.read()
>>     p.stdin.close()
>>     p.stdout.close()
>>     #p.communicate()
>>     if output:
>>         print output
> 
> This code has a potential dead-lock. If you are calling it from
> multiple threads/processes, it could cause issues. This should be
> obvious, as your program will also not exit. The communicate call is
> safe, but commented out (you'd need to remove the three lines above it
> as well). Additionally, you could just set stdin=None rather than
> PIPE, which avoids the dead-lock, and you aren't using stdin anyways.
> This issues comes if the subprocess may ever wait for something to be
> written to stdin, it will block forever, but your call to read will
> also block until it closes stdout (or possibly other cases). Another
> option would be to close stdin before starting the read, however if
> you ever write to stdin, you'll reintroduce the same issue, depending
> on OS buffer sizes.
> 
> My question above also comes from the fact that I am not 100% sure
> when stdout.read() will return. It is possible that a null or EOF
> could cause it to return before the process actually exits. The
> subprocess could also expliciting close its stdout, causing it to
> return while the process is still running. I'd recommend adding a
> p.wait() or just uncommenting the p.communicate() call to avoid these
> issues.
> 
> Another, unrelated note, the security hazard depends on where the
> arguments to execute are coming from. If any of those are controlled
> from untrusted sources (namely, user input), you have a
> shell-injection attack. Imagine, for example, if the user requests the
> filename "a.jpg|wipehd" (note: I don't know the format command on
> Linux, so replace with your desired command). This will cause your
> code to wipe the HD by piping into the command. If all of the inputs
> are 100% sanitized or come from trusted sources, you're fine, however
> that can be extremely difficult to guarantee.
> 

Thanks. So something like the following might do the job?

def _execute(command):
    p = subprocess.Popen(command, shell=False,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT,
                         close_fds=True)
    out_data, err_data = p.communicate()
    if err_data:
        print err_data


Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: OSError: [Errno 12] Cannot allocate memory

2016-11-30 Thread duncan smith
On 01/12/16 00:46, Chris Kaynor wrote:
> On Wed, Nov 30, 2016 at 4:12 PM, duncan smith  wrote:
>> On 30/11/16 17:57, Chris Angelico wrote:
>>> On Thu, Dec 1, 2016 at 4:34 AM, duncan smith  wrote:
>>>>
>>>> def _execute(command):
>>>>     # shell=True security hazard?
>>>>     p = subprocess.Popen(command, shell=True, stdin=subprocess.PIPE,
>>>>                          stdout=subprocess.PIPE,
>>>>                          stderr=subprocess.STDOUT,
>>>>                          close_fds=True)
>>>>     output = p.stdout.read()
>>>>     p.stdin.close()
>>>>     p.stdout.close()
>>>>     #p.communicate()
>>>>     if output:
>>>>         print output
>>>
>>> Do you ever wait() these processes? If not, you might be leaving a
>>> whole lot of zombies behind, which will eventually exhaust your
>>> process table.
>>>
>>> ChrisA
>>>
>>
>> No. I've just called this several thousand times (via calls from a
>> higher level function) and had no apparent problem. Top reports no
>> zombie tasks, and memory consumption and the number of sleeping tasks
>> seem to be reasonably stable. I'll try running the code that generated
>> the error to see if I can coerce it into failing again. OK, no error
>> this time. Great, an intermittent bug that's hard to reproduce ;-). At
>> the end of the day I just want to invoke dot to produce an image file
>> (perhaps many times). Is there perhaps a simpler and more reliable way
>> to do this? Or do I just add the p.wait()? (The commented out
>> p.communicate() is there from a previous, failed attempt to fix this -
>> as, I think, are the shell=True and close_fds=True.) Cheers.
> 
> That would appear to rule out the most common issues I would think of.
> 
> That said, are these calls being done in a tight loop (the full
> call-stack implies it might be a physics simulation)? Are you doing
> any threading (either in Python or when making the calls to Python -
> using a bash command to start new processes without waiting counts)?
> Is there any exception handling at a higher level that might be
> continuing past the error and sometimes allowing a zombie process to
> stay?
> 

In this case the calls *are* in a loop (record linkage using an
expectation maximization algorithm).

> If you are making a bunch of calls in a tight loop, that could be your
> issue, especially as you are not waiting on the process (though the
> communicate does so implicitly, and thus should have fixed the issue).
> This could be intermittent if the processes sometimes complete
> quickly, and other times are delayed. In these cases, a ton of the dot
> processes (and shell with shell=true) could be created before any
> finish, thus causing massive usage. Some of the processes may be
> hanging, rather than outright crashing, and thus leaking some
> resources.
> 

I'll try the p.communicate thing again. The last time I tried it I might
have already got myself into a situation where launching more
subprocesses was bound to fail. I'll edit the code, launch IDLE again
and see if it still happens.

> BTW, the docstring in to_image implies that the shell=True is not an
> attempted fix for this - the example 'unflatten -l 3 | dot' is
> explicitly suggesting the usage of shell=True.
> 


OK. As you can see, I don't really understand what's happening under the
hood :-). Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: OSError: [Errno 12] Cannot allocate memory

2016-12-01 Thread duncan smith
On 01/12/16 01:12, Chris Kaynor wrote:
> On Wed, Nov 30, 2016 at 4:54 PM, duncan smith  wrote:
>>
>> Thanks. So something like the following might do the job?
>>
>> def _execute(command):
>>     p = subprocess.Popen(command, shell=False,
>>                          stdout=subprocess.PIPE,
>>                          stderr=subprocess.STDOUT,
>>                          close_fds=True)
>>     out_data, err_data = p.communicate()
>>     if err_data:
>>         print err_data
> 
> I did not notice it when I sent my first e-mail (but noted it in my
> second one) that the docstring in to_image is presuming that
> shell=True. That said, as it seems everybody is at a loss to explain
> your issue, perhaps there is some oddity, and if everything appears to
> work with shell=False, it may be worth changing to see if it does fix
> the problem. With other information since provided, it is unlikely,
> however.
> 
> Not specifying the stdin may help, however it will only reduce the
> file handle count by 1 per call (from 2), so there is probably a root
> problem that it will not help.
> 
> I would expect the communicate change to fix the problem, except for
> your follow-up indicating that you had tried that before without
> success.
> 
> Removing the manual stdout.read may fix it, if the problem is due to
> hanging processes, but again, your follow-up indicates thats not the
> problem - you should have zombie processes if that were the case.
> 
> A few new questions that you have not answered (nor have they been
> asked in this thread): How much memory does your system have? Are you
> running a 32-bit or 64-bit Python? Is your Python process being run
> with any additional limitations via system commands (I don't know the
> command, but I know it exists; similarly, if launched from a third
> app, it could be placing limits)?
> 
> Chris
> 

8 Gig, 64 bit, no additional limitations (other than any that might be
imposed by IDLE). In this case the simulation does consume *a lot* of
memory, but that hasn't been the case when I've hit this in the past. I
suppose that could be the issue here. I'm currently seeing if I can
reproduce the problem after adding the p.communicate(), but it seems to
be using more memory than ever (dog slow and up to 5 Gig of swap). In
the meantime I'm going to try to refactor to reduce memory requirements
- and 32 Gig of DDR3 has been ordered. I'll also dig out some code that
generated the same problem before to see if I can reproduce it. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: geopandas bug ?

2017-01-14 Thread duncan smith
On 14/01/17 11:18, Xristos Xristoou wrote:
> i want to create a simple spatial join using geopandas, but i think
> geopandas has a bug?
> 
> 
> 
> geopandas code :
> 
> from geopandas import gpd
> import geopandas
> points = geopandas.GeoDataFrame.from_file('points.shp') # or geojson etc
> polys = geopandas.GeoDataFrame.from_file('polygons.shp')
> pointInPoly = gpd.sjoin(points, polys, how='left',op='within')
> 
> error :
> 
> Traceback (most recent call last):
>   File "/home/sarantis/testshapely/sumpointsinsidepolygon/testgeo.py", line 
> 7, in 
> pointInPoly = gpd.sjoin(points, polys, how='left',op='within')
>   File "/usr/local/lib/python2.7/dist-packages/geopandas/tools/sjoin.py", 
> line 57, in sjoin
> r_idx = np.concatenate(idxmatch.values)
> ValueError: need at least one array to concatenate
> 
> and if i change the imports with the some code :
> 
> import geopandas
> import pandas as pd
> import geopandas as gpd
> from geopandas import GeoDataFrame, read_file
> from geopandas.tools import sjoin
> from shapely.geometry import Point, mapping,shape
> import pandas as gpd
> 
> i take that error :
> 
> pointInPoly = gpd.sjoin(points, polys, how='left',op='within')
> AttributeError: 'module' object has no attribute 'sjoin'
> 
> 
> any idea why ?
> 


import geopandas as gpd
import pandas as gpd

Does pandas have an attribute 'sjoin'?

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: geopandas bug ?

2017-01-14 Thread duncan smith
On 14/01/17 14:59, Xristos Xristoou wrote:
> On Saturday, 14 January 2017 at 4:38:39 PM UTC+2, duncan smith wrote:
>> On 14/01/17 11:18, Xristos Xristoou wrote:
>>> i want to create a simple spatial join using geopandas, but i think
>>> geopandas has a bug?
>>>
>>>
>>>
>>> geopandas code :
>>>
>>> from geopandas import gpd
>>> import geopandas
>>> points = geopandas.GeoDataFrame.from_file('points.shp') # or geojson etc
>>> polys = geopandas.GeoDataFrame.from_file('polygons.shp')
>>> pointInPoly = gpd.sjoin(points, polys, how='left',op='within')
>>>
>>> error :
>>>
>>> Traceback (most recent call last):
>>>   File "/home/sarantis/testshapely/sumpointsinsidepolygon/testgeo.py", line 
>>> 7, in 
>>> pointInPoly = gpd.sjoin(points, polys, how='left',op='within')
>>>   File "/usr/local/lib/python2.7/dist-packages/geopandas/tools/sjoin.py", 
>>> line 57, in sjoin
>>> r_idx = np.concatenate(idxmatch.values)
>>> ValueError: need at least one array to concatenate
>>>
>>> and if i change the imports with the some code :
>>>
>>> import geopandas
>>> import pandas as pd
>>> import geopandas as gpd
>>> from geopandas import GeoDataFrame, read_file
>>> from geopandas.tools import sjoin
>>> from shapely.geometry import Point, mapping,shape
>>> import pandas as gpd
>>>
>>> i take that error :
>>>
>>> pointInPoly = gpd.sjoin(points, polys, how='left',op='within')
>>> AttributeError: 'module' object has no attribute 'sjoin'
>>>
>>>
>>> any idea why ?
>>>
>>
>>
>> import geopandas as gpd
>> import pandas as gpd
>>
>> Does pandas have an attribute 'sjoin'?
>>
>> Duncan
> 
> i don't know, i follow details, i am a newbie
> 

You import geopandas as gpd, then import pandas as gpd. So maybe you
think gpd refers to geopandas while it actually refers to pandas. I
don't know geopandas or pandas, but you should check your imports.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Treatment of NANs in the statistics module

2018-03-17 Thread duncan smith
On 16/03/18 23:16, Steven D'Aprano wrote:
> The bug tracker currently has a discussion of a bug in the median(), 
> median_low() and median_high() functions that they wrongly compute the 
> medians in the face of NANs in the data:
> 
> https://bugs.python.org/issue33084
> 
> I would like to ask people how they would prefer to handle this issue:
> 
> (1) Put the responsibility on the caller to strip NANs from their data. 
> If there is a NAN in your data, the result of calling median() is 
> implementation-defined. This is the current behaviour, and is likely to 
> be the fastest.
> 
> (2) Return a NAN.
> 
> (3) Raise an exception.
> 
> (4) median() should strip out NANs.
> 
> (5) All of the above, selected by the caller. (In which case, which would 
> you prefer as the default?)
> 
> 
> Thank you.
> 
> 
> 
> 

(2). A user can check for a returned NaN if necessary (so no real need
for (3)). Silently stripping out NaNs strikes me as a terrible idea. The
user should decide how NaNs should be dealt with. Optional arguments to
govern the handling of NaNs - OK as long as the default behaviour is to
return a NaN. There is no sensible default for handling NaNs (or missing
values). Cheers.
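
P.S. For callers who do want the NaNs removed it is a one-liner anyway, e.g.
(made-up data):

import math

data = [27.5, float('nan'), 21.0]
clean = [x for x in data if not math.isnan(x)]    # [27.5, 21.0]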

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Instance variables question

2018-04-16 Thread duncan smith
On 16/04/18 17:03, Irv Kalb wrote:
> I have been writing OOP code for many years in other languages and for the 
> past few years in Python.  I am writing new curriculum for a course on OOP in 
> Python.  In order to see how others are explaining OOP concepts, I have been 
> reading as many books and watching as many videos as I can.   I've been 
> watching some videos created by Dr. Chuck Severance in a series called 
> "Python For Everyone".  I think "Dr. Chuck" is an excellent teacher and I 
> think his videos are outstanding.  
> 
> Today I watched this video:   https://www.youtube.com/watch?v=b2vc5uzUfoE 
>   which is about 10 minutes 
> long.  In that video he gives a very basic overview of OOP and classes.  He 
> gives a demonstration using the following example:
> 
> class PartyAnimal():
>     x = 0
> 
>     def party(self):
>         self.x = self.x + 1
>         print('So far', self.x)
> 
> an = PartyAnimal()
> an.party()
> an.party()
> an.party()
> 
> # I added this line just to see what it would do
> print('Class variable', PartyAnimal.x)
> 
> 
> And the output is:
> 
> So far 1
> So far 2
> So far 3
> Class variable 0
> 
> But there is something there that seems odd.  My understanding is that the "x 
> = 0" would be defining a class variable, that can be shared by all 
> PartyAnimal objects.  But he explains that because x is defined between the 
> class statement and the "party" method, that this defines an instance 
> variable x.   That way, it can be used in the first line of the "party" 
> method as self.x to increment itself.   
> 
> At the end of the video, he creates two objects from the same class, and each 
> one gets its own self.x where each correctly starts at zero.  Again, I 
> expected x to be a class variable (accessible through PartyAnimal.x).  
> 
> When I want to create an instance variable and to be used later in other 
> methods, I do this:
> 
> class PartyAnimal():
>     def __init__(self):
>         self.x = 0
> 
>     def party(self):
>         self.x = self.x + 1
>         print('So far', self.x)
> 
> an = PartyAnimal()
> an.party()
> an.party()
> an.party()
> 
> That is, I would assign the instance variable in the __init__ method.  Both 
> approaches give the same results.
> 
> I'm certainly not trying to argue with Dr. Chuck.  I am trying to understand 
> his approach, but it's not clear to me why his code works.  Specifically, can 
> anyone explain how his "x = 0" turns x into an instance variable - while also 
> allowing the syntax for a class variable PartyAnimal.x to be used?
> 
> Thanks,
> 
> Irv
> 

My understanding of this:

x is a class variable.

Initially an instance has no instance variable self.x. So on the first
call to self.party the name 'x' is looked up in the class. This value is
incremented and the result is assigned to self.x. This is where the
instance variable x is created and set. On subsequent calls to
self.party there exists an instance variable x, so the name 'x' is not
looked up in the class.
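
A quick way to see this with the class above:

>>> an = PartyAnimal()
>>> 'x' in vars(an)       # no instance attribute yet
False
>>> an.party()            # reads PartyAnimal.x (0), then binds an.x = 1
So far 1
>>> 'x' in vars(an)       # the instance attribute now shadows the class one
True
>>> PartyAnimal.x         # the class variable itself is untouched
0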

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


plotly / dash export data

2018-06-09 Thread duncan smith
Hello,
  I have been trying to work out how to export data from a dash
application. I know little about html or javascript. I can upload data
into a table as per the example at
https://github.com/plotly/dash-core-components/pull/73. I can edit the
data in the table. But I can't find a way of exporting the edited data.
(Saving the data to file after each edit and providing a link to said
file might do it, but doesn't seem particularly clean). The data are
model parameters, and the user can edit these and run simulations. It
seems unreasonable to allow the user to experiment with different
parameters, but not allow them to export the parameters for future use.
Hopefully someone with better knowledge of dash (or html) can point me
in the right direction. TIA.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Help Needed : script weird result.

2018-09-01 Thread duncan smith
On 01/09/18 18:11, moha...@gmail.com wrote:
> All,
> 
> I'm trying to run this small script to find the lowest of the given array of 
> numbers. The script works fine for various combinations of inputs but fails in 
> a weird way for a particular set of inputs. Can anyone point out the mistake in 
> the script and explain the behaviour?
> 
> Script
> 
> x = input ("Enter the numbers separated by space and press ENTER :")
> x = x.split(" ")
> 
> def checkmin(arr):
>     lowest = arr[0]
>     for count in range(0,len(arr),1):
>         if arr[count] < lowest :
>             lowest = arr[count]
>         else :
>             pass
>         print (lowest)
>     return lowest
> 
> minimum = checkmin(x)
> print ("Lowest : {0}".format (minimum))
> 
> 
> Weird output is as below.
> 
> == RESTART: C:\Users\mohan\Desktop\temp.py ==
> Enter the numbers separated by space and press ENTER :5 90 63 82 59 24
> 5
> 5
> 5
> 5
> 5
> 24
> Lowest : 24
> 
> Regards
> Mohan C
> 


Compare,

>>> min(5, 90, 63, 82, 59, 24)
5
>>> min('5', '90', '63', '82', '59', '24')
'24'
>>>
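
input() gives you strings, and strings compare lexicographically, so '24'
sorts before '5'. Converting the items first, e.g.

x = [int(i) for i in x]

should give the answer you expect.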

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


numpy use cases for where and out

2018-09-11 Thread duncan smith
Hello,
  I'm writing some code for sparse arrays that is intended to pretty
much follow the numpy API. Because my arrays can have different default
values there is an issue with using the 'out' keyword argument for
functions. e.g. If I elementwise multiply 2 arrays with defaults a and
b, then the default of the product becomes a*b. Changing the default of
'out' to a*b (if that is not the existing default) requires pre-existing
values in 'out' equal to the previous default to be filled in. It seems
to be more hassle than it's worth, and basically eliminates sparsity.

So I am wondering if there are any compelling use cases for 'out'. At
the same time, I'm wondering about 'where'. It's less of an issue. But
given my dictionary of keys implementation, and returning new arrays
rather than views, I'm wondering if I could reasonably eliminate both. I
don't think I've ever used either argument when using numpy. So I'm
asking here in case there are use cases I've overlooked. TIA.
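
For reference, this is roughly how the two numpy arguments behave (a sketch):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
out = np.zeros_like(a)
np.add(a, 1, out=out)                # result written into an existing array
np.add(a, 1, where=a > 1, out=out)   # computed only where the mask is True;
                                     # masked-out positions keep the value in out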

Duncan

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Why do integers compare equal to booleans?

2018-11-16 Thread duncan smith
On 16/11/18 14:51, Steve Keller wrote:
> Why do the integers 0 and 1 compare equal to the boolean values False
> and True and all other integers to neither of them?
> 
> $ python3
> Python 3.5.2 (default, Nov 12 2018, 13:43:14)
> [GCC 5.4.0 20160609] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> 0 == False
> True
> >>> 1 == True
> True
> >>> 2 == False
> False
> >>> 2 == True
> False
> >>> -1 == False
> False
> >>> -1 == True
> False
> >>>
> 
> Since these are objects of different types I would expect they cannot
> be equal.  I know that 0 means false and != 0 means true in C, C++,
> etc. but in Python that surprises me.
> 
> Steve
> 

>>> isinstance(False, int)
True
>>> isinstance(True, int)
True
>>> False.real
0
>>> True.real
1
>>>

bool is a subclass of int, so False and True are essentially the integers 0
and 1 - at least in recent Pythons.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Injecting methods into instance / class

2018-12-02 Thread duncan smith
Hello,
  I have a lot of functions that take an instance of a particular
class as the first argument. I want to create corresponding methods in
the class. I have tried the following, which (when called from __init__)
creates the relevant methods in an instance (Python 3.6).


def init_methods(self):
    for func_name, method_name in [('add', '__add__'),
                                   ('subtract', '__sub__')]:
        setattr(self, method_name,
                types.MethodType(globals()[func_name], self))


The problem is that e.g.

x.__sub__(y)

works as expected, but

x - y

does not (because special methods are looked up in the class rather than
the instance).

I have tried to find examples of injecting methods into classes without
success. I have tried a few things that intuition suggested might work,
but didn't - like removing the first line above, dedenting and replacing
self with the class. This is only to save typing and make the code
cleaner, but I would like to get it right. Any pointers appreciated. TIA.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Injecting methods into instance / class

2018-12-02 Thread duncan smith
On 02/12/2018 18:26, Stefan Ram wrote:
> duncan smith  writes:
>> I have tried to find examples of injecting methods into classes without
> 
>   Wouldn't the normal approach be to just define a
>   class with your functions as instance methods?
> 
>   main.py
> 
> class C():
> def __init__( self, value=0 ):
> self.value = value
> def __sub__( self, other ):
> return C( self.value - other.value )
> def __str__( self ):
> return 'C '+ str( self.value )
> def yourfunctionhere( self ):
> pass
> 
> c = C(); c.value = 22
> d = C(); d.value = 27
> print( d - c )
> 
>   transcript
> 
> C 5
> 

What I'm trying to avoid is,


def __sub__(self, other):
    return subtract(self, other)

def __add__(self, other):
    return add(self, other)

def __mul__(self, other):
    return multiply(self, other)

def some_func(self, *args, **kwargs):
    return some_func(self, *args, **kwargs)


for many existing functions. Injecting them as instance methods was
probably not ideal, but I could get it working (apart from the special
methods). Ideally I'd like to take all the functions defined in a
separate module, create ordinary class methods from most of them (with
the same name), then create special methods from the remaining
functions. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Injecting methods into instance / class

2018-12-02 Thread duncan smith
On 02/12/2018 18:36, Peter Otten wrote:
> duncan smith wrote:
> 
>> Hello,
>>   I have a lot of functions that take an instance of a particular
>> class as the first argument. I want to create corresponding methods in
>> the class. I have tried the following, which (when called from __init__)
>> creates the relevant methods in an instance (Python 3.6).
>>
>>
>> def init_methods(self):
>>     for func_name, method_name in [('add', '__add__'),
>>                                    ('subtract', '__sub__')]:
>>         setattr(self, method_name,
>>                 types.MethodType(globals()[func_name], self))
>>
>>
>> The problem is that e.g.
>>
>> x.__sub__(y)
>>
>> works as expected, but
>>
>> x - y
>>
>> does not (because special methods are looked up in the class rather than
>> the instance).
>>
>> I have tried to find examples of injecting methods into classes without
>> success. I have tried a few things that intuition suggested might work,
>> but didn't - like removing the first line above, dedenting and replacing
>> self with the class. This is only to save typing and make the code
>> cleaner, but I would like to get it right. Any pointers appreciated. TIA.
> 
> You can do this in the __ init__() method:
> 
> $ cat tmp.py
> def add(self, other):
>     return 42 + other
> 
> def init_methods(self):
>     cls = type(self)
>     for methodname, funcname in [("__add__", "add")]:
>         func = globals()[funcname]
>         setattr(cls, methodname, func)
> 
> class Demo:
>     def __init__(self):
>         init_methods(self)
> 
> x = Demo()
> print(x + 100)
> $ python3 tmp.py
> 142
> 
> but the __add__() method (to take the example above) is shared by all 
> instances. You are either doing too much work by setting it again and again 
> in every Demo.__init__() call or you are changing the behaviour of 
> previously initialised Demo instances (if the last init_methods() sets the 
> method to a different function) and confuse yourself.
> 
> Have you considered a mix-in class instead? Like
> 
> $ cat tmp.py
> def add(self, other):
>     return 42 + other
> 
> class CommonMethods:
>     __add__ = add
> 
> class Demo(CommonMethods):
>     pass
> 
> x = Demo()
> print(x + 100)
> $ python3 tmp.py
> 142
> 
> This is much cleaner, and if you insist you can set up CommonMethods 
> programmatically with
> 
> CommonMethods = type(
>     "CommonMethods", (),
>     {m: globals()[f] for m, f in [("__add__", "add")]}
>     # without globals() and probably better: 
>     # {m: f for m, f in [("__add__", add)]}
> )
> 
> though personally I don't see the point.
> 

Ah, I could just bind them within the class,

mean = mean
max = max etc.

but I'd somehow convinced myself that didn't work. As the names are the
same I'd probably still prefer to go with something along the lines of

for name in ['mean', 'max', ...]:
# create a method
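
e.g. a self-contained toy sketch (not the real functions, just to show that
setting them on the class covers special methods too):

def add(self, other):
    return C(self.value + other.value)

def mean(self):
    return self.value

class C:
    def __init__(self, value):
        self.value = value

for name, func in [('mean', mean), ('__add__', add)]:
    setattr(C, name, func)

print((C(2) + C(3)).value)    # 5
print(C(2).mean())            # 2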

But at least I now have something that works, and I could try something
like your suggestion above to see if I prefer it.

The issue was that some of these "functions" are actually callable class
instances (an attempt to emulate numpy's ufuncs).


class Func:
    def __init__(self, op):
        # op is a numpy ufunc
        self.op = op

    def __call__(self, *args, **kwargs):
        # some logic here to decide
        # which method to call on the
        # basis of the number of arguments
        # and return values

    def one_one(self, x):
        # the method to call for a single argument
        # and single return value


Just rebinding these doesn't work, and these are the callables that need
corresponding special methods. I guess I probably tried binding one of
these first and convinced myself that it didn't work generally.

Many thanks (to everyone who has responded). Not exactly solved, but
I've made some progress. I might just write the methods normally, rather
than looking for a clever shortcut. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


sampling from frequency distribution / histogram without replacement

2019-01-14 Thread duncan smith
Hello,
  Just checking to see if anyone has attacked this problem before
for cases where the population size is unfeasibly large. i.e. The number
of categories is manageable, but the sum of the frequencies, N,
precludes simple solutions such as creating a list, shuffling it and
using the first n items to populate the sample (frequency distribution /
histogram).

I note that numpy.random.hypergeometric will allow me to generate a
sample when I only have two categories, and that I could probably
implement some kind of iterative / partitioning approach calling this
repeatedly. But before I do I thought I'd ask if anyone has tackled this
before. Can't find much on the web. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: sampling from frequency distribution / histogram without replacement

2019-01-14 Thread duncan smith
On 14/01/2019 22:59, Gregory Ewing wrote:
> duncan smith wrote:
>> Hello,
>>   Just checking to see if anyone has attacked this problem before
>> for cases where the population size is unfeasibly large.
> 
> The fastest way I know of is to create a list of cumulative
> frequencies, then generate uniformly distributed numbers and
> use a binary search to find where they fall in the list.
> That's O(log n) per sample in the size of the list once it's
> been set up.
> 

That's the sort of thing I've been thinking about. But once I'd found
the relevant category I'd need to reduce its frequency by 1 and
correspondingly update the cumulative frequencies. Alternatively, I
could add an extra step where I selected a unit from the relevant
category with probability equal to the proportion of non-sampled units
from the category. I could maybe set up an alias table and do something
similar.

The other thing I was thinking about was iterating through the
categories (ideally from largest frequency to smallest frequency),
generating the numbers to be sampled from the current category and the
remaining categories (using numpy.random.hypergeometric). With a few
large frequencies and lots of small frequencies that could be quite
quick (on average). Alternatively I could partition the categories into
two sets, generate the number to be sampled from each partition, then
partition the partitions etc. binary search style.

I suppose I'll try the both the alias table + rejection step and the
recursive partitioning approach and see how they turn out. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: sampling from frequency distribution / histogram without replacement

2019-01-15 Thread duncan smith
On 15/01/2019 02:41, Spencer Graves wrote:
> 
> 
> On 2019-01-14 18:40, duncan smith wrote:
>> On 14/01/2019 22:59, Gregory Ewing wrote:
>>> duncan smith wrote:
>>>> Hello,
>>>>    Just checking to see if anyone has attacked this problem before
>>>> for cases where the population size is unfeasibly large.
>>> The fastest way I know of is to create a list of cumulative
>>> frequencies, then generate uniformly distributed numbers and
>>> use a binary search to find where they fall in the list.
>>> That's O(log n) per sample in the size of the list once it's
>>> been set up.
>>>
>> That's the sort of thing I've been thinking about. But once I'd found
>> the relevant category I'd need to reduce its frequency by 1 and
>> correspondingly update the cumulative frequencies. Alternatively, I
>> could add an extra step where I selected a unit from the relevant
>> category with probability equal to the proportion of non-sampled units
>> from the category. I could maybe set up an alias table and do something
>> similar.
>>
>> The other thing I was thinking about was iterating through the
>> categories (ideally from largest frequency to smallest frequency),
>> generating the numbers to be sampled from the current category and the
>> remaining categories (using numpy.random.hypergeometric). With a few
>> large frequencies and lots of small frequencies that could be quite
>> quick (on average). Alternatively I could partition the categories into
>> two sets, generate the number to be sampled from each partition, then
>> partition the partitions etc. binary search style.
>>
>> I suppose I'll try the both the alias table + rejection step and the
>> recursive partitioning approach and see how they turn out. Cheers.
> 
> 
>   R has functions "sample" and "sample.int";  see
> "https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/sample";.
> You can call R from Python,
> "https://sites.google.com/site/aslugsguidetopython/data-analysis/pandas/calling-r-from-python";.
> 
> 
> 
>   These are in the "base" package.  I believe they have been an
> important part of the base R language almost since its inception and
> have been used extensively.  You'd have to work really hard to do
> better, in my judgment.
> 
> 
>       Spencer Graves
> 
> 
> DISCLAIMER:  I'm primarily an R guy and only use Python when I can't
> find a sensible way to do what I want in R.
>>
>> Duncan
> 

Despite being a statistician I'm primarily a Python guy and only use R
when I can't find a sensible way to do what I want in Python :-). The
problem with the R solution is that it doesn't seem to get round the
issue of having an unfeasibly large population size, but a reasonable
number of categories. It turns out I've already coded up a few reservoir
based algorithms for sampling without replacement that work with data
streams. So I can get round the space issues, but it still means
processing each data point rather than generating the sample frequencies
directly.
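
The basic reservoir idea (Algorithm R) is simple enough to sketch - this
isn't the actual code I'm referring to, just the gist, and it still
touches every unit in the stream:

import random

def reservoir_sample(stream, n):
    sample = []
    for i, item in enumerate(stream):
        if i < n:
            sample.append(item)
        else:
            j = random.randrange(i + 1)      # uniform over the items seen so far
            if j < n:
                sample[j] = item             # replace a random reservoir slot
    return sample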

After much searching all I've been able to find is the approach I
suggested above, iterating through the frequencies. My implementation:


import numpy

def hypgeom_variate(freqs, n):
    sample = [0] * len(freqs)
    nbad = sum(freqs)
    hypergeometric = numpy.random.hypergeometric
    for i, ngood in enumerate(freqs):
        nbad -= ngood
        x = hypergeometric(ngood, nbad, n, 1)[0]
        if x:
            sample[i] = x
            n -= x
        if not n:
            break
    return sample


Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: sampling from frequency distribution / histogram without replacement

2019-01-15 Thread duncan smith
On 15/01/2019 17:59, Ian Hobson wrote:
> Hi,
> 
> If I understand your problem you can do it in two passes through the
> population.
> 

The thing is that I start with the population histogram and I want to
generate a sample histogram. The population itself is too large to deal
with each population member individually.

> First, however, lets work through taking a sample of 2 from 7 to
> demonstrate the method.
> 
> Take the first element with a probability of 2/7. (Note 1).
> If you took it, you only want 1 more, so the probability drops to 1/6.
> If you didn't take it you want 2 from 6, so probability goes to 2/6.
> Take the next in the population with probability 1/6 or 2/6 as appropriate.
> Continue in similar manner until the probability
> drops to 0 (when you have your whole sample). When the
> denominator drops to zero the population is expired.
> 

Yes, based on the chain rule.
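
For what it's worth the scheme is easy to sketch (illustrative code; it
still needs a pass over every population member, which is exactly what I
can't afford here):

import random

def selection_sample(population, n):
    # take each unit with probability (still needed) / (still remaining)
    sample = []
    remaining = len(population)
    needed = n
    for item in population:
        if needed and random.random() < needed / remaining:
            sample.append(item)
            needed -= 1
        remaining -= 1
    return sample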

> Your first pass has to categorise the population and create your
> histogram, (index N) of frequencies Y(N).
> 
> Then divide up the sample size you wish to take into the histogram,
> giving array X(N) of sample sizes. X(N) need not be integer.
> 
> Then pass through the population again, for each entry:
>    Compute the N it falls in the histogram.
>    Take this entry as a sample with a probability of X(N)/Y(N).  Note 2.
>    If the element was taken, decrement X(N).
>    Decrement Y(N).
>    step to next element.
> 

Ah, I'm not quota sampling. I want a simple random sample without
replacement. I just happen to have the data in the form of categories
and frequencies, and that's the form of output that I want.

> Note 1 - In most languages you can generate a pseudo-random number
> with a uniform distribution from 0 to Y(N)-1. Take the element if it is
> in range 0 to floor(X(N))-1.
> 
> Note 2 - X(N) need not be integer, but you can't actually take a sample
> of 6.5 out of 1000. You will either run out of population having taken
> 6, or, if you take 7, the probability will go negative, and no more
> should be taken (treat as zero). The number taken in slot N will be
> floor(X(N)) or ceiling(X(N)). The average over many tries will however
> be X(N).

> Sorry I did not come back to you sooner. It took a while to drag the
> method out of my memory from some 35 years ago when I was working on an
> audit package. 

Well I'd already forgotten that I'd coded up something for srs without
replacement only a few years ago. In fact I coded up a few algorithms
(that I can't take credit for) that allowed weighted sampling with
replacement, and at least one that didn't require a priori knowledge of
the population size (a single pass algorithm). The problem is that they
also (mostly) require scanning the whole population.

> That was where I learned two things you may be interested
> in.

> 1) Auditors significantly under sample. Our Auditors actually took
> samples that were between 10% and 25% of what was necessary to support
> their claims.
> 

It's not just auditors :-(. The journals are full of claims based on
positive results from low powered tests or from "null fields". i.e. A
very high proportion are likely to be false positives (like 99% when it
comes to foodstuffs and the risks of various diseases). A while ago a
mate of mine (Prof. of statistics in Oz) told me about a student who
engineered a statistically significant result by copying and pasting her
data to double her sample size. That's no worse than some of the stuff
I've come across in the (usually medical) journals.

> 2) Very very few standard pseudo-random number generators are actually
> any good.
> 
> Regards
> 
> Ian

[snip]

BTW, the approach I'm currently using is also based on the chain rule.
Generate the number of sample units for the first category by sampling
from a (bivariate) hypergeometric. The number of sample units for the
second category (conditional on the number sampled for the first) is
another hypergeometric. Iterate until the full sample is obtained. It
helps to order the categories from largest to smallest. But I think I'll
get better performance by recursive partitioning (when I have the time
to try it). Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: sampling from frequency distribution / histogram without replacement

2019-01-18 Thread duncan smith
On 14/01/2019 20:11, duncan smith wrote:
> Hello,
>   Just checking to see if anyone has attacked this problem before
> for cases where the population size is unfeasibly large. i.e. The number
> of categories is manageable, but the sum of the frequencies, N,
> precludes simple solutions such as creating a list, shuffling it and
> using the first n items to populate the sample (frequency distribution /
> histogram).
> 
> I note that numpy.random.hypergeometric will allow me to generate a
> sample when I only have two categories, and that I could probably
> implement some kind of iterative / partitioning approach calling this
> repeatedly. But before I do I thought I'd ask if anyone has tackled this
> before. Can't find much on the web. Cheers.
> 
> Duncan
> 

After much tinkering I came up with the following:


import numpy as np
from itertools import chain

def hypgeom_variate(freqs, n):
    # recursive partitioning approach
    sample = [0] * len(freqs)
    cumsum = np.cumsum(list(chain([0], freqs)))
    if n > cumsum[-1]:
        raise ValueError('n cannot be greater than population size')
    hypergeometric = np.random.hypergeometric
    argslist = [(0, len(freqs), 0, cumsum[-1], n)]
    for i, k, ci, ck, m in argslist:
        if k == i + 1:
            sample[i] = m
        else:
            j = (i + k) // 2
            cj = cumsum[j]
            x = hypergeometric(cj - ci, ck - cj, m, 1)[0]
            y = m - x
            if x:
                argslist.append((i, j, ci, cj, x))
            if y:
                argslist.append((j, k, cj, ck, y))
    return sample
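
For example, with some made-up frequencies (the split is random, but the
counts always sum to n):

sample = hypgeom_variate([10**6, 250000, 50000], 20)
print(sample, sum(sample))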


Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


in a pickle

2019-03-06 Thread duncan smith
Hello,
  I've been trying to figure out why one of my classes can be
pickled but not unpickled. (I realise the problem is probably with the
pickling, but I get the error when I attempt to unpickle.)

A relatively minimal example is pasted below.


>>> import pickle
>>> class test(dict):
    def __init__(self, keys, shape=None):
        self.shape = shape
        for key in keys:
            self[key] = None

    def __setitem__(self, key, val):
        print (self.shape)
        dict.__setitem__(self, key, val)


>>> x = test([1,2,3])
None
None
None
>>> s = pickle.dumps(x)
>>> y = pickle.loads(s)
Traceback (most recent call last):
  File "", line 1, in 
y = pickle.loads(s)
  File "", line 8, in __setitem__
print (self.shape)
AttributeError: 'test' object has no attribute 'shape'


I have DuckDuckGo'ed the issue and have tinkered with __getnewargs__ and
__getnewargs_ex__ without being able to figure out exactly what's going
on. If I comment out the print call, then it seems to be fine. I'd
appreciate any pointers to the underlying problem. I have one or two
other things I can do to try to isolate the issue further, but I think
the example is perhaps small enough that someone in the know could spot
the problem at a glance. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: in a pickle

2019-03-06 Thread duncan smith
[snip]

Sorry, this is Python 3.6 on Linux.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: in a pickle

2019-03-06 Thread duncan smith
On 06/03/2019 16:14, duncan smith wrote:
> Hello,
>   I've been trying to figure out why one of my classes can be
> pickled but not unpickled. (I realise the problem is probably with the
> pickling, but I get the error when I attempt to unpickle.)
> 
> A relatively minimal example is pasted below.
> 
> 
>>>> import pickle
>>>> class test(dict):
>   def __init__(self, keys, shape=None):
>   self.shape = shape
>   for key in keys:
>   self[key] = None
> 
>   def __setitem__(self, key, val):
>   print (self.shape)
>   dict.__setitem__(self, key, val)
> 
>   
>>>> x = test([1,2,3])
> None
> None
> None
>>>> s = pickle.dumps(x)
>>>> y = pickle.loads(s)
> Traceback (most recent call last):
>   File "", line 1, in 
> y = pickle.loads(s)
>   File "", line 8, in __setitem__
> print (self.shape)
> AttributeError: 'test' object has no attribute 'shape'
> 
> 
> I have DUCkDuckGo'ed the issue and have tinkered with __getnewargs__ and
> __getnewargs_ex__ without being able to figure out exactly what's going
> on. If I comment out the print call, then it seems to be fine. I'd
> appreciate any pointers to the underlying problem. I have one or two
> other things I can do to try to isolate the issue further, but I think
> the example is perhaps small enough that someone in the know could spot
> the problem at a glance. Cheers.
> 
> Duncan
> 

OK, this seems to  be a "won't fix" bug dating back to 2003
(https://bugs.python.org/issue826897). The workaround,


class DictPlus(dict):
  def __init__(self, *args, **kwargs):
self.extra_thing = ExtraThingClass()
dict.__init__(self, *args, **kwargs)
  def __setitem__(self, k, v):
try:
  do_something_with(self.extra_thing, k, v)
except AttributeError:
  self.extra_thing = ExtraThingClass()
  do_something_with(self.extra_thing, k, v)
dict.__setitem__(self, k, v)
  def __setstate__(self, adict):
pass


doesn't work around the problem for me because I need the actual value
of self.shape from the original instance. But I only need it for sanity
checking, and under the assumption that the original instance was valid,
I don't need to do this when unpickling. I haven't managed to find a
workaround that exploits that (yet?). Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: in a pickle

2019-03-06 Thread duncan smith
On 06/03/2019 20:24, Peter Otten wrote:
> duncan smith wrote:
> 
>> On 06/03/2019 16:14, duncan smith wrote:
>>> Hello,
>>>   I've been trying to figure out why one of my classes can be
>>> pickled but not unpickled. (I realise the problem is probably with the
>>> pickling, but I get the error when I attempt to unpickle.)
>>>
>>> A relatively minimal example is pasted below.
>>>
>>>
>>>>>> import pickle
>>>>>> class test(dict):
>>> def __init__(self, keys, shape=None):
>>> self.shape = shape
>>> for key in keys:
>>> self[key] = None
>>>
>>> def __setitem__(self, key, val):
>>> print (self.shape)
>>> dict.__setitem__(self, key, val)
>>>
>>>
>>>>>> x = test([1,2,3])
>>> None
>>> None
>>> None
>>>>>> s = pickle.dumps(x)
>>>>>> y = pickle.loads(s)
>>> Traceback (most recent call last):
>>>   File "", line 1, in 
>>> y = pickle.loads(s)
>>>   File "", line 8, in __setitem__
>>> print (self.shape)
>>> AttributeError: 'test' object has no attribute 'shape'
>>>
>>>
>>> I have DUCkDuckGo'ed the issue and have tinkered with __getnewargs__ and
>>> __getnewargs_ex__ without being able to figure out exactly what's going
>>> on. If I comment out the print call, then it seems to be fine. I'd
>>> appreciate any pointers to the underlying problem. I have one or two
>>> other things I can do to try to isolate the issue further, but I think
>>> the example is perhaps small enough that someone in the know could spot
>>> the problem at a glance. Cheers.
>>>
>>> Duncan
>>>
>>
>> OK, this seems to  be a "won't fix" bug dating back to 2003
>> (https://bugs.python.org/issue826897). The workaround,
>>
>>
>> class DictPlus(dict):
>>   def __init__(self, *args, **kwargs):
>> self.extra_thing = ExtraThingClass()
>> dict.__init__(self, *args, **kwargs)
>>   def __setitem__(self, k, v):
>> try:
>>   do_something_with(self.extra_thing, k, v)
>> except AttributeError:
>>   self.extra_thing = ExtraThingClass()
>>   do_something_with(self.extra_thing, k, v)
>> dict.__setitem__(self, k, v)
>>   def __setstate__(self, adict):
>> pass
>>
>>
>> doesn't work around the problem for me because I need the actual value
>> of self.shape from the original instance. But I only need it for sanity
>> checking, and under the assumption that the original instance was valid,
>> I don't need to do this when unpickling. I haven't managed to find a
>> workaround that exploits that (yet?). Cheers.
> 
> I've been playing around with __getnewargs__(), and it looks like you can 
> get it to work with a custom __new__(). Just set the shape attribute there 
> rather than in __init__():
> 
> $ cat pickle_dict_subclass.py 
> import pickle
> 
> 
> class A(dict):
> def __new__(cls, keys=(), shape=None):
> obj = dict.__new__(cls)
> obj.shape = shape
> return obj
> 
> def __init__(self, keys=(), shape=None):
> print("INIT")
> for key in keys:
> self[key] = None
> print("EXIT")
> 
> def __setitem__(self, key, val):
> print(self.shape, ": ", key, " <-- ", val, sep="")
> super().__setitem__(key, val)
> 
> def __getnewargs__(self):
> print("GETNEWARGS")
> return ("xyz", self.shape)
> 
> x = A([1, 2, 3], shape="SHAPE")
> x["foo"] = "bar"
> print("pickling:")
> s = pickle.dumps(x)
> print("unpickling:")
> y = pickle.loads(s)
> print(y)
> $ python3 pickle_dict_subclass.py 
> INIT
> SHAPE: 1 <-- None
> SHAPE: 2 <-- None
> SHAPE: 3 <-- None
> EXIT
> SHAPE: foo <-- bar
> pickling:
> GETNEWARGS
> unpickling:
> SHAPE: 1 <-- None
> SHAPE: 2 <-- None
> SHAPE: 3 <-- None
> SHAPE: foo <-- bar
> {1: None, 2: None, 3: None, 'foo': 'bar'}
> 
> It's not clear to me how the dict items survive when they are not included 
> in the __getnewargs__() result, but apparently they do.
> 
> 
> 

Thanks Peter. The docs for pickle say "When a class instance is
unpickled, its __init__() method is usually not invoked."

Re: in a pickle

2019-03-06 Thread duncan smith
On 07/03/2019 00:18, Ethan Furman wrote:
> On 03/06/2019 10:30 AM, duncan smith wrote:
> 
>>  I've been trying to figure out why one of my classes can be
>> pickled but not unpickled. (I realise the problem is probably with the
>> pickling, but I get the error when I attempt to unpickle.)
>>
>> A relatively minimal example is pasted below.
>>
>>
>> --> import pickle
>> --> class test(dict):
>> ...    def __init__(self, keys, shape=None):
>> ...    self.shape = shape
>> ...    for key in keys:
>> ...    self[key] = None
>> ...    def __setitem__(self, key, val):
>> ...    print (self.shape)
>> ...    dict.__setitem__(self, key, val)
>> ...
>> --> x = test([1,2,3])
>> None
>> None
>> None
>> --> s = pickle.dumps(x)
>> --> y = pickle.loads(s)
>> Traceback (most recent call last):
>>    File "", line 1, in 
>>  y = pickle.loads(s)
>>    File "", line 8, in __setitem__
>>  print (self.shape)
>> AttributeError: 'test' object has no attribute 'shape'
> 
> It seems to me the problem only exists because you are trying to print
> `self.shape` -- stop printing it, and everything works normally.
> 
> What problem are you actually trying to solve?
> 
> -- 
> ~Ethan~
> 

Accessing self.shape at that point in the code. In my actual code I am
updating a secondary data structure and doing some sanity checking at
that point. That fails because self.shape doesn't exist. Peter's post
makes it clear that although it fails in my __setitem__() it is not via
calling my __init__(). It must be being called by pickle.loads(). So on
the basis that pickle.loads() will restore the secondary data structure
correctly, and having no need for sanity checking assuming the original
object was valid, then the problem almost goes away. e.g. I can catch
the AttributeError and just call dict.__setitem__(self, key, val). That
I still need to test. I might yet employ Peter's solution (for when I
thought I needed access to self.shape and other attributes, which I now
doubt) which is to define __new__() so that the relevant attributes do
exist when they are required.
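
i.e. For the catch-the-AttributeError route, something like this minimal
sketch (illustrative, and only reasonable because the checking can be
skipped when unpickling a previously valid instance):

import pickle

class test(dict):
    def __init__(self, keys, shape=None):
        self.shape = shape
        for key in keys:
            self[key] = None

    def __setitem__(self, key, val):
        try:
            self.shape   # sanity checking / secondary updates would go here
        except AttributeError:
            # only reached while pickle rebuilds the items before the
            # instance dict (including shape) has been restored
            pass
        dict.__setitem__(self, key, val)

x = test([1, 2, 3])
y = pickle.loads(pickle.dumps(x))
print(y, y.shape)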

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: doctest random output?

2019-04-17 Thread duncan smith
On 28/08/2017 20:17, Leam Hall wrote:
> On 08/28/2017 11:40 AM, Dennis Lee Bieber wrote:
> 
> ... a bunch of good stuff ...
> 
> I'm (re-)learning python and just trying make sure my function works.
> Not at the statistical or cryptographic level.   :)
> 
> Thanks!
> 
> Leam

If it's supposed to generate values that follow a particular
distribution, and they don't, then it doesn't work. I had a bunch of
functions for generating values from various distributions. My line
manager told me to just set the seed so that the outputs were
deterministic. Worse than no test at all. It relied on my original
implementation (that generated the values for comparison) working, and
only tested if the implementation (of random.random() or my code) had
changed. So I ignored my boss and simply generated samples of values and
tested using a KS goodness of fit test. The tests should fail 5% of the
time. Run them a few times and check that no individual test is failing
consistently. I don't see how you can do much better than that. Of
course, this doesn't relate directly to doctest.
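
As a rough sketch of the kind of test I mean (assuming scipy is
available; the uniform case, with made-up names):

import random
from scipy import stats

def ks_uniform_ok(n=10000, alpha=0.05):
    sample = [random.random() for _ in range(n)]     # values under test
    statistic, p_value = stats.kstest(sample, 'uniform')
    return p_value > alpha                           # fails about 5% of the time

# run a handful of times; only consistent failure suggests a real problem
print([ks_uniform_ok() for _ in range(5)])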

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Import module from a different subdirectory

2019-05-16 Thread duncan smith
On 16/05/2019 22:50, Rich Shepard wrote:
> I'm developing a Python3 application using Python3-3.7.3 and
> virtualenv-16.5.0 on a Slackware-14.2 host.
> 
> The project directory contains subdirectories, including gui/ (with the
> tkinter views) and classes/ with the SQLAlchemy model.py.
> 
> Within the gui/ directory as the cwd testing the code for a view needs to
> import the 'model' module from the class/ subdirectory and I have not been
> able to use the correct syntax to import that module. The view module
> starts
> with:
> 
> from classes import model as m
> import data_entry_validators
> from commonDlgs import LabelInput
> 
> import tkinter as tk
> from tkinter import ttk
> from datetime import datetime
> 
> When I try to display this UI view I get an error:
> 
> $ python3 test_act_de.py Traceback (most recent call last):
>   File "test_act_de.py", line 1, in 
>     from classes import model as m
> ModuleNotFoundError: No module named 'classes'
> 
> Of course, 'classes' is a subdirectory not a module. My web searches have
> not found the proper syntax to import the 'model' module from the separate
> subdirectory classes and I need to learn how to do this.
> 
> Regards,
> 
> Rich

You could make the subdirectories Python packages. Google (or better
DuckDuckGo) is your friend.
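
e.g. Something along these lines (an illustrative layout, not tested
against your project):

project/
    gui/
        __init__.py
        test_act_de.py
    classes/
        __init__.py
        model.py

Then run from the project directory with

python3 -m gui.test_act_de

and "from classes import model as m" should resolve.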

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


duck typing / library functions

2019-09-12 Thread duncan smith
Hello,
  The problem I have relates to writing algorithmic code that can
handle types with a given API, but where some of the required
functionality is implemented as library functions rather than methods.
Specifically I have code that uses numpy arrays, but I want to adapt it
to use sparse arrays that have a (sufficiently) similar API.

For the most part the existing code will work without alteration, but
wherever I have a numpy function (rather than array method) I have an
issue. In some cases I can find a corresponding method, but not for e.g.
numpy.isnan. Roughly the structure of the existing code is:



import numpy


class Algorithm(object):
    def __init__(self, arrays):
        self.arrays = arrays

    def run(self):
        shape = a shape calculated from info in self.arrays
        arr = numpy.ones(shape)
        # other stuff using array methods and numpy functions



I could replace "numpy" above with "sparse" and it would work fine (as
long as the arrays passed to __init__ were of the appropriate type). I
could move the import inside the class and make it conditional on the
types in self.arrays. I could potentially replace the function calls
with something based only on method calls (although that turns out to be
awkward for some functions). But I'm thinking about something more
introspective, maybe identifying the relevant module / package from the
array type and importing the relevant module on the fly? Any ideas? TIA.
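
e.g. One introspective possibility (just a sketch; whether the other
module actually provides an equivalent for every function, isnan
included, is a separate question):

import importlib
import numpy

def array_module(arr):
    # 'numpy' for numpy arrays, 'sparse' for the sparse package's arrays, etc.
    return importlib.import_module(type(arr).__module__.split('.')[0])

a = numpy.ones((3, 3))
mod = array_module(a)
print(mod.isnan(a).any())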

Duncan

p.s. Ideally the solution should work with Python 3 and recent versions
of Python 2.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Instantiating sub-class from super

2019-10-15 Thread duncan smith
On 15/10/2019 21:36, DL Neil wrote:
> On 16/10/19 12:38 AM, Rhodri James wrote:
>> On 14/10/2019 21:55, DL Neil via Python-list wrote:
> ...
> 
>>> It seemed better (at the design-level) to have Man( Person ) and
>>> Woman( Person ) sub-classes to contain the pertinent attributes,
>>> source more detailed and specific questions, and collect such data;
>>> by gender.
>>
>> Knowing a lot of Trans people as I do, may I gently suggest that this
>> solution will find many and varied ways of coming back to bite you?
> 
> 
> [with more respect than the previous humor]
> 
> 
> You are absolutely correct. The use-case was (over-)simplified, per
> list-advice.
> 
> That said, if a "trans" person has ovaries or testes (for example) then
> a non-traditional sexual identification is irrelevant - for medical
> purposes. Diseases in those areas (and now I'm a long way from a
> research questionnaire and from Python - but this is roughly how it was
> explained to me) still apply, and sadly, may in-fact be considerably
> complicated by any medical processes that may have contributed to a
> transition.
> 
> So, yes, the "label" is unimportant - except to politicians and
> statisticians, who want precise answers from vague collections of
> data... (sigh!)
> 

[snip]

No not (real) statisticians. People often want us to provide precise
answers, but they don't often get them.

"It ain’t what you don’t know that gets you into trouble. It’s what you
know for sure that just ain’t so." (Mark Twain - perhaps)

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Instantiating sub-class from super

2019-10-16 Thread duncan smith
On 16/10/2019 04:41, DL Neil wrote:
> On 16/10/19 1:55 PM, duncan smith wrote:
>> On 15/10/2019 21:36, DL Neil wrote:
>>> On 16/10/19 12:38 AM, Rhodri James wrote:
>>>> On 14/10/2019 21:55, DL Neil via Python-list wrote:
>>> ...
>>> So, yes, the "label" is unimportant - except to politicians and
>>> statisticians, who want precise answers from vague collections of
>>> data... (sigh!)
>>>
>>
>> [snip]
>>
>> No not (real) statisticians. People often want us to provide precise
>> answers, but they don't often get them.
>>
>> "It ain’t what you don’t know that gets you into trouble. It’s what you
>> know for sure that just ain’t so." (Mark Twain - perhaps)
> 
> +1
> 
> Although, you've undoubtedly heard people attempt to make claims of
> having 'accurate figures' (even, "that came from Stats") when you told
> them that the limitations and variations rendered the exercise laughable...
> 
> My favorite (of the moment) is a local computer store who regularly
> offer such gems as: (underneath the sales (web-) page for an upmarket
> *desktop* computer)  "people who bought this also bought" followed by at
> least two portable PC carry cases. They must be rather large carry-bags!
> (along with such surprises as keyboard, mouse, ...)
> 
> This morning I turned-down a study for a political group. One study has
> already been completed and presented. The antagonist wanted an A/B
> comparison (backing his 'side', of course). I mildly suggested that I
> would do it, if he'd also pay me to do an A/B/C study, where 'C' was a
> costing - the economic opportunity cost of 'the people' waiting for 'the
> government' to make a decision - (and delaying that decision by waiting
> for "study" after "study" - The UK and their (MPs') inability to decide
> "Brexit" a particularly disastrous illustration of such)
> 
> 
> Sorry, don't want to incur the anger of the list-gods - such
> calculations would be performed in Python (of course)

Clearly, all such analyses should be done in Python. Thank God for rpy2,
otherwise I'd have to write R code. It's bad enough having to read it
occasionally to figure out what's going on under the hood (I like
everything about R - except the syntax).

I have too many examples of people ignoring random variation, testing
hypotheses on the data that generated the hypotheses, shifting the
goalposts, using cum / post hoc ergo propter hoc reasoning, assuming
monocausality etc. In some areas these things have become almost
standard practice (and they don't really hinder publication as long as
they are even moderately well hidden). Of course, it's often about
policy promotion, and the economic analyses can be just as bad (e.g.
comparing the negative impacts of a policy on the individual with the
positive impacts aggregated over a very large population). And if it's
about policy promotion a press release is inevitable. So we just need to
survey the news media for specific examples. Unfortunately there's no
reliable service for telling us what's crap and what isn't. (Go on,
somebody pay me, all my data processing / re-analysis will be in Python
;-).)

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Instantiating sub-class from super

2019-10-16 Thread duncan smith
On 16/10/2019 19:52, MRAB wrote:
> On 2019-10-16 19:43, duncan smith wrote:
>> On 16/10/2019 04:41, DL Neil wrote:
>>> On 16/10/19 1:55 PM, duncan smith wrote:
>>>> On 15/10/2019 21:36, DL Neil wrote:
>>>>> On 16/10/19 12:38 AM, Rhodri James wrote:
>>>>>> On 14/10/2019 21:55, DL Neil via Python-list wrote:
>>>>> ...
>>>>> So, yes, the "label" is unimportant - except to politicians and
>>>>> statisticians, who want precise answers from vague collections of
>>>>> data... (sigh!)
>>>>>
>>>>
>>>> [snip]
>>>>
>>>> No not (real) statisticians. People often want us to provide precise
>>>> answers, but they don't often get them.
>>>>
>>>> "It ain’t what you don’t know that gets you into trouble. It’s what you
>>>> know for sure that just ain’t so." (Mark Twain - perhaps)
>>>
>>> +1
>>>
>>> Although, you've undoubtedly heard people attempt to make claims of
>>> having 'accurate figures' (even, "that came from Stats") when you told
>>> them that the limitations and variations rendered the exercise
>>> laughable...
>>>
>>> My favorite (of the moment) is a local computer store who regularly
>>> offer such gems as: (underneath the sales (web-) page for an upmarket
>>> *desktop* computer)  "people who bought this also bought" followed by at
>>> least two portable PC carry cases. They must be rather large carry-bags!
>>> (along with such surprises as keyboard, mouse, ...)
>>>
>>> This morning I turned-down a study for a political group. One study has
>>> already been completed and presented. The antagonist wanted an A/B
>>> comparison (backing his 'side', of course). I mildly suggested that I
>>> would do it, if he'd also pay me to do an A/B/C study, where 'C' was a
>>> costing - the economic opportunity cost of 'the people' waiting for 'the
>>> government' to make a decision - (and delaying that decision by waiting
>>> for "study" after "study" - The UK and their (MPs') inability to decide
>>> "Brexit" a particularly disastrous illustration of such)
>>>
>>>
>>> Sorry, don't want to incur the anger of the list-gods - such
>>> calculations would be performed in Python (of course)
>>
>> Clearly, all such analyses should be done in Python. Thank God for rpy2,
>> otherwise I'd have to write R code. It's bad enough having to read it
>> occasionally to figure out what's going on under the hood (I like
>> everything about R - except the syntax).
>>  > I have too many examples of people ignoring random variation, testing
>> hypotheses on the data that generated the hypotheses, shifting the
>> goalposts, using cum / post hoc ergo propter hoc reasoning, assuming
>> monocausality etc. In some areas these things have become almost
>> standard practice (and they don't really hinder publication as long as
>> they are even moderately well hidden). Of course, it's often about
>> policy promotion, and the economic analyses can be just as bad (e.g.
>> comparing the negative impacts of a policy on the individual with the
>> positive impacts aggregated over a very large population). And if it's
>> about policy promotion a press release is inevitable. So we just need to
>> survey the news media for specific examples. Unfortunately there's no
>> reliable service for telling us what's crap and what isn't. (Go on,
>> somebody pay me, all my data processing / re-analysis will be in Python
>> ;-).)
>>
> Even when using Python, you have to be careful:
> 
> Researchers find bug in Python script may have affected hundreds of studies
> https://arstechnica.com/information-technology/2019/10/chemists-discover-cross-platform-python-scripts-not-so-cross-platform/
> 

Yes, the problem of standing on the shoulders of others (not necessarily
giants). I assume the problematic code was tested on an OS that happened
to sort the results as required. Years ago I had a similar issue with a
race condition, but we caught it because I developed the code under
Linux and everybody else on the project ran it under Windows (where the
bug surfaced). Note to self: before I do any more parallel programming
check out how to test for race conditions.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Instantiating sub-class from super

2019-10-19 Thread duncan smith
On 18/10/2019 23:57, DL Neil wrote:
> On 17/10/19 7:52 AM, MRAB wrote:
>> On 2019-10-16 19:43, duncan smith wrote:
>>> On 16/10/2019 04:41, DL Neil wrote:
>>>> On 16/10/19 1:55 PM, duncan smith wrote:
>>>>> On 15/10/2019 21:36, DL Neil wrote:
>>>>>> On 16/10/19 12:38 AM, Rhodri James wrote:
>>>>>>> On 14/10/2019 21:55, DL Neil via Python-list wrote:
>>>>>> ...
>>>>>> So, yes, the "label" is unimportant - except to politicians and
>>>>>> statisticians, who want precise answers from vague collections of
>>>>>> data... (sigh!)
>>>>>>
>>>>>
>>>>> [snip]
>>>>>
>>>>> No not (real) statisticians. People often want us to provide precise
>>>>> answers, but they don't often get them.
>>>>>
>>>>> "It ain’t what you don’t know that gets you into trouble. It’s what
>>>>> you
>>>>> know for sure that just ain’t so." (Mark Twain - perhaps)
>>>>
>>>> +1
>>>>
>>>> Although, you've undoubtedly heard people attempt to make claims of
>>>> having 'accurate figures' (even, "that came from Stats") when you told
>>>> them that the limitations and variations rendered the exercise
>>>> laughable...
>>>>
>>>> My favorite (of the moment) is a local computer store who regularly
>>>> offer such gems as: (underneath the sales (web-) page for an upmarket
>>>> *desktop* computer)  "people who bought this also bought" followed
>>>> by at
>>>> least two portable PC carry cases. They must be rather large
>>>> carry-bags!
>>>> (along with such surprises as keyboard, mouse, ...)
>>>>
>>>> This morning I turned-down a study for a political group. One study has
>>>> already been completed and presented. The antagonist wanted an A/B
>>>> comparison (backing his 'side', of course). I mildly suggested that I
>>>> would do it, if he'd also pay me to do an A/B/C study, where 'C' was a
>>>> costing - the economic opportunity cost of 'the people' waiting for
>>>> 'the
>>>> government' to make a decision - (and delaying that decision by waiting
>>>> for "study" after "study" - The UK and their (MPs') inability to decide
>>>> "Brexit" a particularly disastrous illustration of such)
>>>>
>>>>
>>>> Sorry, don't want to incur the anger of the list-gods - such
>>>> calculations would be performed in Python (of course)
>>>
>>> Clearly, all such analyses should be done in Python. Thank God for rpy2,
>>> otherwise I'd have to write R code. It's bad enough having to read it
>>> occasionally to figure out what's going on under the hood (I like
>>> everything about R - except the syntax).
>>>  > I have too many examples of people ignoring random variation, testing
>>> hypotheses on the data that generated the hypotheses, shifting the
>>> goalposts, using cum / post hoc ergo propter hoc reasoning, assuming
>>> monocausality etc. In some areas these things have become almost
>>> standard practice (and they don't really hinder publication as long as
>>> they are even moderately well hidden). Of course, it's often about
>>> policy promotion, and the economic analyses can be just as bad (e.g.
>>> comparing the negative impacts of a policy on the individual with the
>>> positive impacts aggregated over a very large population). And if it's
>>> about policy promotion a press release is inevitable. So we just need to
>>> survey the news media for specific examples. Unfortunately there's no
>>> reliable service for telling us what's crap and what isn't. (Go on,
>>> somebody pay me, all my data processing / re-analysis will be in Python
>>> ;-).)
>>>
>> Even when using Python, you have to be careful:
>>
>> Researchers find bug in Python script may have affected hundreds of
>> studies
>> https://arstechnica.com/information-technology/2019/10/chemists-discover-cross-platform-python-scripts-not-so-cross-platform/
> 
> 
> 
> I think both of our 'Python' comments were made tongue-in-cheek. Sadly
> the tool won't guarantee the result...
> 
> 
> At my first research project, before I'd even

Re: stuck on time

2019-12-08 Thread duncan smith
On 07/12/2019 17:48, RobH wrote:
> I am trying to do this project on a pi zero:
> 
> http://frederickvandenbosch.be/?p=1365
> 
> After overcoming a few errors, I now have the display working and the
> start of the code showing on the display, that being the time.
> 
> It doesn't move on to the next part of the code ie, no rectangle drawn
> def display_time():
> # Collect current time and date
> if(time_format):
>     current_time = time.strftime("%I:%M")<<< stays at this line
> else:
>     current_time = time.strftime("%H:%M")
>    
> current_date = time.strftime("%d/%m/%Y")
> 
> # Clear image buffer by drawing a black filled box
> draw.rectangle((0,0,width,height), outline=0, fill=0
> 
> In the original code you can see that the author used the
> Minecraftia.ttf font, but I had to download the Minecraftia-Regular.ttf
> font. The link the author gave took me to that.
> 
> The Minecraft-Regular.ttf font produced errors when the code was run,
> saying at the end of the error that it cannot open resource.
> 
> I then used the NotoSerif-Regular.ttf font which let the code start
> without error, but as above it is stuck at the time.

One thing you certainly need to do is add a closing parenthesis to the last line.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


distinguishable matplotlib colours / symbols / line styles

2019-12-16 Thread duncan smith
Hello,
  Not really specific to Python or matplotlib (but that's what I'm
using). I'm looking for a good combination of colours and symbols for
scatter plots, and combination of colours and line styles for line
plots. Too often I find myself producing these when I don't know whether
they will end up being printed in colour or greyscale. It would be a lot
easier if I could produce decent plots that would work well in either
case. I can find various stuff on colour palettes, but pretty much
nothing on symbols or line styles. Is anybody aware of an approach that
works well? I'm aware of issues with colour blindness and RGB versus
CMYK. Ideally I'd have something that I could just use that would deal
with all these issues. TIA.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: distinguishable matplotlib colours / symbols / line styles

2019-12-16 Thread duncan smith
On 16/12/2019 21:08, DL Neil wrote:
> On 17/12/19 5:19 am, Chris Angelico wrote:
>> On Tue, Dec 17, 2019 at 3:16 AM duncan smith 
>> wrote:
>>>
>>> Hello,
>>>    Not really specific to Python or matplotlib (but that's what I'm
>>> using). I'm looking for a good combination of colours and symbols for
>>> scatter plots, and combination of colours and line styles for line
>>> plots. Too often I find myself producing these when I don't know whether
>>> they will end up being printed in colour or greyscale. It would be a lot
>>> easier if I could produce decent plots that would work well in either
>>> case. I can find various stuff on colour palettes, but pretty much
>>> nothing on symbols or line styles. Is anybody aware of an approach that
>>> works well? I'm aware of issues with colour blindness and RGB versus
>>> CMYK. Ideally I'd have something that I could just use that would deal
>>> with all these issues. TIA.
>>>
>>
>> I'd recommend looking at the Web Content Accessibility Guidelines
>> published by the W3C; there are a number of tools out there that are
>> designed to help you pick colours for a web page, and the same sorts
>> of rules will apply here, I think.
>>
>> Also, thank you for even *thinking* about this. A lot of people don't. :)
> 
> +1
> 
> We spend a lot of time teaching this topic (non-Python courses). It
> receives a lot of often highly-polarised comments/discussion. Many folk
> have their 'eyes opened' to an issue which has not affected them
> personally. Some even have to be informed that it is a legal obligation
> in their jurisdiction. However, it also receives the highest number of
> 'why do I have to learn this stuff' complaints...
> 
> I learned (way back) that the incidence of "color blindness" is far
> higher than I had imagined. Secondly, that it affects males more than
> females. Thirdly, that calling it "blindness" is a bit of a misnomer,
> because whilst people often can't see red 'correctly' (most common
> symptom), they do see something (it varies). Which is why they are
> permitted to drive vehicles (traffic lights: red, amber/yellow, green -
> and arrows; plus stop/brake lights), but why many smart-phone apps/web
> pages which encode information-relevance (red is 'wrong' and green/blue
> is acceptable) can become almost unusable (without other cues).
> 
> Those key-words: "accessibility guidelines" will yield a swag of useful
> tools - ignore the ones which are basically 'help choose the color of my
> web page/color palette, because they are often aiming (only) for
> 'pretty'. The best tools enumerate the efficacy of fg/bg
> color-combinations, allowing one to experiment; and will enumerate
> grey-scale variation/comparisons.
> 
> 
> Hmmm, note to self (you've inspired me to specifically review/critique
> the printing-from-screen action): what happens when we take a
> color-checked screen display and print same but end-up viewing it as
> monochrome/grey-scale output? Probably not a main-stream demand, but
> worth tossing at the WCAG experts...

A paper I recently published contained a number of colour images (graphs
- of the nodes and edges variety - and line plots). But it turned out
that the funding we used to have for colour printing charges no longer
exists. So the print version is greyscale with links to online colour
images. Anyone accessing the paper online will be able to download a pdf
with colour images. I don't yet know whether there will be funding for
colour printing charges for the paper I'm currently working on, yet I'm
generating plots. So something that would work for all these possible
end points would be ideal. That brings up the issue of distinguishable
symbols / markers (for scatter plots) and distinguishable line styles
(for line plots). i.e. They might help for greyscale without being too
distracting for colour images. So a single image would do for all the
possible end points. I'm not 100% happy about readers of the print
journal (probably) having to go online to make sense of my recent paper,
but that's done now.

BTW, I've gone with the Seaborn qualitative 'colorblind' palette for
now. Still tinkering with different markers / marker sizes / line
weights etc. Cheers.
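
e.g. Along these lines (a sketch of the sort of thing I'm playing with,
not the actual plots):

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

colours = sns.color_palette('colorblind', 4)
markers = ['o', 's', '^', 'D']
linestyles = ['-', '--', '-.', ':']

x = np.linspace(0, 10, 21)
for i, (col, mark, ls) in enumerate(zip(colours, markers, linestyles)):
    plt.plot(x, np.sin(x + i), color=col, marker=mark, linestyle=ls,
             label='series %d' % i)
plt.legend()
plt.show()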

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: What I learned today

2020-02-14 Thread duncan smith
On 14/02/2020 23:21, Dan Stromberg wrote:
> On Fri, Feb 14, 2020 at 3:10 PM Stefan Ram  wrote:
> 
>>   By trial and error (never read documentation!) I found
>>   that you can count the number of e's in a text by just
>>
>> Counter( text ).get( 'e' )
>>
>>   (after »from collections import Counter« that is).
>>
> Even simpler, though not suitable for all situations:
> "abcbdbe".count("b")
> 

[snip]

And by far the quickest way (that I've found) for counting the number of
set bits in an int.

>>> def popcount(n):
    cnt = 0
    while n:
        n &= n-1
        cnt += 1
    return cnt

>>> import timeit
>>> timeit.timeit("popcount(19847998494279)", "from __main__ import
popcount", number=1)

0.034410387044772506
>>> timeit.timeit("bin(19847998494279).count('1')", number=1)

0.004501901799812913
>>>

OK, things turn around for large integers with very few set bits. But
for my use case bin(n).count('1') wins hands down (which surprised me a
little).

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


queue versus list

2020-03-19 Thread duncan smith
Hello,
  I have generator code along the following lines,


Q = queue.Queue()
Q.put(x)
while not Q.empty():
    x = Q.get()
    if <some condition on x>:
        yield x
    else:
        Q.put(<something derived from x>)


If I change it to,


Q = []
Q.append(x)
for x in Q:
    if <some condition on x>:
        yield x
    else:
        Q.append(<something derived from x>)


then it runs approximately 3 times more quickly. I seem to remember
(from somewhere) that the language specification doesn't guarantee that
the latter will continue to work, but that it's unlikely that future
implementations will break it. Apart from that, is there any good reason
why I shouldn't use the list? Is there another approach that might avoid
the overhead of using a queue? Results from profiling the two
implementations below in case it makes anything clearer. TIA.

Duncan


 88752234 function calls in 68.603 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   68.603   68.603 <string>:1(<module>)
     1009   18.343    0.018   68.602    0.068 bit_trees.py:699(paired_tries_search)
  8425934    6.116    0.000   10.373    0.000 bit_trees.py:751(hamdist)
        1    0.000    0.000    0.000    0.000 bit_trees.py:809(cum_shape)
  2350867    5.970    0.000   18.200    0.000 queue.py:115(put)
  2350867    6.114    0.000   16.639    0.000 queue.py:147(get)
        1    0.000    0.000    0.000    0.000 queue.py:199(_init)
  4701735    1.951    0.000    2.686    0.000 queue.py:202(_qsize)
  2350867    1.149    0.000    1.512    0.000 queue.py:206(_put)
  2350867    1.042    0.000    1.446    0.000 queue.py:210(_get)
        1    0.000    0.000    0.000    0.000 queue.py:27(__init__)
  2350868    2.991    0.000    4.397    0.000 queue.py:90(empty)
        3    0.000    0.000    0.000    0.000 threading.py:215(__init__)
  4701734    2.661    0.000    3.742    0.000 threading.py:239(__enter__)
  4701734    2.097    0.000    2.719    0.000 threading.py:242(__exit__)
  4701734    2.483    0.000    3.872    0.000 threading.py:254(_is_owned)
  4701734    8.185    0.000   12.057    0.000 threading.py:334(notify)
        1    0.000    0.000    0.000    0.000 {built-in method _thread.allocate_lock}
  8425934    1.874    0.000    1.874    0.000 {built-in method builtins.bin}
        1    0.000    0.000   68.603   68.603 {built-in method builtins.exec}
  4701736    0.735    0.000    0.735    0.000 {built-in method builtins.len}
  4701734    1.080    0.000    1.080    0.000 {method '__enter__' of '_thread.lock' objects}
  4701734    0.622    0.000    0.622    0.000 {method '__exit__' of '_thread.lock' objects}
  4701734    1.389    0.000    1.389    0.000 {method 'acquire' of '_thread.lock' objects}
  2350867    0.363    0.000    0.363    0.000 {method 'append' of 'collections.deque' objects}
  8425934    2.384    0.000    2.384    0.000 {method 'count' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  4701734    0.649    0.000    0.649    0.000 {method 'items' of 'dict' objects}
  2350867    0.404    0.000    0.404    0.000 {method 'popleft' of 'collections.deque' objects}



32331417 function calls in 26.992 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.142    0.142   26.992   26.992 <string>:1(<module>)
     1009   16.741    0.017   26.851    0.027 bit_trees.py:699(paired_tries_search)
  8425934    5.270    0.000    9.175    0.000 bit_trees.py:755(hamdist)
        1    0.000    0.000    0.000    0.000 bit_trees.py:813(cum_shape)
  8425934    1.576    0.000    1.576    0.000 {built-in method builtins.bin}
        1    0.000    0.000   26.992   26.992 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
  2350867    0.310    0.000    0.310    0.000 {method 'append' of 'list' objects}
  8425934    2.329    0.000    2.329    0.000 {method 'count' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  4701734    0.649    0.000    0.649    0.000 {method 'items' of 'dict' objects}


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: queue versus list

2020-03-19 Thread duncan smith
On 19/03/2020 20:40, MRAB wrote:
> On 2020-03-19 20:08, duncan smith wrote:
>> Hello,
>>    I have generator code along the following lines,
>>
>>
>> Q = queue.Queue()
>> Q.put(x)
>> while not Q.empty():
>>  x = Q.get()
>>  if :
>>  yield x
>>  else:
>>    Q.put()
>>
>>
>> If I change it to,
>>
>>
>> Q = []
>> Q.append(x)
>> for x in Q:
>>  if :
>>  yield x
>>  else:
>>    Q.append()
>>
>>
>> then it runs approximately 3 times more quickly. I seem to remember
>> (from somewhere) that the language specification doesn't guarantee that
>> the latter will continue to work, but that it's unlikely that future
>> implementations will break it. Apart from that, is there any good reason
>> why I shouldn't use the list? Is there another approach that might avoid
>> the overhead of using a queue? Results from profiling the two
>> implementations below in case it makes anything clearer. TIA.
>>
> [snip]

Thanks MRAB and Paul,

> A number of points:
> 
> 1. I'm not sure it's a good idea to mutate a list while iterating over it.
> 

AFAICR that's the bit that depends on the implementation of lists. As
long as iteration involves incrementing an index and indexing into the
list it will work.

> 2. The example with the list retains all of the results, whereas the one
> with the queue consumes them.
> 
> 3. Queues are thread-safe because they're used for communication between
> threads, but lists aren't thread-safe; that might be something to do
> with the difference.
> 
> 4. If it doesn't need to be thread-safe, why not try deques instead?

Bingo. Performance is indistinguishable from that of the list. Thread
safety is unimportant (for my purposes), but the docs at
https://docs.python.org/2/library/collections.html#collections.deque
state "Deques support thread-safe, memory efficient appends and pops
from either side of the deque with approximately the same O(1)
performance in either direction". Looking at the output from the
profiler for my implementation with the queue it looks like a deque is
being used but there's various other stuff happening relating to thread
safety. Is the lesson from this that if all you're doing is appending
and popping, then use a deque (rather than a queue) even if you want
thread safety? (It makes quite a difference in performance for my use
case.) Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: queue versus list

2020-03-20 Thread duncan smith
On 20/03/2020 21:57, Barry wrote:
> 
> 
>> On 20 Mar 2020, at 00:42, duncan smith  wrote:
>>
>> Bingo. Performance is indistinguishable from that of the list. Thread
>> safety is unimportant (for my purposes), but the docs at
>> https://docs.python.org/2/library/collections.html#collections.deque
>> state "Deques support thread-safe, memory efficient appends and pops
>> from either side of the deque with approximately the same O(1)
> 
> I did a performance test of list vs sequel as a fifo that had 10k elements in 
> them.
> Then I timed insert elements at the tail and popping from the head.
> 
> Duque is 10x faster then list, rest run macOS python 3.8
> 
> Barry
> 
> 

Yes, I avoided the cost of popping elements from the head by just
iterating over the list. So the timings came out very similar. But the
deque allows me to remove those elements efficiently. It's also neat
because it lets me switch from what is basically a breadth first search
of a graph to a depth first search by simply changing popleft() to
pop(). Cheers.
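
Roughly this pattern, with illustrative names (children() being whatever
yields a node's successors):

from collections import deque

def traverse(root, children, depth_first=False):
    Q = deque([root])
    while Q:
        node = Q.pop() if depth_first else Q.popleft()
        yield node
        Q.extend(children(node))

print(list(traverse(1, lambda node: [node * 2, node * 2 + 1] if node < 8 else [])))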

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


inheriting from long on Python2 or int on Python3

2020-03-21 Thread duncan smith
Hello,
  I have a class I wrote for Python2 that needs to be made to work
on Python3. On Python2 it inherited from long, so it needs to inherit
from int on Python3. The following works, but I'm not sure it's the
cleanest solution. It's the first time I've wanted / needed to make a
parent class conditional on the Python version, and I'm not sure it's
even a good idea. I used inheritance because I wanted to perform bitwise
operations on instances, but I could probably switch to composition.
Just wondering how bad the following really is (in a code smell kind of
a way). TIA.

Duncan


import sys

class C_hash(int if sys.version_info.major >= 3 else long):
    def __new__(cls, bits, m, N=1):
        obj = super(C_hash, cls).__new__(cls, bits)
        obj._m = m
        obj._N = N
        return obj
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: inheriting from long on Python2 or int on Python3

2020-03-21 Thread duncan smith
On 21/03/2020 21:55, Skip Montanaro wrote:
>> import sys
>>
>> class C_hash(int if sys.version_info.major >= 3 else long):
> ...
> 
> There's nothing incorrect with your solution. Chris offered another. I
> was going to suggest you consider running your code through 2to3 to
> see what it does. I created a C_hash class similar to yours:
> 
> class C_hash(long):
> def __new__(cls, bits, m, N=1):
> obj = super(C_hash, cls).__new__(cls, bits)
> obj._m = m
> obj._N = N
> return obj
> 
> def meth(self):
> pass
> 
> When I ran it through 2to3, it didn't make any transformations, which
> I thought odd. Still, assuming I did something wrong, you might poke
> around with it a bit to see if you can get it to do the work for you.
> 
> Skip
> 

According to the docs (https://docs.python.org/2/library/2to3.html) it
should have renamed long to int. I also tried something closer to
Chris's suggestion,


try:
    long
except NameError:
    long = int

class C_hash(long):
    # etc.


but (tentatively) opted for something that avoided creating new names at
the module level. If it doesn't look too terrible I might stick with it
for now (although I am quite tempted to go back to the above). Thanks
Skip and Chris.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Confused about np.polyfit()

2020-07-19 Thread duncan smith
On 19/07/2020 11:19, Dino wrote:
> 
> Hi, I am looking at someone else's code trying to understand their use
> of numpy.polyfit.
> 
> My understanding was that you can use it to fit polynomials, but
> apparently, the original author has used it for logarithmic and
> exponential curves as well this way:
> 
> Logarithmic
> 
>     def fit_model(self):
>     self.coefficients = np.polyfit(np.log(self.x), self.y, 1)
> 
> Exponential
> 
>     def fit_model(self):
>     self.coefficients = np.polyfit(self.x, np.log(self.y), 1)
> 
> is this an ok use of np.polyfit? Will it yield the expected result.
> 
> Thanks

It depends on what you expect the result to be. There's nothing
inherently wrong with transforming variables before using least squares
fitting. Whether it gives you the "best" estimates for the coefficients
is a different issue.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Confused about np.polyfit()

2020-07-19 Thread duncan smith
On 19/07/2020 16:11, Dino wrote:
> On 7/19/2020 4:54 PM, duncan smith wrote:
>>
>> It depends on what you expect the result to be. There's nothing
>> inherently wrong with transforming variables before using least squares
>> fitting. Whether it gives you the "best" estimates for the coefficients
>> is a different issue.
> 
> Thanks a lot for this, Duncan. I guess I have to follow-up questions at
> this point:
> 
> 1) in which ways is this approach sub-optimal?
> 
> 2) what's the "right" way to do it?
> 
> Thank you
> 

You'll have to read up a bit on ordinary least squares (e.g.
https://en.wikipedia.org/wiki/Ordinary_least_squares). It is based on
assumptions that might not necessarily hold for a given model / dataset.
Depending on which assumptions are violated the estimates can be
affected in different ways. It is usual to fit the model, then check the
residuals to see if the assumptions (approximately) hold. If not, it
might indicate a poor model fit or suggest fitting a transformed model
(to estimate the same coefficients while satisfying the assumptions).
e.g. For the latter case,

Y = a + bX

has the same coefficients as

Y/X = a * 1/X + b

but the latter regression might satisfy the assumption of constant
variance for the errors.

Regression analysis is a bit of an art, and it's a long time since I did
any. Ordinary least squares is optimal in a certain sense when the
assumptions hold. When they don't there's no single answer to what the
best alternative is (unless it's to employ a good statistician).
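
As a small made-up illustration with numpy (not the model from the
original code):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=x)          # error spread grows with x

b_hat, a_hat = np.polyfit(x, y, 1)               # fit Y = a + bX directly
a_hat2, b_hat2 = np.polyfit(1.0 / x, y / x, 1)   # fit Y/X = a*(1/X) + b
print(a_hat, b_hat)      # estimates of a and b from the raw fit
print(a_hat2, b_hat2)    # estimates of the same a and b after transforming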

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Find word by given characters

2020-11-01 Thread duncan smith
On 01/11/2020 13:38, Bischoop wrote:
> On 2020-11-01, Bischoop  wrote:
>> I'm working on a script i which user inputs letters and then a printed
>> words containing those letters. The scripts works however I can't solve
>> one problem , it prints also words in which these letters occur more 
>> than once.
>> ---
>> Fore example:
>> Letters: at
>> Output: auto, autobahn.
>> ---
>>
>> I supposed to not print word: "autobahn" because I've given it only one
>> letter "a".
>>
> 
> Problem's solved.
> 
> 
> for word in words:
> if all(word.count(x) == letters.count(x) for x in letters):
> print(word)
> -
> 
> Thanks for suggestions, I didn't had to use counter but going to look
> into as it seems useful and interesting.
> 
> Thank You
> 

But this regenerates the counts for letters for each word. They only need
to be generated once (outside the for loop). And using count() requires
iterating over letters / the word once for each x in letters (rather than
once in total). For a large enough collection of words you'd notice the
difference.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Find word by given characters

2020-11-02 Thread duncan smith
On 02/11/2020 19:09, dn wrote:
> On 02/11/2020 23:29, Bischoop wrote:
>> On 2020-11-01, duncan smith  wrote:
>>>>
>>>
>>> But this generates the letters counts for each word. They only need to
>>> be generated once (outside the for loop). And using count requires
>>> iterating over the letters / words for each x in letters (rather than
>>> once). For a large enough collection of words you'd notice the
>>> difference.
>>>
>>
>> You're pretty much right, it doesn't go too fast.
>> I've done this years ago with Python2, had a bit free time now and
>> wanted to ReLearn Python from starting old projects I've done in past
>> but can't remember how I made it working lol.
> 
> 
> If you have a working Py2 version, once print-statements were changed
> into functions, what errors were thrown-up?
> 
> 
> Multiple loops written in Python are likely to be slower than same in
> compiled code - which was probably part of the motivation for @Terry's
> response. Plus, "re-use" - why write something ourselves if someone else
> has already done the work?
> 
> 
> How about a change of tactics?
> 
> - str.find() or .index() will locate a character within the string
> (starting from character[0]/the left-hand side)
> - if this fails, tears will fall...
> - repeat, from the right
> - if both results are the same character/position, it must be unique
> within the string
> - repeat for each character in "Letters"
> 

[snip]

But he appears to need the count in order to compare against the counts
in letters. I assume letters could be something like 'att', in which
case knowing a word contains more than one 't' doesn't cut it. He could
try iterating over the characters in each word once, checking that no
count from letters is exceeded, and that the number of remaining
characters gives some chance that the relevant counts from letters could
be achieved. In the worst case (e.g. a conforming word) this requires
iterating over all the characters in a word once. That's not to say it
would actually run quicker 'on average' than a solution using Counter.
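
A rough, untested sketch of that kind of check (my names, and assuming exact
equality of counts is what's required):

from collections import Counter

def conforms(word, letter_counts, needed_total):
    # a single pass over the word, bailing out as early as possible
    seen = Counter()
    remaining = len(word)
    still_needed = needed_total        # characters from letters not yet matched
    for char in word:
        remaining -= 1
        if char in letter_counts:
            seen[char] += 1
            if seen[char] > letter_counts[char]:
                return False           # a count from letters has been exceeded
            still_needed -= 1
        if still_needed > remaining:
            return False               # too few characters left to reach the counts
    return still_needed == 0

letter_counts = Counter('att')
print(conforms('tolerant', letter_counts, 3))   # True
print(conforms('auto', letter_counts, 3))       # False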

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Find word by given characters

2020-11-03 Thread duncan smith
On 03/11/2020 14:06, Bischoop wrote:
> On 2020-11-03, dn  wrote:
>>
>>
>> The (full) specs are not clear. There's certainly room for 
>> misunderstanding. I'd be happier if I could 'see' a full spec or 
>> recognise a practical application, because then we'd be better able to 
>> discuss facts. Meantime, we try to help with what we have been given...
>>
>>
>> The OP's sample code only realises "conforming word[s]" (good term 
>> BTW!). The snippet does not report any "count". Similarly, there's no 
>> facts to avoid ("I assume") an assumption about whether "Letters" may 
>> include duplication - indeed: whether duplication has meaning or should 
>> be regarded as user-error.
>>
> 
> Let me clarify what I want to do:
> We all know the game of Scrabble.
> There's a file with a dictionary containing one word per line. The idea is
> to input some letters, for example the ones mentioned earlier: att. The script
> is supposed to find all words in which the letters 'a','t','t' occur and print
> these words. It's supposed not to print the word 'auto' because there's only
> one 't', but words such as 'toast', 'toasty', 'tolerant' meet the
> criteria because these words contain one 'a' and two 't's, as the user
> input.
> I've checked collections.Counter but it seems too complicated for me at this
> stage and I'm struggling to apply it.
> 

>>> from collections import Counter
>>> letters = 'att'
>>> letter_counts = Counter(letters)
>>> word = 'tolerate'
>>> wd_counts = Counter(word)
>>> for char, cnt in letter_counts.items():
        print (cnt == wd_counts[char])


True
True
>>> word = 'auto'
>>> wd_counts = Counter(word)
>>> for char, cnt in letter_counts.items():
        print (cnt == wd_counts[char])


True
False
>>>

or, equivalent to the above loop, but breaking when the first False is
generated and returning the single Boolean that you're after,

>>> all(cnt == wd_counts[char] for char, cnt in letter_counts.items())
False
>>>

There's still a lot of scope for improvement, but possibly not by doing
simple things.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Find word by given characters

2020-11-03 Thread duncan smith
On 03/11/2020 23:35, Bischoop wrote:
> On 2020-11-03, duncan smith  wrote:
>>>
>>
>>>>> from collections import Counter
>>>>> letters = 'att'
>>>>> letter_counts = Counter(letters)
>>>>> word = 'tolerate'
>>>>> wd_counts = Counter(word)
>>>>> for char, cnt in letter_counts.items():
>>  print (cnt == wd_counts[char])
>>
>>  
>> True
>> True
>>>>> word = 'auto'
>>>>> wd_counts = Counter(word)
>>>>> for char, cnt in letter_counts.items():
>>  print (cnt == wd_counts[char])
>>
>>  
>> True
>> False
>>>>>
>>
>> or, equivalent to the above loop, but breaking when the first False is
>> generated and returning the single Boolean that you're after,
>>
>>>>> all(cnt == wd_counts[char] for char, cnt in letter_counts.items())
>> False
>>>>>
>>
>> There's still a lot of scope for improvement, but possibly not by doing
>> simple things
> 
> 
> lol
> 
> I'm thinking about it for a couple days and you guys coming with few
> ideas to it like it's nothing.
> Pity I've wasted a decade with a woman, now probably too old to learn
> anything new.
> 

It looks like you've learnt something about wasting time with women ;-).
Keep tinkering as long as you're enjoying it (or getting paid for it)
and pick up knowledge of relevant data structures and algorithms as you
go. (That's what I've done. I'm actually a statistician.) The above
solution uses bags (implemented in Python as Counter instances), and the
(repeated) generation of the letters bag was hoisted outside the for
loop. (If the dictionary was going to be searched for multiple sets of
letters the generation of the word bags would also be hoisted.) There's
also the idea of breaking out of the loop once the first False is
generated (implemented most neatly by using all). So there might be
something useful to take from it.

A bag was the obvious data structure because the solution depends only
on the unique characters in a string and their frequencies. But that's
thinking about the problem word by word. We actually have a dictionary
of words, and a dictionary can be structured so that it can be searched
much more efficiently than 'word by word'. The particular data structure
I have in mind is not (yet) in the standard Python library. That's maybe
worth looking at when you've got the word by word approach grokked.
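
For what it's worth, hoisting the word bags as well (worthwhile if the same
word list is searched for many different sets of letters) might look
something like this sketch:

from collections import Counter

def build_word_bags(words):
    # done once, up front, when the dictionary file is loaded
    return [(word, Counter(word)) for word in words]

def search(word_bags, letters):
    letter_counts = Counter(letters)
    return [word for word, bag in word_bags
            if all(bag[char] == cnt for char, cnt in letter_counts.items())]

word_bags = build_word_bags(['auto', 'toast', 'toasty', 'tolerant'])
print(search(word_bags, 'att'))   # ['toast', 'toasty', 'tolerant']
print(search(word_bags, 'at'))    # ['auto']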

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Find word by given characters

2020-11-03 Thread duncan smith
On 03/11/2020 22:14, Avi Gross wrote:
> I, too, have wondered what exactly the point was of the required
> functionality for SCRABBLE but note you can extend a current word so
> additional letters may be available in a game but only if they are an exact
> fit to put before, after, or in middle of your word. 
> 
> But this seems to be a fairly simple problem to solve unless I misunderstand
> it. Elegance aside, what would be wrong with this approach.
> 
> - Read a word at a time in a loop from the file of dictionary words
> (non-Python meaning of dictionary.) For each one do the following, perhaps
> using a function:
> 
> Break the current word into a list of individual letters.
> Loop over the letters you want and:
>   If the letter is in the list, remove it and continue
>   Else skip the current word as it is not a match.
> 
> At the end of each of the above loops, you only reached here if all the
> letters were found and removed. If the list is now empty, fine. If it has
> extra remaining letters, also fine by the requirements stated. Letters in
> the list multiple times are removed multiple times.
> 
> The above need not use  list of letters and can be done many other ways but
> seems conceptually simple. Each word is processed once. It can be converted
> to using a list comprehension or something similar by using "all" and so on.
> 
> Or am I missing something about other approaches being better or more
> efficient or ... And, yes, counting may have an advantage as the list does
> not need to be modified repeatedly but creating an object or three also has
> overhead.

[snip]

The Counter approach only requires iterating over the letters once to
construct the letters bag, then each word once to create the relevant
word bag. After that it's (at worst) a couple of lookups and a
comparison for each unique character in letters (for each word).

Your approach requires iteration over the words to create the lists of
characters. Then they are (potentially) iterated over multiple times
looking for the characters in letters. There's also the cost of removing
items from arbitrary positions in the lists. Also, it seems that the
character frequencies must be equal, and your approach only ensures that
the words contain at least the required numbers of characters.

In computational terms, if the first approach is something like O(n+m)
for n letters and words of length m, your algorithm is more like O(nm).
Not to say that it will be slower for all possible letters and
dictionaries, but probably for any realistic cases and a lot slower for
large enough dictionaries.
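
For reference, the list-removal check described above boils down to something
like this (my paraphrase of it, not Avi's actual code), which accepts
'attention' for the letters 'att' because it only checks that at least the
required counts are present:

def at_least_contains(word, letters):
    chars = list(word)                 # break the word into a list of letters
    for letter in letters:
        if letter in chars:            # linear scan of the list
            chars.remove(letter)       # plus a linear-time removal
        else:
            return False               # a required letter is missing
    return True

print(at_least_contains('attention', 'att'))   # True: at least one 'a', two 't's
# the Counter comparison would reject 'attention' (three 't's rather than two)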

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Find word by given characters

2020-11-04 Thread duncan smith
On 04/11/2020 04:21, Avi Gross wrote:
> Duncan, my comments below yours at end.
> 
> ---YOURS---
> The Counter approach only requires iterating over the letters once to
> construct the letters bag, then each word once to create the relevant word
> bag. After that it's (at worst) a couple of lookups and a comparison for
> each unique character in letters (for each word).
> 
> Your approach requires iteration over the words to create the lists of
> characters. Then they are (potentially) iterated over multiple times looking
> for the characters in letters. There's also the cost of removing items from
> arbitrary positions in the lists. Also, it seems that the character
> frequencies must be equal, and your approach only ensures that the words
> contain at least the required numbers of characters.
> 
> In computational terms, if the first approach is something like O(n+m) for n
> letters and words of length m, your algorithm is more like O(nm).
> Not to say that it will be slower for all possible letters and dictionaries,
> but probably for any realistic cases and a lot slower for large enough
> dictionaries.
> 
> Duncan
> 
> --MINE---
> 
> I appreciate your analysis. I have not looked at the "counter"
> implementation and suspect it does some similar loops within, albeit it may
> be implemented in a compiled language like C++.
> 

Before the introduction of counters I would have used a dict to create a
mapping of characters to counts. That would require iterating over the
characters in a string only once to create / update the dict entries,
and I doubt counters are implemented less efficiently than that.
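
i.e. something like this quick sketch:

def char_counts(s):
    # one pass over the string; dict entries created / updated as we go
    counts = {}
    for char in s:
        counts[char] = counts.get(char, 0) + 1
    return counts

print(char_counts('att'))   # {'a': 1, 't': 2}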

> I did not write out my algorithm in Python but have done it for myself. It
> runs fast enough with most of the time spent in the slow I/O part. 
> 
> We can agree all algorithms have to read in all the words in a data file.
> There may be ways to store the data such as in a binary tree and even ways
> to thus prune the search as once a node is reached where all required
> letters have been found, all further words qualify below that point. If you
> match say "instant" then instants and instantiation would be deeper in the
> tree and also qualify assuming extra letters are allowed. 

I don't see how a binary tree would be useful. As I've pointed out in
another post, there are other data structures that could be useful. What
I had in mind was a trie (prefix tree). But it's only the distinct
characters and frequencies that are relevant and so I'd exploit that
(and one or two other things) to reduce space and improve search.

We may differ on
> the requirement as I think that the number of repeats for something like
> a,t,t require to be at least as big as in "attend" but that "attention" with
> yet another "t" would also be OK. If I am wrong, fine, but I note the person
> requesting this has admitted a certain lack of credentials while also
> claiming he made up a scenario just for fun. So this is not actually a
> particularly worthy challenge let alone with a purpose.
> 
> My impression is that the average word length would be something small like
> 5-7. The number of words in a dictionary might be 100,000 or more. So if you
> want efficiency, where do you get the bang for the buck? 
> 
> I would argue that a simple test run on all the words might often narrow the
> field to a much smaller number of answers like just a few thousand or even
> much less. Say you test for the presence of "aeiou" in words, in whatever
> order. That might be done from reading a file and filtering out a relatively
> few potential answers. You can save those for  second round to determine if
> they are fully qualified by any additional rules that may involve more
> expensive operations.
> 

Your proposed approach didn't involve any trees (or tries) or filtering
of words. So I don't see how any of this justifies it.

> How fast (or slow) are regular expressions for this purpose? Obviously it
> depends on complexity and something like 
> "^[^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*[aeiou]
> [^aeiou]*[aeiou] [^aeiou]*$"
> 
> would be easy to construct once but likely horribly inefficient in searching
> and a bit of overkill here. I suspect there is already some simple C
> function that could be used from within python that looks like
> findall(choices, word) that might return how many of the letters in choices
> were found in word and you simply comparer that to length(word) perhaps more
> efficiently.
> 
> It looks easier to check if a character exists in one of the ways already
> discussed within python using a loop as discussed. Something as simple as
> this:
> 
>   needed = "aeiou"
>   trying = "education"
>   found = all([trying.find(each) >= 0  for each in needed ])
>   print(found)
> 
>   trying = "educated"
>   found = all([trying.find(each) >= 0  for each in needed ])
>   print(found)
>   The above prints True and then False.
> 
> My point is you can use the above to winnow down possible answers and only
> subject that smaller

Re: Find word by given characters

2020-11-04 Thread duncan smith
On 04/11/2020 19:12, Avi Gross wrote:
> My comments at end:
> 
> -Original Message-
> From: Python-list  On
> Behalf Of duncan smith
> Sent: Wednesday, November 4, 2020 1:09 PM
> To: python-list@python.org
> Subject: Re: Find word by given characters
> 
> On 04/11/2020 04:21, Avi Gross wrote:
>> Duncan, my comments below yours at end.
>>
>> ---YOURS---
>> The Counter approach only requires iterating over the letters once to 
>> construct the letters bag, then each word once to create the relevant 
>> word bag. After that it's (at worst) a couple of lookups and a 
>> comparison for each unique character in letters (for each word).
>>
>> Your approach requires iteration over the words to create the lists of 
>> characters. Then they are (potentially) iterated over multiple times 
>> looking for the characters in letters. There's also the cost of 
>> removing items from arbitrary positions in the lists. Also, it seems 
>> that the character frequencies must be equal, and your approach only 
>> ensures that the words contain at least the required numbers of
> characters.
>>
>> In computational terms, if the first approach is something like O(n+m) 
>> for n letters and words of length m, your algorithm is more like O(nm).
>> Not to say that it will be slower for all possible letters and 
>> dictionaries, but probably for any realistic cases and a lot slower 
>> for large enough dictionaries.
>>
>> Duncan
>>
>> --MINE---
>>
>> I appreciate your analysis. I have not looked at the "counter"
>> implementation and suspect it does some similar loops within, albeit 
>> it may be implemented in a compiled language like C++.
>>
> 
> Before the introduction of counters I would have used a dict to create a
> mapping of characters to counts. That would require iterating over the
> characters in a string only once to create / update the dict entries, and I
> doubt counters are implemented less efficiently than that.
> 
>> I did not write out my algorithm in Python but have done it for 
>> myself. It runs fast enough with most of the time spent in the slow I/O
> part.
>>
>> We can agree all algorithms have to read in all the words in a data file.
>> There may be ways to store the data such as in a binary tree and even 
>> ways to thus prune the search as once a node is reached where all 
>> required letters have been found, all further words qualify below that 
>> point. If you match say "instant" then instants and instantiation 
>> would be deeper in the tree and also qualify assuming extra letters are
> allowed.
> 
> I don't see how a binary tree would be useful. As I've pointed out in
> another post, there are other data structures that could be useful. What I
> had in mind was a trie (prefix tree). But it's only the distinct characters
> and frequencies that are relevant and so I'd exploit that (and one or two
> other things) to reduce space and improve search.
> 
> We may differ on
>> the requirement as I think that the number of repeats for something 
>> like a,t,t require to be at least as big as in "attend" but that 
>> "attention" with yet another "t" would also be OK. If I am wrong, 
>> fine, but I note the person requesting this has admitted a certain 
>> lack of credentials while also claiming he made up a scenario just for 
>> fun. So this is not actually a particularly worthy challenge let alone
> with a purpose.
>>
>> My impression is that the average word length would be something small 
>> like 5-7. The number of words in a dictionary might be 100,000 or 
>> more. So if you want efficiency, where do you get the bang for the buck?
>>
>> I would argue that a simple test run on all the words might often 
>> narrow the field to a much smaller number of answers like just a few 
>> thousand or even much less. Say you test for the presence of "aeiou" 
>> in words, in whatever order. That might be done from reading a file 
>> and filtering out a relatively few potential answers. You can save 
>> those for  second round to determine if they are fully qualified by 
>> any additional rules that may involve more expensive operations.
>>
> 
> Your proposed approach didn't involve any trees (or tries) or filtering of
> words. So I don't see how any of this justifies it.
> 
>> How fast (or slow) are regular expressions for this purpose? Obviously 
>> it depends on complexity and something like "^[^aeiou]*[aeiou] 
>> [^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*[aei

Re: When someone from Britain speaks, Americans hear a "Britishaccent"...

2005-10-07 Thread Duncan Smith
Rocco Moretti wrote:
> Steve Holden wrote:
> 
>>> On Fri, 07 Oct 2005 00:33:43 -, Grant Edwards <[EMAIL PROTECTED]>
>>> wrote:
> 
> 
>>>> For example: In British English one uses a plural verb when the
>>>> subject consists of more than one person.  Sports teams,
>>>> government departments, states, corporations etc. are grammatically
>>>> plural.  In American, the verb agrees with the
>>>> word that is the subject, not how many people are denoted by
>>>> that word.
>>
>>
>> There aren't any universal rules, except possibly "British people
>> speak English while Americans don't". 
> 
> 
> I believe you overgeneralize. :)
> 
> A Welshman would likely be offended if you implied he spoke English, and
> the Scots are notorious for only speaking English when they have to. (I
> remember a news story some years back about a Scottish "lad" who was
> fined/imprisoned for replying to an official court representative with
> "Aye" rather than "Yes".) For that matter there are plenty of people in
> Cornwall and even in London (Cockney) who speak something that is only
> called "English" for lack of a better term.
> 

So English is spoken only in the South East of England, except London?
I think you should also disbar the queen (unless she's already
classified as a Londoner), due to her apparent confusion between the 1st
person singular and 1st person plural :-).

Duncan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: When someone from Britain speaks, Americans hear a "Britishaccent"...

2005-10-08 Thread Duncan Smith
Steve Holden wrote:
> Duncan Smith wrote:
> 
>> Rocco Moretti wrote:
> 
> [...]
> 
>>
>> So English is spoken only in the South East of England, except London?
>> I think you should also disbar the queen (unless she's already
>> classified as a Londoner), due to her apparent confusion between the 1st
>> person singular and 1st person plural :-).
>>
> There are special rules for the monarchs, who are expected to refer to
> themselves in the first person plural.
> 
> Oscar Wilde understood this. When he boasted that he could speak
> extempore for a minute on any subject of a challenger's choosing someone
> shouted "The Queen", to which he replied "The Queen, sir, is not a
> subject".
> 

Yes, although I'm not actually sure where the 'royal we' comes from; and
we (Brits) are technically subjects rather than citizens.  But if
northerners are not English speakers because we use words like 'aye'
(although we say far less understandable things than that) I think the
queen should be similarly classified for using the 'royal we'  :-).

Duncan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: compare list

2005-11-14 Thread Duncan Smith
Brian van den Broek wrote:
> Ben Bush said unto the world upon 2005-11-14 05:51:
> 
>> I have four lists:
>> lisA=[1,2,3,4,5,6,9]
>> lisB=[1,6,5]
>> lisC=[5,6,3]
>> lisD=[11,14,12,15]
>> how can I write a function to compare lisB, lisC and lisD with lisA,
>> if they
>> share two continuous elements (the order does not matter), then return 1,
>> otherwise return 0. For example, lisA, lisB and lisC have 5,6 or 6,5 so
>> these comparison will return 1 but the comparison between lisD and lisA
>> return 0.
>> -- 
>> Thanks!
>> Ben Bush
> 
> 
> Hi Ben,
> 
> the code below is tested no further than shown. And, I'm no expert -- I
> imagine there are better ways. With luck, I'll learn them in a follow-up
> from someone more skilled :-)
> 
> def list_common_continuity_comp(list1, list2):
>     for item in list1:
>         if item in list2:
>             if item + 1 in list1 and item + 1 in list2:
>                 return True
>     return False
> 

[snip]

That's potentially very expensive for large lists, although simply
converting the lists to sets should give a significant speed up.  I
think the following is *possibly* as good a way as any.

>>> def compare(list1, list2):
        intersection = sorted(list(set(list1) & set(list2)))
        for i in range(len(intersection) - 1):
            if intersection[i] == intersection[i+1] - 1:
                return True
        return False
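
e.g. with the lists from the original post:

>>> lisA = [1, 2, 3, 4, 5, 6, 9]
>>> lisB = [1, 6, 5]
>>> lisD = [11, 14, 12, 15]
>>> compare(lisA, lisB)
True
>>> compare(lisA, lisD)
False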

Duncan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: all possible combinations

2005-07-13 Thread Duncan Smith
rbt wrote:
> On Wed, 2005-07-13 at 10:21 -0400, rbt wrote:
> 
>>Say I have a list that has 3 letters in it:
>>
>>['a', 'b', 'c']
>>
>>I want to print all the possible 4 digit combinations of those 3
>>letters:
>>
>>4^3 = 64
>>
>>
>>abaa
>>aaba
>>aaab
>>acaa
>>aaca
>>aaac
>>...
>>
>>What is the most efficient way to do this? 
> 
> 
> Expanding this to 4^4 (256) to test the random.sample function produces
> interesting results. It never finds more than 24 combinations out of the
> possible 256. This leads to the question... how 'random' is sample ;)
> 
> Try it for yourselves:
> 
> test = list('1234')
> 
> combinations = []
> while 1:
>     combo = random.sample(test, 4)
>     possibility = ''.join(combo)
>     if possibility not in combinations:
>         print possibility
>         combinations.append(possibility)
>         continue
>     else:
>         continue
> 

There are only 24 possible lists that random.sample(test, 4) can return,
corresponding to the 24 possible orderings of 4 items.  i.e.
random.sample() samples without replacement.  You need to sample with
replacement.
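
e.g. a quick sketch of the difference (not a drop-in replacement for the code
above):

import random

test = list('1234')
seen = set()
for _ in range(10000):
    # sampling WITH replacement: each of the 4 positions is drawn independently
    possibility = ''.join(random.choice(test) for _ in range(4))
    seen.add(possibility)
print(len(seen))   # will (almost certainly) reach 4**4 == 256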

Duncan
-- 
http://mail.python.org/mailman/listinfo/python-list


  1   2   3   >