Re: Python good for data mining?
[I've just seen this thread. Although it might be a bit late, let me add a couple of clarifications.]

On 6 Nov, 03:09, "D.Hering" <[EMAIL PROTECTED]> wrote:
> On Nov 5, 10:29 am, Maarten <[EMAIL PROTECTED]> wrote:
> > As for pytables: it is the most elegant programming interface for HDF
> > on any platform that I've encountered so far. Most other platforms
> > stay close to the HDF5 library C interface, which is low-level and quite
> > complex. PyTables was written with the end-user in mind, and it shows.
> > One correction though: PyTables is not a database: it is a storage for
> > (large) arrays, datablocks that you don't want in a database. Use a
> > database for the metadata to find the right file and field within that
> > file. Keep in mind though that I mostly work with externally created
> > HDF5 files, not with files created in pytables. PyTables Pro has an
> > indexing feature which may be helpful for data mining (if you write the
> > HDF5 files from Python).
> >
> > Maarten
>
> Hi Maarten,
>
> I respectfully disagree that HDF5 is not a DB. It's true that HDF5 on
> its face is not relational but rather hierarchical.

Yeah. This largely depends on what we understand by a DB. Lately, RDBMSs are used everywhere, and we tend to believe that they are the only entities that can truly be called DBs. However, in a less restrictive view, even a text file can be considered a DB (in fact, many DBs have been implemented using text files as a base). So, I wouldn't say that HDF5 is not a DB, but just that it is not an RDBMS ;)

> Hierarchical is truly a much more natural/elegant [1] design from my
> perspective. HDF has always had metadata capabilities, and with the
> new 1.8 beta version available, it is increasing its abilities with
> 'references/links', allowing for pure/partial relational datasets,
> groups, and files, as well as storing self-implemented indexing.
>
> The C API is obviously much more low-level, and PyTables does not yet
> support these new features.

That's correct. And although it is quite possible that we, the PyTables developers, will end up incorporating some relational features into it, we also recognize that PyTables does not intend (and was never in our plans) to be a competitor to a pure RDBMS, but rather a *teammate* (as is clearly stated on the www.pytables.org home page). In our opinion, PyTables opens the door to a series of capabilities that are not found in a typical RDBMS, like hierarchical classification, multidimensional datasets, powerful table entities that are able to deal with multidimensional columns or nested records, but most especially, the ability to work with extremely large amounts of data in a very easy way, without having to give up first-class speed.

> [1] Anything/everything that is physical/virtual, or can be conceived,
> is hierarchical... if the system itself is not random/chaotic. That's a
> lovely revelation I've had... EVERYTHING is hierarchical. If it has
> context it has hierarchy.

While I agree that this sentence holds some truth, it is also known that a lot of things (perhaps many more than we think) in the universe fall squarely into the domain of the random/chaotic ;) IMO, the wisest path is to recognize the strengths (and weaknesses) of each approach and use whatever fits your needs best. If you need the best of both, then go ahead and choose an RDBMS in combination with a hierarchical DB, and use the powerful capabilities of Python to get the most out of them.

Cheers,

Francesc Altet
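To make the "hierarchical DB" idea concrete, here is a minimal sketch using today's PyTables API; the file, group, and attribute names are made up for illustration and are not from the thread. Groups act like directories, leaves hold the data, and metadata travels with the node itself:

import numpy as np
import tables as tb

# Write: a group per run, an array leaf inside it, metadata as node attributes.
with tb.open_file("experiment.h5", mode="w") as f:
    run = f.create_group("/", "run001", title="First acquisition run")
    signal = f.create_array(run, "signal", np.random.rand(1000), title="Raw signal")
    # Metadata lives on the node itself; no separate relational table is needed.
    signal.attrs.sensor_id = 42
    signal.attrs.units = "mV"

# Read back by path, just like navigating a directory tree.
with tb.open_file("experiment.h5", mode="r") as f:
    node = f.get_node("/run001/signal")
    print(node.attrs.sensor_id, node.read()[:5])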
Re: dynamic allocation file buffer
On 12 Sep, 14:39, "Aaron \"Castironpi\" Brady" <[EMAIL PROTECTED]> wrote:
> > A consideration of other storage formats such as HDF5 might
> > be appropriate:
> >
> > http://hdf.ncsa.uiuc.edu/HDF5/whatishdf5.html
> >
> > There are, of course, HDF5 tools available for Python.
> > PyTables came up within the past few weeks on the list.
>
> "When the file is created, the metadata in the object tree is updated
> in memory while the actual data is saved to disk. When you close the
> file the object tree is no longer available. However, when you reopen
> this file the object tree will be reconstructed in memory from the
> metadata on disk"
>
> This is different from what I had in mind, but the extremity depends
> on how slow the 'reconstructed in memory' step is.
> (From http://www.pytables.org/docs/manual/ch01.html#id2506782). The
> counterexample would be needing random access into multiple data
> files, which don't all fit in memory at once, but the maturity of the
> package might outweigh that. Reconstruction will form a bottleneck
> anyway.

Hmm, this was part of the documentation that needed updating. Nowadays, the object tree is reconstructed in a lazy way (i.e. on demand), precisely in order to avoid the bottleneck you mention. I have corrected the docs here:

http://www.pytables.org/trac/changeset/3714/trunk

Thanks for (indirectly ;-) bringing this to my attention,

Francesc
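For readers wondering what "lazy" means in practice, a small sketch (the file and node names are assumptions, not from the thread): opening the file only reads the root metadata, and a node's own metadata is loaded the first time that node is visited.

import tables as tb

# Opening a PyTables file reads only the root metadata; the rest of the
# object tree is built lazily, as nodes are first visited.
f = tb.open_file("experiment.h5", mode="r")   # cheap, even for files with many nodes
node = f.get_node("/run001/signal")           # this node's metadata is loaded now
print(node[:10])                              # the actual data read happens here
f.close()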
Invoking setup.py in sub-packages
Hi,

I'd like to set up a package that is made of other sub-packages, modules and other extensions. What I have is something like this (very simplified indeed):

/
    __init__.py
    setup.py
    foo1/
        __init__.py
        foo1.c
        [...]
    foo2/
        setup.py
        __init__.py
        foo2.c
        [...]

Now, I'd like to make a setup.py for package 'foo' that is able to compile the 'foo1' extension directly, but invoke 'foo2/setup.py' in order to generate the 'foo2' sub-package (which contains an extension itself).

I'm currently trying:

setup(name = "foo",
      ...
      ext_modules = [ Extension( "foo.foo1", sources=["foo1.c"] ) ],
      packages = ["foo.foo1", "foo.foo2"],
      ...
      )

Of course, the foo2 package is not built at all. Is there a way to get 'foo2/setup.py' invoked automagically so that it generates the desired extensions? Or do all the instructions to generate the extensions in sub-packages have to be hard-wired into the main 'foo/setup.py'?

Thanks!
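For what it's worth, plain distutils has no built-in way to delegate to a nested setup.py (numpy.distutils offers a sub-package mechanism, if that is an option), so the usual route with plain distutils is the hard-wired one. A minimal sketch under that assumption; the extension module names and source paths below are illustrative, not taken from the original layout:

from distutils.core import setup, Extension

setup(
    name="foo",
    packages=["foo", "foo.foo1", "foo.foo2"],
    package_dir={"foo": "."},  # setup.py sits inside the 'foo' package root itself
    ext_modules=[
        # Both extensions are declared here, relative to the top-level setup.py,
        # instead of delegating to foo2/setup.py.
        Extension("foo.foo1._foo1", sources=["foo1/foo1.c"]),
        Extension("foo.foo2._foo2", sources=["foo2/foo2.c"]),
    ],
)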
Unsupported operand type(s) for +: 'float' and 'tuple'
Hello all, I'm new to this and I'm having problems summing two values in Python.

I get the following error:

Traceback (most recent call last):
  File "C:\edge-bc (2).py", line 168, in <module>
    if (costGG <= cost + T0):
TypeError: unsupported operand type(s) for +: 'float' and 'tuple'

I'm working with networkx and my program is this one:

import networkx as NX
import pylab as P
from math import exp, log
import random

try:
    import matplotlib.pyplot as plt
except:
    raise

##N=27
N=30
ITERATIONS = 60
T0 = 0.5
RO = 0.99
NK = 20

def costG(G):
    bc = NX.edge_betweenness_centrality(G,normalized=True)
    edges = NX.edges(G)
    for i in range(len(edges)):
        total = 0
        cost = 0
        factor = 1
        liedges = list(edges[i])
        linode1 = list(liedges[0])
        linode2 = list(liedges[1])
        distance = (((linode2[0]-linode1[0])%N)^2)+(((linode2[1]-linode1[1])%N)^2)
        edgecentrality = bc[edges[i]]
        factor = (distance-19790)*(-0.55586)
        cost = distance*edgecentrality*factor
        total = total + cost
    return(total)

def avedistance(G):
    return (AvgDist)

def costGeasy(G):
    bc = NX.edge_betweenness_centrality(G,normalized=True)
    total = 0
    for i in range(len(bc)):
        total=total+bc.values()[i]
    return (total)

G = NX.grid_2d_graph(N,N,True)

for i in range(N):
    for j in range(N):
        G.add_edge((i,j),((i+1) % N ,(j+1) % N))
        G.add_edge((i,j),((i-1) % N ,(j+1) % N))

NODES=NX.number_of_nodes(G)
nod=NX.nodes(G)
EDGES=NX.number_of_edges(G)
edg=NX.edges(G)

pos={}
for i in range(NODES):
    pos[nod[i]]=(nod[i][0]/(N*1.0),nod[i][1]/(N*1.0))

NX.draw(G,pos,node_color='r',node_size=20,with_labels=False,width=1)
plt.title("Inicial graph, Toroidal 27x27, degree 8")
plt.savefig("initial_grid_malla.png")
plt.show()

pos=NX.spring_layout(G,iterations=100)
NX.draw(G,pos,node_color='r',node_size=20,with_labels=False,width=1)
plt.title("Inicial graph, Toroidal 27x27, degree 8")
plt.savefig("initial_grid_tor.png")
plt.show()

initGr = G
best_graph = G
best_cost = costG(G)
average_distance = avedistance(G)
initCost = best_cost
initGHist = NX.degree_histogram(G)

##print NX.info(initGr)
##print 'Diameter = %f ' % (NX.diameter(G))
##print 'Avg Clust Coeff. NetworkX = %-6.4f ' % (NX.average_clustering(G))
##print 'Avg Dist. NetworkX = %-6.4f ' % (average_distance)
##print 'Distribucio de Graus'
##print initGHist
##print 'Cost inicial'
##print initCost

T0 = initCost*0,1

for y in range(NK):
    for x in range(ITERATIONS):
        cost = costG(G)
        if (cost < (best_cost)):
            best_graph = G
            best_cost = cost
        GG = G
        u = random.randint(0,NODES-1)
        while GG.degree(nod[u]) <= 1:
            u = random.randint(0,NODES-1)
        v = random.randint(0,GG.degree(nod[u])-1)
        GG.remove_edge(nod[u],GG[nod[u]].keys()[v])
        a = random.randint(0,NODES-1)
        b = random.randint(0,NODES-1)
        adj=G.adjacency_list()
        while ((nod[b] in adj[a]) or (b == a)):
            a = random.randint(0,NODES-1)
            b = random.randint(0,NODES-1)
        GG.add_edge(nod[a],nod[b])
        while (NX.is_connected(GG) == 0):
            GG = G
            u = random.randint(0,NODES-1)
            while GG.degree(nod[u]) <= 1:
                u = random.randint(0,NODES-1)
            v = random.randint(0,GG.degree(nod[u])-1)
            GG.remove_edge(nod[u],GG[nod[u]].keys()[v])
            a = random.randint(0,NODES-1)
            b = random.randint(0,NODES-1)
            adj=GG.adjacency_list()
            while ((nod[b] in adj[a]) or (b == a)):
                a = random.randint(0,NODES-1)
                b = random.randint(0,NODES-1)
            GG.add_edge(nod[a],nod[b])
        costGG = costG(GG)
        if (costGG <= cost):
            G = GG
        else:
            if (costGG <= cost + T0):
                G = GG
        T0 = T0 * RO
    print 'IT %d' % y
    print 'BEST %f ' % best_cost

best_graph = G
print NX.info(best_graph)
print 'Diameter = %f ' % (NX.diameter(best_graph))
print 'Avg Clust Coeff. NetworkX = %-6.4f ' % (NX.average_clustering(best_graph))
average_distance = avedistance(best_graph)
print 'Avg Dist. NetworkX = %-6.4f ' % (average_distance)
print 'Distribucio de Graus'
print NX.degree_histogram(best_graph)
print 'Millor Cost'
print best_cost

NX.write_edgelist(best_graph,'optimal-graph.dat')

pos={}
for i in range(NODES):
    pos[nod[i]]=(nod[i][0]/(N*1.0),nod[i][1]/(N*1.0))

NX.draw(best_graph,pos,node_color='r',node_size=20,with_labels=False,width=1)
plt.title("Final graph, Toroidal 27x27, degree 8")
plt.savefig("final_grid_malla.png")
plt.show()

pos=NX.spring_layout(G,iterations=100)
NX.draw(best_graph,pos,node_color='r',node_size=20,with_labels=False,width=1)
plt.title("Final graph, Toroidal 27x27, degree 8")
plt.savefig("final_grid_tor.png")
plt.show()
Re: Unsupported operand type(s) for +: 'float' and 'tuple'
On 10 Jun, 13:38, Tim Chase wrote:
> On 06/10/2011 05:30 AM, Francesc Segura wrote:
> > Hello all, I'm new to this and I'm having problems summing two
> > values in Python.
> >
> > I get the following error:
> >
> > Traceback (most recent call last):
> >   File "C:\edge-bc (2).py", line 168, in <module>
> >     if (costGG <= cost + T0):
> > TypeError: unsupported operand type(s) for +: 'float' and 'tuple'
> >
> > I'm working with networkx and my program is this one:
> ...
> > T0 = initCost*0,1
>
> Here, you're setting T0 to the tuple "(initCost*0, 1)" == "(0, 1)".
> I think you mean to use a period instead of a comma.
>
> You then try to add that to a float (cost), and Python doesn't
> like that. I wouldn't either :)
>
> -tkc

Thanks a lot, I am a noob retard!
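For anyone else bitten by this, a tiny illustration of what the stray comma does (the variable names simply mirror the thread):

initCost = 100.0

T0 = initCost*0,1     # parsed as (initCost*0, 1), i.e. the tuple (0.0, 1)
print(T0)             # (0.0, 1)

T0 = initCost*0.1     # what was intended: a plain float
print(T0)             # 10.0

cost = 5.0
# cost + (0.0, 1)  -> TypeError: unsupported operand type(s) for +: 'float' and 'tuple'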
ANN: PyTables 0.9.1 is out
very large XML files, or for creating a centralized repository for system logs, to name only a few possible uses.

What is a table?
----------------
A table is defined as a collection of records whose values are stored in fixed-length fields. All records have the same structure, and all values in each field have the same data type. The terms "fixed-length" and "strict data types" may seem a strange requirement for a language like Python that supports dynamic data types, but they serve a useful purpose when the goal is to save very large quantities of data (such as that generated by many scientific applications, for example) in an efficient manner that reduces demand on CPU time and I/O resources.

What is HDF5?
-------------
For those who know nothing about HDF5, it is a general-purpose library and file format for storing scientific data, developed at NCSA. HDF5 can store two primary objects: datasets and groups. A dataset is essentially a multidimensional array of data elements, and a group is a structure for organizing objects in an HDF5 file. Using these two basic constructs, one can create and store almost any kind of scientific data structure, such as images, arrays of vectors, and structured and unstructured grids. You can also mix and match them in HDF5 files according to your needs.

Platforms
---------
I'm using Linux (Intel 32-bit) as the main development platform, but PyTables should be easy to compile/install on many other UNIX machines. This package has also passed all the tests on an UltraSparc platform with Solaris 7 and Solaris 8. It also compiles and passes all the tests on an SGI Origin2000 with MIPS R12000 processors, with the MIPSPro compiler, running IRIX 6.5. It also runs fine on Linux 64-bit platforms, like AMD Opteron running GNU/Linux 2.4.21 Server, Intel Itanium (IA64) running GNU/Linux 2.4.21, or PowerPC G5 with Linux 2.6.x in 64-bit mode. It has also been tested on Mac OS X platforms (10.2, but it should also work on newer versions). Regarding Windows platforms, PyTables has been tested with Windows 2000 and Windows XP (using the Microsoft Visual C compiler), but it should work with other flavors as well.

Web site
--------
Go to the PyTables web site for more details:
http://pytables.sourceforge.net/

To know more about the company behind the PyTables development, see:
http://www.carabos.com/

Share your experience
---------------------
Let me know of any bugs, suggestions, gripes, kudos, etc. you may have.

Enjoy data!

--
Francesc Altet
Who's your data daddy? PyTables
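As a concrete illustration of "fixed-length fields with strict data types", here is a minimal table sketch. Note that it uses the modern PyTables API, which differs from the 0.9.1 API announced here, and the column names are invented:

import tables as tb

# Each column has a fixed width and a strict type -- that is what makes
# the on-disk table compact and fast to scan.
class Reading(tb.IsDescription):
    sensor = tb.StringCol(16)     # fixed-length string field
    timestamp = tb.Float64Col()   # 64-bit float
    value = tb.Float32Col()       # 32-bit float

with tb.open_file("logs.h5", mode="w") as f:
    table = f.create_table("/", "readings", Reading, title="System log repository")
    row = table.row
    for i in range(1000):
        row["sensor"] = "cpu0"
        row["timestamp"] = 1e9 + i
        row["value"] = i * 0.5
        row.append()
    table.flush()
    # Records come back as fixed-structure rows; queries run in-kernel.
    hot = [r["timestamp"] for r in table.where("value > 400")]
    print(len(hot))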
ANN: bcolz 0.7.0, columnar, chunked and compressed datasets at your fingertips
== Announcing bcolz 0.7.0 ==

What's new
==========

In this release, support for Python 3 has been added, as well as Pandas and HDF5/PyTables conversion, support for different compressors via the latest release of Blosc, and a new `iterblocks()` iterator. Also, intensive benchmarking has led to an important tuning of buffer size parameters, so that compression and evaluation go faster than ever. Together, bcolz and the Blosc compressor are finally fulfilling the promise of accelerating memory I/O, at least for some real scenarios:

http://nbviewer.ipython.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb#Plots

``bcolz`` is a renaming of the ``carray`` project. The new goals for the project are to create simple, yet flexible compressed containers that can live either on-disk or in-memory, with some high-performance iterators (like `iter()` and `where()`) for querying them.

For more detailed info, see the release notes at:
https://github.com/Blosc/bcolz/wiki/Release-Notes

What it is
==========

bcolz provides columnar and compressed data containers. Column storage allows for efficiently querying tables with a large number of columns. It also allows for cheap addition and removal of columns. In addition, bcolz objects are compressed by default to reduce memory/disk I/O needs. The compression process is carried out internally by Blosc, a high-performance compressor that is optimized for binary data.

bcolz can use numexpr internally so as to accelerate many vector and query operations (although it can use pure NumPy for doing so too). numexpr optimizes memory usage and uses several cores for doing the computations, so it is blazing fast. Moreover, the carray/ctable containers can be disk-based, and it is possible to use them for seamlessly performing out-of-memory computations.

bcolz has minimal dependencies (NumPy), comes with an exhaustive test suite, and fully supports both 32-bit and 64-bit platforms. Also, it is typically tested on both UNIX and Windows operating systems.

Installing
==========

bcolz is in the PyPI repository, so installing it is easy:

$ pip install -U bcolz

Resources
=========

Visit the main bcolz site repository at:
http://github.com/Blosc/bcolz

Manual:
http://bcolz.blosc.org

Home of Blosc compressor:
http://blosc.org

User's mail list:
bc...@googlegroups.com
http://groups.google.com/group/bcolz

License is the new BSD:
https://github.com/Blosc/bcolz/blob/master/LICENSES/BCOLZ.txt

**Enjoy data!**

--
Francesc Alted
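A minimal sketch of the carray/ctable containers described above (array sizes and column names are made up; pass a rootdir argument to persist the container on disk):

import numpy as np
import bcolz

N = 1000 * 1000
a = np.arange(N, dtype=np.int64)
b = np.linspace(0.0, 1.0, N)

# Columnar, compressed container; add rootdir="mydata" to make it disk-based.
ct = bcolz.ctable(columns=[a, b], names=["a", "b"])
print(ct.nbytes, "->", ct.cbytes)   # uncompressed vs. compressed size

# The query is evaluated blockwise (through numexpr when it is installed),
# touching only the columns the expression actually needs.
hits = [row.a for row in ct.where("(a > 10) & (b < 0.00002)")]
print(len(hits))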
Re: Populating a dictionary, fast [SOLVED]
On Monday 12 November 2007, Michael Bacarella wrote:
> As for the solution, after trying a half-dozen different integer
> hashing functions and hash table sizes (the brute force approach),
> on a total whim I switched to a model with two dictionary tiers and
> got whole orders of magnitude better performance.
>
> The tiering is, for a given key of type long:
>
> id2name[key >> 40][key & 0x100] = name
>
> Much, much better. A few minutes versus hours this way.
>
> I suspect it could be brought down to seconds with a third level of
> tiers but this is no longer posing the biggest bottleneck... ;)

I don't know exactly why you need a dictionary for keeping the data, but if you want ultra-fast access to values, there is no replacement for keeping a sorted list of keys, a list with the original indices into the values, and the proper list of values. Then, to access a value, you only have to do a binary search on the sorted list, another lookup in the original-indices list, and then go straight to the value in the values list. This is the fastest approach I can think of.

Another possibility is using an indexed column in a table in a DB. Lookups there should be much faster than using a dictionary as well.

HTH,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"
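A pure-Python sketch of the scheme just described, using the bisect module (the keys and values are made up):

import bisect

# Original, unsorted data
keys = [9123, 11, 70545, 302, 5000]
values = ["nine-k", "eleven", "seventy-k", "three-oh-two", "five-k"]

# Build the lookup structures once
order = sorted(range(len(keys)), key=keys.__getitem__)  # original indices, in key order
sorted_keys = [keys[i] for i in order]

def lookup(key):
    # Binary search on the sorted keys ...
    pos = bisect.bisect_left(sorted_keys, key)
    if pos == len(sorted_keys) or sorted_keys[pos] != key:
        raise KeyError(key)
    # ... then hop through the index list straight to the value.
    return values[order[pos]]

print(lookup(302))   # -> 'three-oh-two'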
Re: Populating a dictionary, fast [SOLVED]
On Tuesday 13 November 2007, Steven D'Aprano wrote:
> On Tue, 13 Nov 2007 17:27:11 +0100, Francesc Altet wrote:
> > I don't know exactly why you need a dictionary for keeping the
> > data, but if you want ultra-fast access to values, there is no
> > replacement for keeping a sorted list of keys and a list with the
> > original indices to values, and the proper list of values. Then,
> > to access a value, you only have to do a binary search on the
> > sorted list, another lookup in the original indices list and then
> > go straight to the value in the value list. This is the
> > fastest approach I can think of.
>
> Maybe on Bizarro World, but not on Planet Earth.
>
> Searching a dict takes on average a constant amount of time,
> regardless of the key and the size of the dict. Because dicts are the
> basis of Python's namespaces, they are optimized up the wazoo, and being
> implemented in C they are extremely fast.
>
> Using your suggestion, to get a value you have to do a binary search,
> which is O(log N) instead of O(1), then ANOTHER lookup into a list of
> indices, and finally a THIRD lookup to actually retrieve the value --
> all implemented in relatively slow Python instead of fast C.

Well, the bisect module in Python is implemented in fast C. Apart from that, you are basically right; see below.

> And let's not even discuss the overhead of keeping these three lists
> synchronized.
>
> But don't take my word for it: measure for yourself. I'm not even
> going to attempt to code your suggestion, but here's how you measure
> the time it takes for dictionary lookup.
>
> # Create a dataset to search on.
> D = {}
> chars = "abcdefghijklmnopqrstuvwxyz"
> triplets = (a+b+c for a in chars for b in chars for c in chars)
> for i, s in enumerate(triplets):
>     D[s] = i  # D is of the form {'abc': 12345}
>
> # Now time a successful search for an arbitrary triplet.
> import timeit
> min(timeit.Timer("D['foo']", "from __main__ import D").repeat(6))
>
> On my PC, it takes about 0.2 seconds per million searches.

Oh yes. All of you are right, guys. I've implemented code (attached) for measuring the lookup times using a dictionary and using a binary search approach (with numpy, mostly for space efficiency), and here are my results:

$ python2.5 gq-dict.py
Creating the dictionary...
Time for dict creation: 89.115
Items in dict: 8191180
Timing queries...
Time for querying: 39.44

$ python2.5 gq-binary-search.py
Creating the lists...
Time for lists creation: 4.245
Sorting...
Time for sorting: 6.195
Timing queries...
Time for querying: 108.1

i.e. the dict approach proves to be almost 3x faster than a regular binary search (5 us/lookup vs 13 us/lookup).

I think I got mixed up with some benchmarks that I did on this subject some time ago; either my memory is bad or I made some mistake in those experiments. My apologies.

Cheers,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"

[Attachment: gq-binary-search.py]
[Attachment: gq-dict.py]
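The attached scripts are not reproduced here; the following is only a rough sketch of the same comparison, with numpy's searchsorted standing in for the binary-search side (sizes, key ranges and loop counts are arbitrary, not those of the original benchmark):

import time
import numpy as np

N = 8 * 1000 * 1000
keys = np.unique(np.random.randint(0, 2**40, size=N, dtype=np.int64))
np.random.shuffle(keys)
values = np.arange(len(keys))

# Dictionary lookup: plain Python ints mapped to values.
d = dict(zip(keys.tolist(), values.tolist()))

# Binary-search lookup: sorted keys plus an index array back into values.
order = np.argsort(keys)
sorted_keys = keys[order]

probe = int(keys[len(keys) // 2])

t0 = time.time()
for _ in range(100000):
    d[probe]
t_dict = time.time() - t0

t0 = time.time()
for _ in range(100000):
    pos = np.searchsorted(sorted_keys, probe)
    values[order[pos]]
t_binary = time.time() - t0

print("dict: %.3f s, binary search: %.3f s" % (t_dict, t_binary))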
Re: Populating a dictionary, fast [SOLVED]
On Wednesday 14 November 2007, Istvan Albert wrote:
> On Nov 13, 11:27 am, Francesc Altet <[EMAIL PROTECTED]> wrote:
> > Another possibility is using an indexed column in a table in a DB.
> > Lookups there should be much faster than using a dictionary as
> > well.
>
> I would agree with others who state that for a simple key-based
> lookup nothing beats a dictionary.
>
> But knowing that you are one of the authors of the excellent pytables
> module, I think you are approaching this problem from the perspective
> of reading in a large number of values on each access. In those cases
> specialized libraries (like HDF) can directly load into memory
> huge amounts of contiguous data at speeds that substantially
> outperform all other approaches.

Yes, that could be. However, I was well aware that the problem was looking up one single value, so, mea culpa :( Sorry for introducing noise.

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"
Re: Populating a dictionary, fast [SOLVED]
On Wednesday 14 November 2007, Francesc Altet wrote:
> I think I got mixed up with some benchmarks that I did on this
> subject some time ago; either my memory is bad or I made some
> mistake in those experiments. My apologies.

Just for the record. I was unable to stop thinking about this, and after some investigation, I guess that my recollections came from some benchmarks with code written in Pyrex (i.e. pretty near to C speed).

In the attached benchmark, I compare the speed of dictionary lookups in a big dictionary (more than 8 million entries) against doing the same, but using a binary search approach. Here are the results:

$ python2.5 gq-binary-search.py
Creating the dictionary...
Time for dict creation: 13.001
Calibrating loop...
Calibrating time: 3.095
Timing queries with a dict...
Time for dict query: 7.747
Timing queries with binary search...
Time for binary queries: 3.551

So, a dictionary lookup takes around 950 ns, while a binary search lookup takes around 380 ns, i.e. binary searches are more than 2x faster than dictionary lookups, at least in this benchmark. It might well be that the difference is due to the fact that the binary lookup in this case is done only for 64-bit integers, while the dictionary code has to check the type of the index data first. OTOH, it is also apparent that the binary code could be optimized for cache misses by using two levels of indirection for the binary lookups, but that is too much work for today ;).

At any rate, it seems rather clear that binary lookups can be competitive with hashes, when using Pyrex at least!

Cheers,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"

[Attachment: gq-binary-search.py]
[Attachment: setup.py]

# BinarySearch class to demonstrate that binary search is competitive
# against dictionaries based on hashes

# In order to access the data area in NumPy objects
cdef extern from "numpy/arrayobject.h":
    ctypedef extern class numpy.ndarray [object PyArrayObject]:
        cdef char *data
    void import_array()
import_array()

cdef long bisect_left(long long *a, long long x, long hi):
    cdef long lo, mid
    lo = 0
    if x <= a[0]:
        return 0
    if a[-1] < x:
        return hi
    while lo < hi:
        mid = (lo+hi)/2
        if a[mid] < x:
            lo = mid+1
        else:
            hi = mid
    return lo

cdef class BinarySearch:
    cdef long len
    cdef ndarray str2
    cdef long *revidd
    cdef long long *sidd

    def __new__(self, ndarray sid, ndarray revid, ndarray str2):
        self.str2 = str2
        self.revidd = <long *>revid.data
        self.sidd = <long long *>sid.data
        self.len = len(str2) - 1

    def __getitem__(self, x):
        cdef long pos, idx
        pos = bisect_left(self.sidd, x, self.len)
        idx = self.revidd[pos]
        return self.str2[idx]
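The driver for this class is not shown; presumably the compiled extension is used along these lines (the module name gq_binary_search is hypothetical, and the argument order follows the __new__ signature above):

import numpy

# 64-bit integer keys and the values we want to retrieve for them
keys = numpy.array([900, 12, 55, 7, 431], dtype=numpy.int64)
values = numpy.array(["v900", "v12", "v55", "v7", "v431"])

revid = keys.argsort()   # original positions, in sorted-key order
sid = keys[revid]        # the sorted keys themselves

# BinarySearch is the compiled Pyrex extension class shown above
from gq_binary_search import BinarySearch   # hypothetical module name
bs = BinarySearch(sid, revid, values)
print(bs[55])   # -> 'v55'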
Dictionary vs binary search lookups [was: Populating a dictionary, fast [SOLVED]]
On Tuesday 20 November 2007, Istvan Albert wrote:
> On Nov 19, 2:33 pm, Francesc Altet <[EMAIL PROTECTED]> wrote:
> > Just for the record. I was unable to stop thinking about this, and
> > after some investigation, I guess that my recollections came from
> > some benchmarks with code written in Pyrex (i.e. pretty near to C
> > speed).
>
> Pretty interesting and quite unexpected.

Yeah, and also bogus :( It turned out that in my previous benchmark I used plain numpy int arrays for the measurements, and when you extract an element out of a numpy array what you obtain is a *numpy scalar*, which is a different beast than a Python (long) integer. Unfortunately, Pyrex swallows it and happily converts it to a C long long, but produces a wrong result. This looks like a Pyrex fault, and I should report it to the Pyrex list.

At any rate, with the corrected benchmarks using plain Python lists instead (attached), the numbers are:

Calibrating loop...
Calibrating time: 1.458
Creating the dictionary...
Time for dict creation: 15.305
Timing queries with a dict...
Time for dict query: 11.632
Creating BinarySearch object...
Time for BinarySearch creation: 8.648
Timing queries with a BinarySearch...
Time for BinarySearch queries: 19.393

That is, dictionaries do lookups 1.7x faster than a binary lookup (1.42 us/lookup vs 2.37 us/lookup), which is, I suppose, what one would expect.

Well, this seems to be the end of my 'peculiar' theory, but at least it served to uncover what seems to be a bug in Pyrex :P

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"

# BinarySearch class to demonstrate that binary search is competitive
# against dictionaries based on hashes

import numpy

# In order to access the data area in NumPy objects
cdef extern from "numpy/arrayobject.h":
    ctypedef extern class numpy.ndarray [object PyArrayObject]:
        cdef char *data
    void import_array()
import_array()

cdef long bisect_left(long long *a, long long x, long lo, long hi):
    cdef long mid
    while lo < hi:
        mid = (lo+hi)/2
        if a[mid] < x:
            lo = mid+1
        else:
            hi = mid
    return lo

cdef class BinarySearch:
    cdef long len
    cdef object values
    cdef long *revidd
    cdef long long *sidd
    cdef ndarray sid, revid

    def __new__(self, object keys, object values):
        self.sid = numpy.array(keys)
        self.revid = self.sid.argsort()
        self.revidd = <long *>self.revid.data
        self.sid.sort()
        self.sidd = <long long *>self.sid.data
        self.values = values
        self.len = len(values) - 1

    def __getitem__(self, x):
        cdef long pos, idx
        pos = bisect_left(self.sidd, x, 0, self.len)
        idx = self.revidd[pos]
        return self.values[idx]

[Attachment: gq-binary-search.py]
[Attachment: setup.py]
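To see the numpy-scalar vs. Python-int distinction that caused the bogus results, a quick illustration (not part of the attached benchmark):

import numpy as np

a = np.arange(5, dtype=np.int64)

x = a[2]            # a numpy scalar (numpy.int64), not a plain Python int
y = list(a)[2]      # still numpy.int64: iterating an array yields numpy scalars
z = a.tolist()[2]   # a plain Python int

print(type(x), type(y), type(z))
# <class 'numpy.int64'> <class 'numpy.int64'> <class 'int'>

# A plain Python list of ints, as used in the corrected benchmark,
# contains genuine Python integers from the start:
keys = [1, 2, 3]
print(type(keys[0]))   # <class 'int'>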