Re: Python good for data mining?
[I've just seen this thread. Although it might be a bit late, let me add a couple of clarifications.]

On 6 Nov, 03:09, "D.Hering" <[EMAIL PROTECTED]> wrote:
> On Nov 5, 10:29 am, Maarten <[EMAIL PROTECTED]> wrote:
> > As for pytables: it is the most elegant programming interface for HDF
> > on any platform that I've encountered so far. Most other platforms
> > stay close to the HDF5 library C interface, which is low-level and quite
> > complex. PyTables was written with the end-user in mind, and it shows.
> > One correction though: PyTables is not a database: it is a storage for
> > (large) arrays, datablocks that you don't want in a database. Use a
> > database for the metadata to find the right file and field within that
> > file. Keep in mind though that I mostly work with externally created
> > HDF5 files, not with files created in pytables. PyTables Pro has an
> > indexing feature which may be helpful for data mining (if you write the
> > HDF5 files from Python).
> >
> > Maarten
>
> Hi Maarten,
>
> I respectfully disagree that HDF5 is not a DB. It's true that HDF5 on
> its face is not relational but rather hierarchical.

Yeah. This largely depends on what we understand by a DB. Lately, RDBMSs are used everywhere, and we tend to believe that they are the only entities that can truly be called DBs. However, in a less restrictive view, even a text file can be considered a DB (in fact, many DBs have been implemented using text files as a base). So, I wouldn't say that HDF5 is not a DB, but just that it is not an RDBMS ;)

> Hierarchical is truly a much more natural/elegant [1] design from my
> perspective. HDF has always had metadata capabilities, and with the
> new 1.8 beta version available, it is increasing its abilities with
> 'references/links', allowing for pure/partial relational datasets,
> groups, and files, as well as storing self-implemented indexing.
>
> The C API is obviously much more low-level, and PyTables does not yet
> support these new features.

That's correct. And although it is quite possible that we, the PyTables developers, will end up incorporating some relational features into it, we also recognize that PyTables does not intend (and was never in our plans) to be a competitor to a pure RDBMS, but rather a *teammate* (as is clearly stated on the www.pytables.org home page). In our opinion, PyTables opens the door to a series of capabilities that are not found in a typical RDBMS, like hierarchical classification, multidimensional datasets, powerful table entities that are able to deal with multidimensional columns or nested records, but most especially, the ability to work with extremely large amounts of data in a very easy way, without having to give up first-class speed.

> [1] Anything/everything that is physical/virtual, or can be conceived,
> is hierarchical... if the system itself is not random/chaotic. That's a
> lovely revelation I've had... EVERYTHING is hierarchical. If it has
> context it has hierarchy.

While I agree that this sentence holds some truth, it is also known that a lot of things (perhaps many more than we think) in the universe fall squarely into the domain of the random/chaotic ;) IMO, the wisest path is to recognize the strengths (and weaknesses) of each approach and use whatever fits your needs best. If you need the best of both, then go ahead and choose an RDBMS in combination with a hierarchical DB, and use the powerful capabilities of Python to get the most out of them.

Cheers,

Francesc Altet
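To make the "hierarchical DB" idea concrete, here is a minimal sketch using today's PyTables API; the file, group, and attribute names are made up for illustration and are not from the thread. Groups act like directories, leaves hold the data, and metadata travels with the node itself:

import numpy as np
import tables as tb

# Write: a group per run, an array leaf inside it, metadata as node attributes.
with tb.open_file("experiment.h5", mode="w") as f:
    run = f.create_group("/", "run001", title="First acquisition run")
    signal = f.create_array(run, "signal", np.random.rand(1000), title="Raw signal")
    # Metadata lives on the node itself; no separate relational table is needed.
    signal.attrs.sensor_id = 42
    signal.attrs.units = "mV"

# Read back by path, just like navigating a directory tree.
with tb.open_file("experiment.h5", mode="r") as f:
    node = f.get_node("/run001/signal")
    print(node.attrs.sensor_id, node.read()[:5])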
Re: dynamic allocation file buffer
On 12 Sep, 14:39, "Aaron \"Castironpi\" Brady" <[EMAIL PROTECTED]> wrote:
> > A consideration of other storage formats such as HDF5 might
> > be appropriate:
> >
> > http://hdf.ncsa.uiuc.edu/HDF5/whatishdf5.html
> >
> > There are, of course, HDF5 tools available for Python.
> > PyTables came up within the past few weeks on the list.
>
> "When the file is created, the metadata in the object tree is updated
> in memory while the actual data is saved to disk. When you close the
> file the object tree is no longer available. However, when you reopen
> this file the object tree will be reconstructed in memory from the
> metadata on disk"
>
> This is different from what I had in mind, but the extremity depends
> on how slow the 'reconstructed in memory' step is.
> (From http://www.pytables.org/docs/manual/ch01.html#id2506782). The
> counterexample would be needing random access into multiple data
> files, which don't all fit in memory at once, but the maturity of the
> package might outweigh that. Reconstruction will form a bottleneck
> anyway.

Hmm, this was part of the documentation that needed updating. Nowadays, the object tree is reconstructed in a lazy way (i.e. on demand), precisely in order to avoid the bottleneck you mention. I have corrected the docs here:

http://www.pytables.org/trac/changeset/3714/trunk

Thanks for (indirectly ;-) bringing this to my attention,

Francesc
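For readers wondering what "lazy" means in practice, a small sketch (the file and node names are assumptions, not from the thread): opening the file only reads the root metadata, and a node's own metadata is loaded the first time that node is visited.

import tables as tb

# Opening a PyTables file reads only the root metadata; the rest of the
# object tree is built lazily, as nodes are first visited.
f = tb.open_file("experiment.h5", mode="r")   # cheap, even for files with many nodes
node = f.get_node("/run001/signal")           # this node's metadata is loaded now
print(node[:10])                              # the actual data read happens here
f.close()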
Invoking setup.py in sub-packages
Hi,

I'd like to set up a package that is made of other sub-packages, modules and other extensions. What I have is something like this (very simplified indeed):

/
    __init__.py
    setup.py
    foo1/
        __init__.py
        foo1.c
        [...]
    foo2/
        setup.py
        __init__.py
        foo2.c
        [...]

Now, I'd like to make a setup.py for package 'foo' that is able to compile the 'foo1' extension directly, but invoke 'foo2/setup.py' in order to generate the 'foo2' sub-package (which contains an extension itself).

I'm currently trying:

setup(name = "foo",
      ...
      ext_modules = [ Extension( "foo.foo1", sources=["foo1.c"] ) ],
      packages = ["foo.foo1", "foo.foo2"],
      ...
      )

Of course, the foo2 package is not built at all. Is there a way to get 'foo2/setup.py' invoked automagically so that it generates the desired extensions? Or do all the instructions to generate the extensions in sub-packages have to be hard-wired into the main 'foo/setup.py'?

Thanks!
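For what it's worth, plain distutils has no built-in way to delegate to a nested setup.py (numpy.distutils offers a sub-package mechanism, if that is an option), so the usual route with plain distutils is the hard-wired one. A minimal sketch under that assumption; the extension module names and source paths below are illustrative, not taken from the original layout:

from distutils.core import setup, Extension

setup(
    name="foo",
    packages=["foo", "foo.foo1", "foo.foo2"],
    package_dir={"foo": "."},  # setup.py sits inside the 'foo' package root itself
    ext_modules=[
        # Both extensions are declared here, relative to the top-level setup.py,
        # instead of delegating to foo2/setup.py.
        Extension("foo.foo1._foo1", sources=["foo1/foo1.c"]),
        Extension("foo.foo2._foo2", sources=["foo2/foo2.c"]),
    ],
)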
Unsupported operand type(s) for +: 'float' and 'tuple'
Hello all, I'm new to this and I'm having problems summing two values in Python.

I get the following error:

Traceback (most recent call last):
  File "C:\edge-bc (2).py", line 168, in <module>
    if (costGG <= cost + T0):
TypeError: unsupported operand type(s) for +: 'float' and 'tuple'

I'm working with networkx and my program is this one:

import networkx as NX
import pylab as P
from math import exp, log
import random

try:
    import matplotlib.pyplot as plt
except:
    raise

##N=27
N=30
ITERATIONS = 60
T0 = 0.5
RO = 0.99
NK = 20

def costG(G):
    bc = NX.edge_betweenness_centrality(G,normalized=True)
    edges = NX.edges(G)
    for i in range(len(edges)):
        total = 0
        cost = 0
        factor = 1
        liedges = list(edges[i])
        linode1 = list(liedges[0])
        linode2 = list(liedges[1])
        distance = (((linode2[0]-linode1[0])%N)^2)+(((linode2[1]-linode1[1])%N)^2)
        edgecentrality = bc[edges[i]]
        factor = (distance-19790)*(-0.55586)
        cost = distance*edgecentrality*factor
        total = total + cost
    return(total)

def avedistance(G):
    return (AvgDist)

def costGeasy(G):
    bc = NX.edge_betweenness_centrality(G,normalized=True)
    total = 0
    for i in range(len(bc)):
        total=total+bc.values()[i]
    return (total)

G = NX.grid_2d_graph(N,N,True)

for i in range(N):
    for j in range(N):
        G.add_edge((i,j),((i+1) % N ,(j+1) % N))
        G.add_edge((i,j),((i-1) % N ,(j+1) % N))

NODES=NX.number_of_nodes(G)
nod=NX.nodes(G)
EDGES=NX.number_of_edges(G)
edg=NX.edges(G)

pos={}
for i in range(NODES):
    pos[nod[i]]=(nod[i][0]/(N*1.0),nod[i][1]/(N*1.0))

NX.draw(G,pos,node_color='r',node_size=20,with_labels=False,width=1)
plt.title("Inicial graph, Toroidal 27x27, degree 8")
plt.savefig("initial_grid_malla.png")
plt.show()

pos=NX.spring_layout(G,iterations=100)
NX.draw(G,pos,node_color='r',node_size=20,with_labels=False,width=1)
plt.title("Inicial graph, Toroidal 27x27, degree 8")
plt.savefig("initial_grid_tor.png")
plt.show()

initGr = G
best_graph = G
best_cost = costG(G)
average_distance = avedistance(G)
initCost = best_cost
initGHist = NX.degree_histogram(G)

##print NX.info(initGr)
##print 'Diameter = %f ' % (NX.diameter(G))
##print 'Avg Clust Coeff. NetworkX = %-6.4f ' % (NX.average_clustering(G))
##print 'Avg Dist. NetworkX = %-6.4f ' % (average_distance)
##print 'Distribucio de Graus'
##print initGHist
##print 'Cost inicial'
##print initCost

T0 = initCost*0,1

for y in range(NK):
    for x in range(ITERATIONS):
        cost = costG(G)
        if (cost < (best_cost)):
            best_graph = G
            best_cost = cost
        GG = G
        u = random.randint(0,NODES-1)
        while GG.degree(nod[u]) <= 1:
            u = random.randint(0,NODES-1)
        v = random.randint(0,GG.degree(nod[u])-1)
        GG.remove_edge(nod[u],GG[nod[u]].keys()[v])
        a = random.randint(0,NODES-1)
        b = random.randint(0,NODES-1)
        adj=G.adjacency_list()
        while ((nod[b] in adj[a]) or (b == a)):
            a = random.randint(0,NODES-1)
            b = random.randint(0,NODES-1)
        GG.add_edge(nod[a],nod[b])
        while (NX.is_connected(GG) == 0):
            GG = G
            u = random.randint(0,NODES-1)
            while GG.degree(nod[u]) <= 1:
                u = random.randint(0,NODES-1)
            v = random.randint(0,GG.degree(nod[u])-1)
            GG.remove_edge(nod[u],GG[nod[u]].keys()[v])
            a = random.randint(0,NODES-1)
            b = random.randint(0,NODES-1)
            adj=GG.adjacency_list()
            while ((nod[b] in adj[a]) or (b == a)):
                a = random.randint(0,NODES-1)
                b = random.randint(0,NODES-1)
            GG.add_edge(nod[a],nod[b])
        costGG = costG(GG)
        if (costGG <= cost):
            G = GG
        else:
            if (costGG <= cost + T0):
                G = GG
        T0 = T0 * RO
    print 'IT %d' % y
    print 'BEST %f ' % best_cost

best_graph = G
print NX.info(best_graph)
print 'Diameter = %f ' % (NX.diameter(best_graph))
print 'Avg Clust Coeff. NetworkX = %-6.4f ' % (NX.average_clustering(best_graph))
average_distance = avedistance(best_graph)
print 'Avg Dist. NetworkX = %-6.4f ' % (average_distance)
print 'Distribucio de Graus'
print NX.degree_histogram(best_graph)
print 'Millor Cost'
print best_cost

NX.write_edgelist(best_graph,'optimal-graph.dat')

pos={}
for i in range(NODES):
    pos[nod[i]]=(nod[i][0]/(N*1.0),nod[i][1]/(N*1.0))

NX.draw(best_graph,pos,node_color='r',node_size=20,with_labels=False,width=1)
plt.title("Final graph, Toroidal 27x27, degree 8")
plt.savefig("final_grid_malla.png")
plt.show()

pos=NX.spring_layout(G,iterations=100)
NX.draw(best_graph,pos,node_color='r',node_size=20,with_labels=False,width=1)
plt.title("Final graph, Toroidal 27x27, degree 8")
plt.savefig("final_grid_tor.png")
plt.show()
Re: Unsupported operand type(s) for +: 'float' and 'tuple'
On 10 Jun, 13:38, Tim Chase wrote:
> On 06/10/2011 05:30 AM, Francesc Segura wrote:
> > Hello all, I'm new to this and I'm having problems summing two
> > values in Python.
> >
> > I get the following error:
> >
> > Traceback (most recent call last):
> >   File "C:\edge-bc (2).py", line 168, in <module>
> >     if (costGG <= cost + T0):
> > TypeError: unsupported operand type(s) for +: 'float' and 'tuple'
> >
> > I'm working with networkx and my program is this one:
> ...
> > T0 = initCost*0,1
>
> Here, you're setting T0 to the tuple "(initCost*0, 1)" == "(0, 1)".
> I think you mean to use a period instead of a comma.
>
> You then try to add that to a float (cost), and Python doesn't
> like that. I wouldn't either :)
>
> -tkc

Thanks a lot, I am a noob retard!
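For anyone else bitten by this, a tiny illustration of what the stray comma does (the variable names simply mirror the thread):

initCost = 100.0

T0 = initCost*0,1     # parsed as (initCost*0, 1), i.e. the tuple (0.0, 1)
print(T0)             # (0.0, 1)

T0 = initCost*0.1     # what was intended: a plain float
print(T0)             # 10.0

cost = 5.0
# cost + (0.0, 1)  -> TypeError: unsupported operand type(s) for +: 'float' and 'tuple'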
ANN: PyTables 0.9.1 is out
very large XML files, or for creating a centralized repository for system logs, to name only a few possible uses.

What is a table?
----------------
A table is defined as a collection of records whose values are stored in fixed-length fields. All records have the same structure, and all values in each field have the same data type. The terms "fixed-length" and "strict data types" may seem a strange requirement for a language like Python that supports dynamic data types, but they serve a useful purpose when the goal is to save very large quantities of data (such as that generated by many scientific applications, for example) in an efficient manner that reduces demand on CPU time and I/O resources.

What is HDF5?
-------------
For those who know nothing about HDF5, it is a general-purpose library and file format for storing scientific data, developed at NCSA. HDF5 can store two primary objects: datasets and groups. A dataset is essentially a multidimensional array of data elements, and a group is a structure for organizing objects in an HDF5 file. Using these two basic constructs, one can create and store almost any kind of scientific data structure, such as images, arrays of vectors, and structured and unstructured grids. You can also mix and match them in HDF5 files according to your needs.

Platforms
---------
I'm using Linux (Intel 32-bit) as the main development platform, but PyTables should be easy to compile/install on many other UNIX machines. This package has also passed all the tests on an UltraSparc platform with Solaris 7 and Solaris 8. It also compiles and passes all the tests on an SGI Origin2000 with MIPS R12000 processors, with the MIPSPro compiler, running IRIX 6.5. It also runs fine on Linux 64-bit platforms, like AMD Opteron running GNU/Linux 2.4.21 Server, Intel Itanium (IA64) running GNU/Linux 2.4.21, or PowerPC G5 with Linux 2.6.x in 64-bit mode. It has also been tested on Mac OS X platforms (10.2, but it should also work on newer versions). Regarding Windows platforms, PyTables has been tested with Windows 2000 and Windows XP (using the Microsoft Visual C compiler), but it should work with other flavors as well.

Web site
--------
Go to the PyTables web site for more details:
http://pytables.sourceforge.net/

To know more about the company behind the PyTables development, see:
http://www.carabos.com/

Share your experience
---------------------
Let me know of any bugs, suggestions, gripes, kudos, etc. you may have.

Enjoy data!

--
Francesc Altet
Who's your data daddy? PyTables
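As a concrete illustration of "fixed-length fields with strict data types", here is a minimal table sketch. Note that it uses the modern PyTables API, which differs from the 0.9.1 API announced here, and the column names are invented:

import tables as tb

# Each column has a fixed width and a strict type -- that is what makes
# the on-disk table compact and fast to scan.
class Reading(tb.IsDescription):
    sensor = tb.StringCol(16)     # fixed-length string field
    timestamp = tb.Float64Col()   # 64-bit float
    value = tb.Float32Col()       # 32-bit float

with tb.open_file("logs.h5", mode="w") as f:
    table = f.create_table("/", "readings", Reading, title="System log repository")
    row = table.row
    for i in range(1000):
        row["sensor"] = "cpu0"
        row["timestamp"] = 1e9 + i
        row["value"] = i * 0.5
        row.append()
    table.flush()
    # Records come back as fixed-structure rows; queries run in-kernel.
    hot = [r["timestamp"] for r in table.where("value > 400")]
    print(len(hot))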
ANN: bcolz 0.7.0, columnar, chunked and compressed datasets at your fingertips
== Announcing bcolz 0.7.0 ==

What's new
==========

In this release, support for Python 3 has been added, as well as Pandas and HDF5/PyTables conversion, support for different compressors via the latest release of Blosc, and a new `iterblocks()` iterator. Also, intensive benchmarking has led to an important tuning of buffer size parameters, so that compression and evaluation go faster than ever. Together, bcolz and the Blosc compressor are finally fulfilling the promise of accelerating memory I/O, at least for some real scenarios:

http://nbviewer.ipython.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb#Plots

``bcolz`` is a renaming of the ``carray`` project. The new goals for the project are to create simple, yet flexible compressed containers that can live either on-disk or in-memory, with some high-performance iterators (like `iter()` and `where()`) for querying them.

For more detailed info, see the release notes at:
https://github.com/Blosc/bcolz/wiki/Release-Notes

What it is
==========

bcolz provides columnar and compressed data containers. Column storage allows for efficiently querying tables with a large number of columns. It also allows for cheap addition and removal of columns. In addition, bcolz objects are compressed by default to reduce memory/disk I/O needs. The compression process is carried out internally by Blosc, a high-performance compressor that is optimized for binary data.

bcolz can use numexpr internally so as to accelerate many vector and query operations (although it can use pure NumPy for doing so too). numexpr optimizes memory usage and uses several cores for doing the computations, so it is blazing fast. Moreover, the carray/ctable containers can be disk-based, and it is possible to use them for seamlessly performing out-of-memory computations.

bcolz has minimal dependencies (NumPy), comes with an exhaustive test suite, and fully supports both 32-bit and 64-bit platforms. Also, it is typically tested on both UNIX and Windows operating systems.

Installing
==========

bcolz is in the PyPI repository, so installing it is easy:

$ pip install -U bcolz

Resources
=========

Visit the main bcolz site repository at:
http://github.com/Blosc/bcolz

Manual:
http://bcolz.blosc.org

Home of Blosc compressor:
http://blosc.org

User's mail list:
bc...@googlegroups.com
http://groups.google.com/group/bcolz

License is the new BSD:
https://github.com/Blosc/bcolz/blob/master/LICENSES/BCOLZ.txt

**Enjoy data!**

--
Francesc Alted
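A minimal sketch of the carray/ctable containers described above (array sizes and column names are made up; pass a rootdir argument to persist the container on disk):

import numpy as np
import bcolz

N = 1000 * 1000
a = np.arange(N, dtype=np.int64)
b = np.linspace(0.0, 1.0, N)

# Columnar, compressed container; add rootdir="mydata" to make it disk-based.
ct = bcolz.ctable(columns=[a, b], names=["a", "b"])
print(ct.nbytes, "->", ct.cbytes)   # uncompressed vs. compressed size

# The query is evaluated blockwise (through numexpr when it is installed),
# touching only the columns the expression actually needs.
hits = [row.a for row in ct.where("(a > 10) & (b < 0.00002)")]
print(len(hits))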
Re: Populating a dictionary, fast [SOLVED]
On Monday 12 November 2007, Michael Bacarella wrote:
> As for the solution, after trying a half-dozen different integer
> hashing functions and hash table sizes (the brute force approach),
> on a total whim I switched to a model with two dictionary tiers and
> got whole orders of magnitude better performance.
>
> The tiering is, for a given key of type long:
>
> id2name[key >> 40][key & 0x100] = name
>
> Much, much better. A few minutes versus hours this way.
>
> I suspect it could be brought down to seconds with a third level of
> tiers but this is no longer posing the biggest bottleneck... ;)

I don't know exactly why you need a dictionary for keeping the data, but if you want ultra-fast access to values, there is no replacement for keeping a sorted list of keys, a list with the original indices into the values, and the proper list of values. Then, to access a value, you only have to do a binary search on the sorted list, another lookup in the original-indices list, and then go straight to the value in the values list. This is the fastest approach I can think of.

Another possibility is using an indexed column in a table in a DB. Lookups there should be much faster than using a dictionary as well.

HTH,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"
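A pure-Python sketch of the scheme just described, using the bisect module (the keys and values are made up):

import bisect

# Original, unsorted data
keys = [9123, 11, 70545, 302, 5000]
values = ["nine-k", "eleven", "seventy-k", "three-oh-two", "five-k"]

# Build the lookup structures once
order = sorted(range(len(keys)), key=keys.__getitem__)  # original indices, in key order
sorted_keys = [keys[i] for i in order]

def lookup(key):
    # Binary search on the sorted keys ...
    pos = bisect.bisect_left(sorted_keys, key)
    if pos == len(sorted_keys) or sorted_keys[pos] != key:
        raise KeyError(key)
    # ... then hop through the index list straight to the value.
    return values[order[pos]]

print(lookup(302))   # -> 'three-oh-two'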
Re: Populating a dictionary, fast [SOLVED]
On Tuesday 13 November 2007, Steven D'Aprano wrote:
> On Tue, 13 Nov 2007 17:27:11 +0100, Francesc Altet wrote:
> > I don't know exactly why you need a dictionary for keeping the
> > data, but if you want ultra-fast access to values, there is no
> > replacement for keeping a sorted list of keys and a list with the
> > original indices to values, and the proper list of values. Then,
> > to access a value, you only have to do a binary search on the
> > sorted list, another lookup in the original indices list and then
> > go straight to the value in the value list. This is the
> > fastest approach I can think of.
>
> Maybe on Bizarro World, but not on Planet Earth.
>
> Searching a dict takes on average a constant amount of time,
> regardless of the key and the size of the dict. Because dicts are the
> basis of Python's namespaces, they are optimized up the wazoo, and being
> implemented in C they are extremely fast.
>
> Using your suggestion, to get a value you have to do a binary search,
> which is O(log N) instead of O(1), then ANOTHER lookup into a list of
> indices, and finally a THIRD lookup to actually retrieve the value --
> all implemented in relatively slow Python instead of fast C.

Well, the bisect module in Python is implemented in fast C. Apart from that, you are basically right; see below.

> And let's not even discuss the overhead of keeping these three lists
> synchronized.
>
> But don't take my word for it: measure for yourself. I'm not even
> going to attempt to code your suggestion, but here's how you measure
> the time it takes for dictionary lookup.
>
> # Create a dataset to search on.
> D = {}
> chars = "abcdefghijklmnopqrstuvwxyz"
> triplets = (a+b+c for a in chars for b in chars for c in chars)
> for i, s in enumerate(triplets):
>     D[s] = i  # D is of the form {'abc': 12345}
>
> # Now time a successful search for an arbitrary triplet.
> import timeit
> min(timeit.Timer("D['foo']", "from __main__ import D").repeat(6))
>
> On my PC, it takes about 0.2 seconds per million searches.

Oh yes. All of you are right, guys. I've implemented code (attached) for measuring the lookup times using a dictionary and using a binary search approach (with numpy, mostly for space efficiency), and here are my results:

$ python2.5 gq-dict.py
Creating the dictionary...
Time for dict creation: 89.115
Items in dict: 8191180
Timing queries...
Time for querying: 39.44

$ python2.5 gq-binary-search.py
Creating the lists...
Time for lists creation: 4.245
Sorting...
Time for sorting: 6.195
Timing queries...
Time for querying: 108.1

i.e. the dict approach proves to be almost 3x faster than a regular binary search (5 us/lookup vs 13 us/lookup).

I think I got mixed up with some benchmarks that I did on this subject some time ago; either my memory is bad or I made some mistake in those experiments. My apologies.

Cheers,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"

[Attachment: gq-binary-search.py]
[Attachment: gq-dict.py]
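The attached scripts are not reproduced here; the following is only a rough sketch of the same comparison, with numpy's searchsorted standing in for the binary-search side (sizes, key ranges and loop counts are arbitrary, not those of the original benchmark):

import time
import numpy as np

N = 8 * 1000 * 1000
keys = np.unique(np.random.randint(0, 2**40, size=N, dtype=np.int64))
np.random.shuffle(keys)
values = np.arange(len(keys))

# Dictionary lookup: plain Python ints mapped to values.
d = dict(zip(keys.tolist(), values.tolist()))

# Binary-search lookup: sorted keys plus an index array back into values.
order = np.argsort(keys)
sorted_keys = keys[order]

probe = int(keys[len(keys) // 2])

t0 = time.time()
for _ in range(100000):
    d[probe]
t_dict = time.time() - t0

t0 = time.time()
for _ in range(100000):
    pos = np.searchsorted(sorted_keys, probe)
    values[order[pos]]
t_binary = time.time() - t0

print("dict: %.3f s, binary search: %.3f s" % (t_dict, t_binary))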
Re: Populating a dictionary, fast [SOLVED]
On Wednesday 14 November 2007, Istvan Albert wrote:
> On Nov 13, 11:27 am, Francesc Altet <[EMAIL PROTECTED]> wrote:
> > Another possibility is using an indexed column in a table in a DB.
> > Lookups there should be much faster than using a dictionary as
> > well.
>
> I would agree with others who state that for a simple key-based
> lookup nothing beats a dictionary.
>
> But knowing that you are one of the authors of the excellent pytables
> module, I think you are approaching this problem from the perspective
> of reading in a large number of values on each access. In those cases
> specialized libraries (like HDF) can directly load into memory
> huge amounts of contiguous data at speeds that substantially
> outperform all other approaches.

Yes, that could be. However, I was well aware that the problem was looking up one single value, so, mea culpa :( Sorry for introducing noise.

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"
Re: Populating a dictionary, fast [SOLVED]
On Wednesday 14 November 2007, Francesc Altet wrote:
> I think I got mixed up with some benchmarks that I did on this
> subject some time ago; either my memory is bad or I made some
> mistake in those experiments. My apologies.

Just for the record. I was unable to stop thinking about this, and after some investigation, I guess that my recollections came from some benchmarks with code written in Pyrex (i.e. pretty near to C speed).

In the attached benchmark, I compare the speed of dictionary lookups in a big dictionary (more than 8 million entries) against doing the same, but using a binary search approach. Here are the results:

$ python2.5 gq-binary-search.py
Creating the dictionary...
Time for dict creation: 13.001
Calibrating loop...
Calibrating time: 3.095
Timing queries with a dict...
Time for dict query: 7.747
Timing queries with binary search...
Time for binary queries: 3.551

So, a dictionary lookup takes around 950 ns, while a binary search lookup takes around 380 ns, i.e. binary searches are more than 2x faster than dictionary lookups, at least in this benchmark. It might well be that the difference is due to the fact that the binary lookup in this case is done only for 64-bit integers, while the dictionary code has to check the type of the index data first. OTOH, it is also apparent that the binary code could be optimized for cache misses by using two levels of indirection for the binary lookups, but that is too much work for today ;).

At any rate, it seems rather clear that binary lookups can be competitive with hashes, when using Pyrex at least!

Cheers,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"

[Attachment: gq-binary-search.py]
[Attachment: setup.py]

# BinarySearch class to demonstrate that binary search is competitive
# against dictionaries based on hashes

# In order to access the data area in NumPy objects
cdef extern from "numpy/arrayobject.h":
    ctypedef extern class numpy.ndarray [object PyArrayObject]:
        cdef char *data
    void import_array()
import_array()

cdef long bisect_left(long long *a, long long x, long hi):
    cdef long lo, mid
    lo = 0
    if x <= a[0]:
        return 0
    if a[-1] < x:
        return hi
    while lo < hi:
        mid = (lo+hi)/2
        if a[mid] < x:
            lo = mid+1
        else:
            hi = mid
    return lo

cdef class BinarySearch:
    cdef long len
    cdef ndarray str2
    cdef long *revidd
    cdef long long *sidd

    def __new__(self, ndarray sid, ndarray revid, ndarray str2):
        self.str2 = str2
        self.revidd = <long *>revid.data
        self.sidd = <long long *>sid.data
        self.len = len(str2) - 1

    def __getitem__(self, x):
        cdef long pos, idx
        pos = bisect_left(self.sidd, x, self.len)
        idx = self.revidd[pos]
        return self.str2[idx]
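The driver for this class is not shown; presumably the compiled extension is used along these lines (the module name gq_binary_search is hypothetical, and the argument order follows the __new__ signature above):

import numpy

# 64-bit integer keys and the values we want to retrieve for them
keys = numpy.array([900, 12, 55, 7, 431], dtype=numpy.int64)
values = numpy.array(["v900", "v12", "v55", "v7", "v431"])

revid = keys.argsort()   # original positions, in sorted-key order
sid = keys[revid]        # the sorted keys themselves

# BinarySearch is the compiled Pyrex extension class shown above
from gq_binary_search import BinarySearch   # hypothetical module name
bs = BinarySearch(sid, revid, values)
print(bs[55])   # -> 'v55'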
Dictionary vs binary search lookups [was: Populating a dictionary, fast [SOLVED]]
On Tuesday 20 November 2007, Istvan Albert wrote:
> On Nov 19, 2:33 pm, Francesc Altet <[EMAIL PROTECTED]> wrote:
> > Just for the record. I was unable to stop thinking about this, and
> > after some investigation, I guess that my recollections came from
> > some benchmarks with code written in Pyrex (i.e. pretty near to C
> > speed).
>
> Pretty interesting and quite unexpected.

Yeah, and also bogus :( It turned out that in my previous benchmark I used plain numpy int arrays for the measurements, and when you extract an element out of a numpy array what you obtain is a *numpy scalar*, which is a different beast than a Python (long) integer. Unfortunately, Pyrex swallows it and happily converts it to a C long long, but produces a wrong result. This looks like a Pyrex fault, and I should report it to the Pyrex list.

At any rate, with the corrected benchmarks using plain Python lists instead (attached), the numbers are:

Calibrating loop...
Calibrating time: 1.458
Creating the dictionary...
Time for dict creation: 15.305
Timing queries with a dict...
Time for dict query: 11.632
Creating BinarySearch object...
Time for BinarySearch creation: 8.648
Timing queries with a BinarySearch...
Time for BinarySearch queries: 19.393

That is, dictionaries do lookups 1.7x faster than a binary lookup (1.42 us/lookup vs 2.37 us/lookup), which is, I suppose, what one would expect.

Well, this seems to be the end of my 'peculiar' theory, but at least it served to uncover what seems to be a bug in Pyrex :P

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"

# BinarySearch class to demonstrate that binary search is competitive
# against dictionaries based on hashes

import numpy

# In order to access the data area in NumPy objects
cdef extern from "numpy/arrayobject.h":
    ctypedef extern class numpy.ndarray [object PyArrayObject]:
        cdef char *data
    void import_array()
import_array()

cdef long bisect_left(long long *a, long long x, long lo, long hi):
    cdef long mid
    while lo < hi:
        mid = (lo+hi)/2
        if a[mid] < x:
            lo = mid+1
        else:
            hi = mid
    return lo

cdef class BinarySearch:
    cdef long len
    cdef object values
    cdef long *revidd
    cdef long long *sidd
    cdef ndarray sid, revid

    def __new__(self, object keys, object values):
        self.sid = numpy.array(keys)
        self.revid = self.sid.argsort()
        self.revidd = <long *>self.revid.data
        self.sid.sort()
        self.sidd = <long long *>self.sid.data
        self.values = values
        self.len = len(values) - 1

    def __getitem__(self, x):
        cdef long pos, idx
        pos = bisect_left(self.sidd, x, 0, self.len)
        idx = self.revidd[pos]
        return self.values[idx]

[Attachment: gq-binary-search.py]
[Attachment: setup.py]
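To see the numpy-scalar vs. Python-int distinction that caused the bogus results, a quick illustration (not part of the attached benchmark):

import numpy as np

a = np.arange(5, dtype=np.int64)

x = a[2]            # a numpy scalar (numpy.int64), not a plain Python int
y = list(a)[2]      # still numpy.int64: iterating an array yields numpy scalars
z = a.tolist()[2]   # a plain Python int

print(type(x), type(y), type(z))
# <class 'numpy.int64'> <class 'numpy.int64'> <class 'int'>

# A plain Python list of ints, as used in the corrected benchmark,
# contains genuine Python integers from the start:
keys = [1, 2, 3]
print(type(keys[0]))   # <class 'int'>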