Sorry Carl Banks for the late answer; there are problems with Google Groups.
> This is not to study graph theory; I'm using the graph to represent a
> problem domain. The graphs could be arbitrarily large, and could
> easily have millions of nodes, and most nodes have a substantial
> amount of data associated with them. Obviously I don't want a whole
> such graph in memory at once, so libraries that just deal with
> in-memory graphs are out.

It doesn't sound like a problem for modern PCs. I think you can keep the whole graph topology in RAM and the node data on disk (for example in a file or in a DB). The topology is represented by the arcs (you have said nothing regarding arc data, so I assume it's absent) and the nodes: in RAM you just need a 32-bit unsigned integer to represent the index of the node that is stored on disk. If memory becomes tight you can use just 3 bytes per node (2^24 = about 16.7 million different nodes), but generally the memory required for the nodes is very little compared to the memory necessary to store the arcs.

You haven't said how many arcs there are, in total or on average per node, and whether such arcs are directed or undirected. Anyway, using my Graph class (which stores each arc twice), this takes about 1 minute and 1.3 GB of RAM (1 million nodes, 10 arcs per node):

from graph import Graph
from random import randrange

g = Graph()
N = 1000000
g.addNodes(xrange(N))
for i in xrange(N * 10):
    g.addArc(randrange(N), randrange(N))

You have said the graph "could easily have millions of nodes", and every node may have tens of arcs or more ("arbitrarily large" is an unsolvable requirement, because there are always limits in your program; no program is able to work on an "arbitrarily large" dataset), so Python data structures become too costly in RAM. With a lower-level language like D/C/C++ you can manage a bigger graph. You can use the Boost Graph library, but a homemade graph structure may suffice for your purposes too.

For the problem you have explained I think a very simple graph representation may suffice: an array of integer pairs (starting_index, len) (starting_index can be a pointer too), where len is the number of outbound arcs of node n, plus an array of indexes/pointers that lists the outbound arcs. If memory gets tight you can split this second array in two, and use an array of bytes for the lengths (if a node has more than 255 outbound arcs you may need a short). Note that if you use indexes then a Python array.array (or Numpy) suffices. A small sketch of this layout is appended at the end of this post.

In this situation, if nnodes = 10_000_000 and narcs/node = 40 (total arcs = 40 * 10_000_000):

nnodes * narcs * 4 + nnodes * (4 + 4) = 1_680_000_000 bytes,

an amount of RAM that is often available on modern PCs. On a 64-bit machine the indexes take the same memory, but pointers take twice as much. In "memory saving" mode:

nnodes * narcs * 3 + nnodes * (3 + 1) = 1_240_000_000 bytes.

A more handy compromise is:

nnodes * narcs * 3 + nnodes * (4 + 4) = 1_280_000_000 bytes.

> But then what happens if something changes elsewhere.

Regarding the data structure: if you use arrays as I have explained and your updates aren't very frequent, then you can allocate small extra arrays to store the extra arcs coming out of a node (but to do this you will probably prefer pointers over indexes). When you have accumulated a lot of updates you can save everything to disk and then regenerate the whole data structure.

Bye,
bearophile
--
http://mail.python.org/mailman/listinfo/python-list
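P.S. For concreteness, here is a minimal sketch (mine, not taken from any library) of the flat (starting_index, len) adjacency layout described above, using array.array with unsigned integer indexes. The names build_flat_graph and out_arcs are just illustrative:

from array import array
from random import randrange

def build_flat_graph(adjacency):
    # adjacency[n] is the list of destination nodes of node n.
    # Returns three packed arrays: starts[n] is the offset of node n's
    # outbound arcs inside dests, lengths[n] is how many there are.
    starts = array('L', [0] * len(adjacency))
    lengths = array('L', [0] * len(adjacency))
    dests = array('L')   # all outbound arcs of all nodes, packed together
    pos = 0
    for n, outs in enumerate(adjacency):
        starts[n] = pos
        lengths[n] = len(outs)
        dests.extend(outs)
        pos += len(outs)
    return starts, lengths, dests

def out_arcs(starts, lengths, dests, n):
    # Outbound arcs of node n, read back from the packed array.
    s = starts[n]
    return dests[s:s + lengths[n]]

if __name__ == '__main__':
    # Small example: 1000 nodes with 5 random outbound arcs each.
    N = 1000
    adjacency = [[randrange(N) for _ in range(5)] for _ in range(N)]
    starts, lengths, dests = build_flat_graph(adjacency)
    print(list(out_arcs(starts, lengths, dests, 42)))

With 4-byte indexes this is the nnodes * narcs * 4 + nnodes * (4 + 4) figure given above; switching lengths to an array of bytes ('B') gives the "memory saving" variant.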