Perhaps I misunderstand your proposal, but it seems that even with your manual key placement schemes, the row would still be huge, no matter which node it gets placed on. A better solution might be to split each row into several smaller ones, which should balance load better and also make reads faster.
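
To make the idea of splitting a row concrete, here is a rough Python sketch. It hashes each column name to one of a fixed number of buckets and appends the bucket number to the logical row key, so the physical row holding any given column can be computed without a lookup. Everything in it is an assumption for illustration: the bucket count, the `#` separator, and the function names are made up, and it does not call any Cassandra client API.

```python
import hashlib

# Hypothetical bucket count; it would need tuning to the size of the
# largest logical row in the dataset.
NUM_BUCKETS = 16


def bucket_for(column_name, num_buckets=NUM_BUCKETS):
    """Map a column name to a stable bucket number in [0, num_buckets)."""
    digest = hashlib.md5(column_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def physical_key(logical_key, column_name, num_buckets=NUM_BUCKETS):
    """Row key under which a given column of the logical row is stored."""
    return "%s#%02d" % (logical_key, bucket_for(column_name, num_buckets))


def all_physical_keys(logical_key, num_buckets=NUM_BUCKETS):
    """Every row key that may hold part of the logical row (for full reads)."""
    return ["%s#%02d" % (logical_key, b) for b in range(num_buckets)]


if __name__ == "__main__":
    # A single-column write or read touches exactly one physical row.
    print(physical_key("rdf:type", "http://example.org/subject/42"))
    # Reading the whole logical row means fetching all of its buckets.
    print(all_physical_keys("rdf:type", num_buckets=4))
```

Reading the whole logical row still means querying every bucket, but a lookup of a single column goes to exactly one smaller row, and the buckets should end up spread across nodes.
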

- Can you segment the column(s) of the row into different, predictably-named rows?
- Or segment into different rows and use a secondary index to find the rows that are part of a particular RDF?
- And/or compress the RDF data (maybe you're already doing that) to reduce the impact of large rows?

On Sat, Jul 9, 2011 at 4:27 PM, Günter Ladwig <guenter.lad...@kit.edu> wrote:

> Hi all,
>
> we are currently looking at using Cassandra to store highly skewed RDF
> data. With the indexes we use it may happen that a single row contains up to
> 20% of the whole dataset, meaning that it can grow larger than available
> disk space on single nodes. In [1], it says that this limitation is not
> likely to change in the future, but I was wondering if anybody has looked at
> this problem?
>
> One thing that comes to mind is a simple approach to DHT load-balancing
> [2], where keys are assigned to one node of several random alternatives
> (which means that for reading, all these nodes have to be queried). This is
> a bit similar to replication, except, of course, that only one copy of the
> data is stored. As this would require changes to the Cassandra code base, we
> could "simulate" this by randomly choosing one of several predefined
> suffixes and appending it to a key before storing it. By modifying a key
> this way, we could be somewhat sure that it will be stored at a different
> node. The first solution would certainly be preferable.
>
> Any thoughts or experiences? Failing that, maybe someone can give me a
> pointer into the Cassandra code base, where something like [2] should be
> implemented.
>
> Cheers,
> Günter
>
> [1] http://wiki.apache.org/cassandra/CassandraLimitations
> [2] Byers et al.: Simple Load Balancing for Distributed Hash Tables,
> http://www.springerlink.com/content/r9r4qcqxc2bmfqmr/
>
> --
>
> Dipl.-Inform. Günter Ladwig
>
> Karlsruhe Institute of Technology (KIT)
> Institute AIFB
>
> Englerstraße 11 (Building 11.40, Room 250)
> 76131 Karlsruhe, Germany
> Phone: +49 721 608-47946
> Email: guenter.lad...@kit.edu
> Web: www.aifb.kit.edu
>
> KIT – University of the State of Baden-Württemberg and National Large-scale
> Research Center of the Helmholtz Association