Perhaps I misunderstand your proposal, but it seems that even with your
manual key placement scheme, the row would still be huge no matter which
node it is placed on.  A better solution might be to split each row into
several smaller ones, which should balance load better and also make reads
faster.

- Can you segment the column(s) of the row into different, predictably-named
rows?
- Or segment into different rows and use a secondary index to find the rows
that are part of a particular RDF?
- And/or compress the RDF data (maybe you're already doing that) to reduce
the impact of large rows?
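
To make the first option concrete, here is a minimal sketch of what I mean by
predictably-named rows. It is only an illustration, not tested against your
setup: the names bucket_key and all_bucket_keys, the ":NNNN" key suffix and the
bucket count of 16 are all assumptions of mine, not anything Cassandra
provides. Each column is hashed to one of a fixed number of buckets, so a
lookup of a known column touches exactly one small row, and a full read
enumerates all the bucket rows.

```python
import hashlib

NUM_BUCKETS = 16  # assumed value; choose based on how skewed the data is


def bucket_key(row_key, column_name, num_buckets=NUM_BUCKETS):
    """Return the physical row key that holds column_name of logical row_key.

    The bucket is derived from a hash of the column name, so the same
    column always maps to the same physical row.
    """
    digest = hashlib.md5(column_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    return "%s:%04d" % (row_key, bucket)


def all_bucket_keys(row_key, num_buckets=NUM_BUCKETS):
    """Return every physical row key that may hold part of logical row_key."""
    return ["%s:%04d" % (row_key, b) for b in range(num_buckets)]
```

A write would go to bucket_key(row, column), and a read of the whole logical
row would fetch every key from all_bucket_keys(row) in one multi-key request.
Unlike choosing a random suffix, a single-column read does not have to query
all the alternatives, because the bucket is computed from the column name.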

On Sat, Jul 9, 2011 at 4:27 PM, Günter Ladwig <guenter.lad...@kit.edu> wrote:

> Hi all,
>
> we are currently looking at using Cassandra to store highly skewed RDF
> data. With the indexes we use it may happen that a single row contains up to
> 20% of the whole dataset, meaning that it can grow larger than available
> disk space on single nodes. In [1], it says that this limitation is not
> likely to change in the future, but I was wondering if anybody has looked at
> this problem?
>
> One thing that comes to mind is a simple approach to DHT load-balancing
> [2], where keys are assigned to one node of several random alternatives
> (which means that for reading, all these nodes have to be queried). This is
> a bit similar to replication, except, of course, that only one copy of the
> data is stored. As this would require changes to the Cassandra code base, we
> could "simulate" this by randomly choosing one of several predefined
> suffixes and appending it to a key before storing it. By modifying a key
> this way, we could be somewhat sure that it will be stored at a different
> node. The first solution would certainly be preferable.
>
> Any thoughts or experiences? Failing that, maybe someone can give me a
> pointer into the Cassandra code base, where something like [2] would be
> implemented.
>
> Cheers,
> Günter
>
> [1] http://wiki.apache.org/cassandra/CassandraLimitations
> [2] Byers et al.: Simple Load Balancing for Distributed Hash Tables,
> http://www.springerlink.com/content/r9r4qcqxc2bmfqmr/
>
> --
>
> Dipl.-Inform. Günter Ladwig
>
> Karlsruhe Institute of Technology (KIT)
> Institute AIFB
>
> Englerstraße 11 (Building 11.40, Room 250)
> 76131 Karlsruhe, Germany
> Phone: +49 721 608-47946
> Email: guenter.lad...@kit.edu
> Web: www.aifb.kit.edu
>
> KIT – University of the State of Baden-Württemberg and National Large-scale
> Research Center of the Helmholtz Association
>
>
