Hi all,

I'm currently looking at new database options for a URL shortener in order
to scale well with increased traffic as we add new features. Cassandra seems
to be a good fit for many of our requirements, but I'm struggling a bit to
find ways of designing certain indexes in Cassandra due to its 2GB row
limit.

The easiest example of this is that I'd like to create an index by the
domain that shortened URLs are linking to, mostly for spam control so it's
easy to grab all the links to any given domain. As far as I can tell the
typical way to do this in Cassandra is something like:

DOMAIN = { //columnfamily
    thing.com { //row key
        timestamp: "shorturl567", //column name: value
        timestamp: "shorturl144",
        timestamp: "shorturl112",
        ...
    }
    somethingelse.com {
        timestamp: "shorturl817",
        ...
    }
}

The values here are keys for another columnfamily containing various data on
shortened URLs.
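
To make the layout concrete, here is a rough Python sketch of the same
structure using plain dictionaries. It is only an in-memory model, not
Cassandra client code, and the names (domain_index, short_urls, add_link,
links_for_domain) are made up for illustration.

```python
from collections import defaultdict

# In-memory stand-in for the two column families; this is NOT a Cassandra client.
# DOMAIN column family: row key = domain, column name = timestamp,
# column value = short URL key.
domain_index = defaultdict(dict)

# Hypothetical second column family: row key = short URL key, holding
# whatever data is stored about each shortened URL.
short_urls = {}


def add_link(short_key, target_domain, target_url, timestamp):
    """Store the short URL and add one column to the target domain's row."""
    short_urls[short_key] = {"target": target_url, "created": timestamp}
    # One column per shortened URL. Two links with the same timestamp would
    # overwrite each other in this simplified model.
    domain_index[target_domain][timestamp] = short_key


def links_for_domain(domain):
    """Return every short URL key for a domain, ordered by timestamp."""
    row = domain_index.get(domain, {})
    return [row[ts] for ts in sorted(row)]


add_link("shorturl112", "thing.com", "http://thing.com/a", 1)
add_link("shorturl144", "thing.com", "http://thing.com/b", 2)
add_link("shorturl817", "somethingelse.com", "http://somethingelse.com/", 3)
print(links_for_domain("thing.com"))
```

Every shortened URL adds one column to its domain's row, so a row grows in
step with the number of links to that domain.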

The problem with this approach is that a popular domain (e.g. blogspot.com)
could be used in many millions of shortened URLs, so its row would have that
many columns and hit the row size limit mentioned at
http://wiki.apache.org/cassandra/CassandraLimitations.

Does anyone know an effective way to design this type of one-to-many index
around this limitation (could be something obvious I'm missing)? If not, are
the changes proposed for
https://issues.apache.org/jira/browse/CASSANDRA-16 likely to make this
type of design workable?

Thanks in advance for any advice,

Richard
