On Mon, Jan 18, 2016 at 8:52 PM, Kevin Burton <bur...@spinn3r.com> wrote:

> Internally we have the need for a blob store for web content.  It's MOSTLY
> key/value based but we'd like to have lookups by coarse grained tags.
>
> This needs to store normal web content like HTML, CSS, JPEG, SVG, etc.
>
> Highly doubt that anything over 5MB would need to be stored.
>
> We also need the ability to store older versions of the same URL for
> features like "time travel" where we can see what the web looks like over
> time.
>
> I initially wrote this for Elasticsearch (and it works well for that) but
> it looks like binaries snuck into the set of requirements.
>
> I could Base64 encode/decode them in ES I guess but that seems ugly.
>
> I was thinking of porting this over to C* but I'm not up to date on the
> current state of blobs in C*...
>
> Any advice?
>

We (Wikimedia Foundation) use Cassandra as a durable cache for HTML (with
history).  A simplified version of the schema we use would look something
like:

CREATE TABLE data (
    "_domain" text,
    key text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", key), rev, tid)
);

In our case, a 'rev' represents a normative change to the document (read:
someone made an edit), and the 'tid' attribute allows for an arbitrary
number of HTML representations of that revision (for example, when a
template transclusion changes the rendered output).  You could simplify
this further by removing the 'tid' attribute if that doesn't apply to you.
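
To illustrate (the domain and key values here are hypothetical, and the
column names are from the simplified schema above), retrieving the most
recent rendering of a document is just a reverse scan of the clustering
columns:

SELECT rev, tid, value
  FROM data
 WHERE "_domain" = 'en.wikipedia.org' AND key = 'Main_Page'
 ORDER BY rev DESC, tid DESC
 LIMIT 1;

Dropping the LIMIT (or adding a range predicate on rev) gives you the
"time travel" view of everything stored for that key.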

One concern here is the size of blobs.  Where exactly the threshold on size
should be is probably debatable, but if you are using G1GC I would be
careful about what large blobs do to humongous allocations.  G1 treats
anything over 1/2 the region size as humongous and special-cases the
handling of it, so humongous allocations should be the exception and not
the rule.  Depending on your heap size and the distribution of blob sizes,
you might be able to get by with overriding the GC's choice of region size.
Region sizes are powers of two between 1MB and 32MB, so if 5MB values are
at all common you'll need 16MB regions to keep them under the half-region
threshold (which probably won't work well without a correspondingly large
max heap).
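
If you do experiment with that, the knob is -XX:G1HeapRegionSize.  A sketch
of the relevant flags (however you pass JVM options in your setup; the
values are illustrative, not a recommendation):

# Illustrative only: sized so ~5MB blobs stay below the humongous
# threshold (half the region size); assumes a correspondingly large heap.
-XX:+UseG1GC
-XX:G1HeapRegionSize=16m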

Another concern is row width.  With a data model like this, rows will grow
relative to the number of versions stored.  If versions are added at a low
rate, that might not pose an issue in practice; if it does, you'll need to
consider a different partitioning strategy (one possibility is sketched
below).
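
For what it's worth, one hypothetical way to bound partition growth is to
fold a bucket into the partition key (say, a coarse revision bucket
computed by the application), at the cost of reads that span buckets
becoming more complicated:

CREATE TABLE data_bucketed (
    "_domain" text,
    key text,
    bucket int,        -- e.g. rev / 1000, computed client-side
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", key, bucket), rev, tid)
);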

TL;DR You need to understand what your data will look like.  Min and max
value sizes aren't enough; you should have some idea of the size
distribution, read/write rates, etc.  Understand the implications of your
data model.  And then test, test, test.


-- 
Eric Evans
eev...@wikimedia.org
