Really interesting question, Artur. Have you gone any further with it?

Based on my experience and on Cassandra's recommended practices, I think
full denormalisation is the Cassandra way to go here.
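
To make that concrete, here is a rough sketch of what I have in mind
(table and column names are just my illustration, not tested): a lookup
table keyed by the exact tag combination, with one insert per subset of
a document's tags.

CREATE TABLE doc_by_tags (
  user_id uuid,
  date text,                // same partitioning idea as in your doc table
  tags frozen<set<text>>,   // one row per subset of the document's tags
  doc_id uuid,
  contents text,
  PRIMARY KEY ((user_id, date, tags), doc_id)
);

// Querying an exact tag combination then hits a single partition:
SELECT * FROM doc_by_tags
 WHERE user_id = ? AND date = ? AND tags = {'cassandra', 'modelling'};

The cost is the write amplification you already pointed out, but reads
stay cheap and predictable.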

Cheers

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

On 11 September 2015 at 08:53, Artur Siekielski <a...@vhex.net> wrote:

> I store documents submitted by users, with optional tags (lists of
> strings):
>
> CREATE TABLE doc (
>   user_id uuid,
>   date text, // part of partition key, to distribute data better
>   doc_id uuid,
>   tags list<text>,
>   contents text,
>   PRIMARY KEY((user_id, date), doc_id)
> );
>
> What is the best way to implement tag filtering? A user can select a list
> of tags and get the documents that carry all of those tags. I thought about:
>
> 1) Full denormalization - include tags in the primary key and insert a doc
> for each subset of the specified tags. This will, however, lead to large
> disk space usage, because there are 2**n subsets (for 10 tags and a 1MB
> doc, about 1000MB would be written).
>
> 2) Secondary index on the 'tags' collection, and using queries like:
> SELECT * FROM doc WHERE user_id=? AND date=? AND tags CONTAINS ? AND tags
> CONTAINS ? ...
>
> Since I will supply the partition key value, I assume there will be no
> problem with contacting multiple nodes. But how well will it work for
> hundreds of thousands of results? I think the intersection of tag matches
> needs to be performed in memory, so it will not scale well.
>
> 3) Partial denormalization - do inserts for each single tag and then
> manually compute the intersection. However, in the worst case it can lead
> to scanning almost the whole table.
>
> 4) Full denormalization, but without contents. I would get the correct
> doc_ids fast, but then I would need to use '... WHERE doc_id IN ?' with a
> potentially very large list of doc_ids.
>
>
> What's Cassandra's way to implement this?
>
