Really interesting question, Artur. Have you got any further with it? Based on my experience and Cassandra's good practices, I think full denormalisation is the Cassandra way to go.
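For illustration only, here is a minimal sketch of what your option 1 could look like, assuming the (normalised, sorted) tag combination is folded into the partition key. The docs_by_tags name and the frozen<set<text>> type are just placeholders, not a definitive schema:

CREATE TABLE docs_by_tags (
    user_id uuid,
    date text,
    tags frozen<set<text>>,   // the exact tag combination this row is filed under
    doc_id uuid,
    contents text,
    PRIMARY KEY ((user_id, date, tags), doc_id)
);

// On write, the doc is inserted once per non-empty subset of its tags,
// so any tag selection becomes a single-partition read:
SELECT doc_id, contents FROM docs_by_tags
WHERE user_id = ? AND date = ? AND tags = {'tag1', 'tag2'};

The trade-off is exactly the 2**n write amplification you describe, so this only pays off while the number of tags per document stays small.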
Cheers

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

On 11 September 2015 at 08:53, Artur Siekielski <a...@vhex.net> wrote:

> I store documents submitted by users, with optional tags (lists of
> strings):
>
> CREATE TABLE doc (
>     user_id uuid,
>     date text, // part of partition key, to distribute data better
>     doc_id uuid,
>     tags list<text>,
>     contents text,
>     PRIMARY KEY ((user_id, date), doc_id)
> );
>
> What is the best way to implement tag filtering? A user can select a list
> of tags and get the documents carrying those tags. I have considered:
>
> 1) Full denormalization - include the tags in the primary key and insert a
> copy of the doc for each subset of its tags. This, however, leads to large
> disk usage, because there are 2**n subsets (for 10 tags and a 1MB doc,
> about 1000MB would be written).
>
> 2) A secondary index on the 'tags' collection, with queries like:
> SELECT * FROM doc WHERE user_id=? AND date=? AND tags CONTAINS ? AND tags
> CONTAINS ? ...
>
> Since I supply the partition key, I assume there is no problem with
> contacting multiple nodes. But how well will it work for hundreds of
> thousands of results? I think the intersection of tag matches has to be
> computed in memory, so it will not scale well.
>
> 3) Partial denormalization - insert a row for each single tag and compute
> the intersection manually. In the worst case, however, this can lead to
> scanning almost the whole table.
>
> 4) Full denormalization but without contents. I would get the matching
> doc_ids fast, but then I would need a query like '... WHERE doc_id IN ?'
> with a potentially very large list of doc_ids.
>
> What's Cassandra's way to implement this?
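For comparison, a rough sketch of the two-step read in option 4, under the same assumptions as above (the doc_ids_by_tags name is illustrative; it mirrors the denormalised table but drops contents):

CREATE TABLE doc_ids_by_tags (
    user_id uuid,
    date text,
    tags frozen<set<text>>,
    doc_id uuid,
    PRIMARY KEY ((user_id, date, tags), doc_id)
);

// Step 1: resolve the matching ids from a single partition.
SELECT doc_id FROM doc_ids_by_tags
WHERE user_id = ? AND date = ? AND tags = {'tag1', 'tag2'};

// Step 2: fetch each document from the original doc table, preferably as
// parallel per-id queries rather than one very large IN list.
SELECT contents FROM doc
WHERE user_id = ? AND date = ? AND doc_id = ?;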