We also had similar usecase, after lot of trials with cassandra, we finally created solr schema doc_id(unique key), tags(indexed) in apache solr for answering search query "Get me matching docs by any given no of tags" and that solved our usecase. We had usecase of millions of docs and in tags we can have 100's of tags on a doc.
Please share your final conclusion if you crack this problem within cassandra only, would be interested to know your solution. On Fri, Sep 11, 2015 at 1:23 PM, Artur Siekielski <a...@vhex.net> wrote: > I store documents submitted by users, with optional tags (lists of > strings): > > CREATE TABLE doc ( > user_id uuid, > date text, // part of partition key, to distribute data better > doc_id uuid, > tags list<text>, > contents text, > PRIMARY KEY((user_id, date), doc_id) > ); > > What is the best way to implement tag filtering? A user can select a list > of tags and get documents with the tags. I thought about: > > 1) Full denormalization - include tags in the primary key and insert a doc > for each subset of specified tags. This will however lead to large disk > space usage, because there are 2**n subsets (for 10 tags and a 1MB doc > 1000MB would be written). > > 2) Secondary index on 'tags' collection, and using queries like: > SELECT * FROM doc WHERE user_id=? AND date=? AND tags CONTAINS=? AND tags > CONTAINS=? ... > > Since I will supply partition key value, I assume there will be no > problems with contacting multiple nodes. But how well will it work for > hundreds of thousands of results? I think intersection of tag matches needs > to be performed in memory so it will not scale well. > > 3) Partial denormalization - do inserts for each single tag and then > manually compute intersection. However in the worst case it can lead to > scanning almost the whole table. > > 4) Full denormalization but without contents. I would get correct doc_ids > fast, then I would need to use '... WHERE doc_id IN ?' with potentially a > very large list of doc_ids. > > > What's Cassandra's way to implement this? >