I came to a similar conclusion: once you have more than a few tags, the problem is no longer simple "tagging" but closer to regular "document search" with indexed words. There are too many possible tag subsets to precompute matching documents for each one, so you need to index documents individually and compute intersections dynamically. For acceptable performance, the indexes must be held fully in memory, in data structures that allow computing intersections fast. This is not something regular databases implement (though they can serve as backing storage for indexes that are loaded into memory).
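To make the idea concrete, here is a minimal sketch of such an in-memory inverted index: one posting set of doc ids per tag, with queries answered by intersecting the sets, smallest first. All names (TagIndex, add_doc, search) are illustrative, not from any library.

```python
class TagIndex:
    """Toy in-memory inverted index: tag -> set of doc ids."""

    def __init__(self):
        self.postings = {}

    def add_doc(self, doc_id, tags):
        for tag in tags:
            self.postings.setdefault(tag, set()).add(doc_id)

    def search(self, tags):
        # Intersect the smallest posting sets first; bail out early
        # if the running result becomes empty.
        sets = sorted((self.postings.get(t, set()) for t in tags), key=len)
        if not sets:
            return set()
        result = set(sets[0])
        for s in sets[1:]:
            result &= s
            if not result:
                break
        return result

idx = TagIndex()
idx.add_doc(1, ["python", "db"])
idx.add_doc(2, ["python", "search"])
idx.add_doc(3, ["python", "db", "search"])
print(idx.search(["python", "db"]))  # {1, 3}
```

Real engines replace the plain sets with sorted or compressed posting lists (skip lists, bitmaps) so intersections stay fast at millions of documents, but the query shape is the same.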

So the solution is either to limit the number of tags per doc to 3-4 and fully denormalize (storing each doc under every subset of its tags, for a duplication factor of up to 8-16x), or to use a search engine.
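A hedged sketch of the full-denormalization option: precompute every tag subset of a doc as a lookup key, so any tag-combination query becomes a single exact-key read. The names (subset_keys, index_doc, lookup) are illustrative; in Cassandra the subset tuple would be the partition key.

```python
from itertools import combinations

def subset_keys(tags):
    """Yield every non-empty subset of the doc's tags as a sorted tuple."""
    tags = sorted(tags)
    for r in range(1, len(tags) + 1):
        yield from combinations(tags, r)

lookup = {}  # sorted tag tuple -> set of doc ids

def index_doc(doc_id, tags):
    for key in subset_keys(tags):
        lookup.setdefault(key, set()).add(doc_id)

index_doc(1, ["a", "b", "c"])
index_doc(2, ["b", "c"])

# One exact-key read per query, at the cost of up to 2^k - 1 rows
# per doc (7-15 rows for k = 3-4 tags, hence the ~8-16x factor).
print(lookup[tuple(sorted(["c", "b"]))])  # {1, 2}
```

The write amplification is why this only stays tolerable with a hard cap on tags per doc; at 100+ tags per doc, as in the use case quoted below, the subset count explodes and a search engine is the only realistic option.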

On 09/16/2015 11:29 AM, Naresh Yadav wrote:
We also had a similar use case. After a lot of trials with Cassandra, we
finally created a Solr schema with doc_id (unique key) and tags (indexed)
in Apache Solr to answer the search query "get me matching docs by any
given number of tags", and that solved our use case. We had millions of
docs, and a doc can have hundreds of tags.

Please share your final conclusion if you crack this problem within
Cassandra only; I would be interested to know your solution.