Couple of thoughts...

- Use only the custom secondary indexes you have built using your categories. 
Pull back all items that match and then do additional filtering and sorting 
client side. 

- If you really need very flexible query semantics and the use case allows for 
it consider using something like Pig or Hive to perform the analysis. 

There is some work going on to improve indexes, not sure if add's all the 
things you are looking for. 

Aaron


On 4 Apr 2011, at 22:31, Miroslav Madecki wrote:

> Hi,
> 
> I would like to combine custom secondary index with Cassandra's
> built-in secondary indexes.
> 
> Custom secondary index uses multiple columns and is based on
> historical but incomplete info about matching data in target column family. 
> It produces list of "candidate" rows which
> could be too large to be processed at the client side.
> 
> For matching on remaining attributes/columns I'd like to use
> built-in secondary indexes to filter out the data at the server side. 
> 
> Now given the context above I have following questions:
> 
> 1) Am I using the right model for the problem at hand?
>    Detailed description can be found below.
> 
> 2) If it is just a question of missing functionality
>    what is the best way to add it?
>    Add IN/OR operator to indexed_clause/get_indexed_slice, add
>    filter clause support with IN operator to multiget or
>    something else.
>    More info on this at the end of detailed description below.
>    
> 3) Is somebody already working on functionality I'm missing?
> 
> Thanks
> Miro
> 
> 
> ItemCF
> 
> Each Item is a member of multiple category subsets:
> A, B, C, D ...
> Membership is represented with composite column names,
> one column for each subset element:
> A:1,A:2,A:3, B:15,B:16:B17, C:5,C6
> Each item has several hundreds of such value-less columns.
> 
> In addition to that elements have multiple name/value columns:
> X, Y, Z with either string or numeric values.
> 
> ---------+---+---+---+-----+-----+-----+-----+------+------+---+
> Row Key  | X | Y | Z | ... | A:1 | A:2 | A:3 | B:15 | B:16 |.. |
> ---------+---+---+---+-----+-----+-----+-----+------+------+---+    
> <item id>|   |   |   |     |     |     |     |      |      |   |
> ---------+---+---+---+-----+-----+-----+-----+------+------+---+
> 
> Problem
> 
> I'm trying to answer query like this:
> 
> Which items are most likely to occur in a specific combination
> of categories A=2 AND B=15 AND C=6 ...
> that also satisfy condition such as:
> X=25 AND Y=default AND (C=some OR C=none)
> results are sorted by number of occurrences in descending order.
> 
> To get sorted number of occurrences in specific A, B, C category
> combination I'm using secondary index. Index is created using
> very large number of statistical samples.
> 
> ItemFrequencyCF
> 
> -----------+----------+----------+----------+-----------+-----+
> Row Key    | <1><id1> | <5><id2> | <5><id3> | <17><id4> | ... |
> -----------+----------+----------+----------+-----------+-----+
> <A>:<B>:<C>|    <id1> |    <id2> |    <id3> |     <id4> |     |
> -----------+----------+----------+----------+-----------+-----+
> 
> Column names are composite integers where frequency
> is stored in upper half and item id in lower half to get sorting
> by frequency.
> Row width can reach hundreds or sometimes thousands of columns.
> We are interested in most cases in only dozens or so items
> but their attributes X, Y, Z may not match so we have to
> do additional filtering on items from ItemCF and look further
> down the list.
> 
> That filtering can't be performed by extending ItemFrequencyCF
> index for following reasons.
> - Statistical samples we use to create index are missing data
>   for X,Y,Z
> - filtering is not always performed with EQ and AND operators
> 
> In order to perform final step of getting matching rows from
> ItemCF I'd like to use id keys from ItemFrequencyCF and some
> sort of index clause to perform filtering on remaining
> attributes X, Y, Z.
> 
> I can get rows using MULTIGET or do X, Y, Z filtering using
> GET_INDEXED_SLICE but there is no operation that combines
> these two.
> Also indexed clause doesn't support OR/IN operator needed when
> columns can contain more than one value.
> 
> Actually OR/IN operator may be used as MULTIGET equivalent
> if we add 'id' column to ItemCF that has the item id (row key)
> and create index expression such as:
> - (id=<id1> OR id=<id2> OR id=<id3> OR id=<id4> ......)
> or
> - id IN (<id1>, <id2>, <id3>, <id4>)
> 
> Obviously 'id' column would have to be declared as index
> column and the expression should be the first one in an index clause used 
> (limitation of current built-in secondary index
> implementation).

Reply via email to