Hi,

I have a column family storing very large blobs that I would not like to
duplicate, if possible.
Here's a simplified version:

CREATE TABLE timeline (
   key text,
   a int,
   b int,
   value blob,
   PRIMARY KEY (key, a, b)
);

On this, I run exactly two types of query. Both of them must have a query
range on 'a', and just one must have 'b' restricted.

First query:

cqlsh> SELECT * FROM timeline where key = 'event' and a >= 2 and a <= 3;

This one runs fine.

Second query:

cqlsh> SELECT * FROM timeline where key = 'event' and a >= 2 and a <= 3 and
b = 12;
code=2200 [Invalid query] message="PRIMARY KEY column "b" cannot be
restricted (preceding column "ColumnDefinition{name=a,
type=org.apache.cassandra.db.marshal.Int32Type, kind=CLUSTERING_COLUMN,
componentIndex=0, indexName=null, indexType=null}" is either not restricted
or by a non-EQ relation)"

This fails. Even if I create an index:

CREATE INDEX timeline_b ON timeline (b);
cqlsh> SELECT * FROM timeline where key = 'event' and a >= 2 and a <= 3 and
b = 12;
code=2200 [Invalid query] message="Cannot execute this query as it might
involve data filtering and thus may have unpredictable performance. If you
want to execute this query despite the performance unpredictability, use
ALLOW FILTERING"

I solved this problem by duplicating the column family (in "timeline_by_a"
and "timeline_by_b" where a and b are in opposite order), but I'm wondering
if there's a better solution, as this tends to grow pretty big.

In particular, from the little understanding that I have of the Cassandra
internals, it seems like even the second query should be fairly efficient
since the clustering columns are stored in order on disk, thus I don't
understand the ALLOW FILTERING requirement.

Another alternative that I'm thinking is just keeping another column family
that will serve as an "index" and I'll manually manage it in the
application.

Thanks.

Reply via email to