Found my issue: it was the primary key being a combination of uuid
and type. With that fixed, I now have a table with 1.5 million vectors
(768 dimensions) on a 16 node cluster. While I can now execute a CQL
query that filters on fields and orders by ANN, it runs too slowly. No
query completes in the standard cqlsh; they all return OperationTimedOut.
It did work on 100k vectors, but with larger data it times out. We've
evaluated Weaviate, but have since switched to Qdrant as our current
vector database. It's not clear to me how to set things like product
quantization with Cassandra. I see that JVector supports it, but there
doesn't appear to be any documentation on how to make use of it in
Cassandra.
Options? Ideas?
Thank you!
-Joe
On 11/6/2024 10:16 AM, Joe Obernberger wrote:
Hi All - I have a table in Cassandra 5.0.2 that has several columns and a
vector column.
I'm trying to do a hybrid query that combines a restriction on a column
with ordering by ANN, such as:
select textdata from doc.google_gtr_t5_large where type='type1' order
by embeddings ANN of [-0.005542, 0.000996, 0.039524, -0.004628,
-0.017905, -0.002265, -0.119871...] limit 10;
This results in:
InvalidRequest: Error from server: code=2200 [Invalid query]
message="ANN ordering by vector requires all restricted column(s) to
be indexed"
The embeddings column has an SAI index on it, and the type column does
as well.
Queries such as:
select textdata from doc.google_gtr_t5_large where type='type1' and
source ='somesource';
work fine.
How can I create a table where I can combine a vector search with a
column or columns such as a text or timestamp column?
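For reference, here is a minimal sketch of the kind of table and hybrid query being asked about. It assumes Cassandra 5.0 with SAI, and everything in it is hypothetical: the scratch keyspace ann_demo, the table docs, the index names, and the 3-dimension vector (used only to keep the literals readable). The filtered column type is kept out of the primary key here, which is consistent with the follow-up at the top of this thread, though the exact revised schema is not shown there.

```cql
-- Hypothetical scratch keyspace; single-node replication is for illustration only.
CREATE KEYSPACE IF NOT EXISTS ann_demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Assumption: type is a regular column, not part of the partition key.
CREATE TABLE IF NOT EXISTS ann_demo.docs (
  uuid text PRIMARY KEY,
  type text,
  textdata text,
  embeddings vector<float, 3>
);

-- Both the vector column and every restricted column get an SAI index.
CREATE INDEX IF NOT EXISTS docs_ann_idx ON ann_demo.docs (embeddings) USING 'sai';
CREATE INDEX IF NOT EXISTS docs_type_idx ON ann_demo.docs (type) USING 'sai';

INSERT INTO ann_demo.docs (uuid, type, textdata, embeddings)
  VALUES ('a', 'type1', 'first doc', [0.1, 0.2, 0.3]);
INSERT INTO ann_demo.docs (uuid, type, textdata, embeddings)
  VALUES ('b', 'type2', 'second doc', [0.1, 0.2, 0.25]);
INSERT INTO ann_demo.docs (uuid, type, textdata, embeddings)
  VALUES ('c', 'type1', 'third doc', [0.9, 0.1, 0.0]);

-- Hybrid query: filter on the indexed column, then order by ANN. LIMIT is required.
SELECT textdata FROM ann_demo.docs
  WHERE type = 'type1'
  ORDER BY embeddings ANN OF [0.1, 0.2, 0.25]
  LIMIT 2;
```

This only shows the shape of the schema and query; it says nothing about how such a query performs at 1.5 million vectors.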
Full table definition:
CREATE TABLE doc.google_gtr_t5_large (
    uuid text,
    type text,
    fieldname text,
    offset int,
    textdata text,
    creationdate timestamp,
    embeddings vector<float, 768>,
    metadata boolean,
    source text,
    sourceurl text,
    PRIMARY KEY ((uuid, type), fieldname, offset, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, textdata ASC)
    AND additional_write_policy = '99p'
    AND allow_auto_snapshot = true
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND cdc = false
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND memtable = 'default'
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND extensions = {}
    AND gc_grace_seconds = 864000
    AND incremental_backups = true
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99p';
CREATE CUSTOM INDEX ann_index ON doc.google_gtr_t5_large (embeddings) USING 'sai';
CREATE CUSTOM INDEX creationidx ON doc.google_gtr_t5_large (creationdate) USING 'sai';
CREATE CUSTOM INDEX sourceidx ON doc.google_gtr_t5_large (source) USING 'sai';
CREATE CUSTOM INDEX typeidx ON doc.google_gtr_t5_large (type) USING 'sai';
Thank you!
-Joe