Hey Joe,

I think you are running into a limitation we found in JVector 1 and
the use of HNSW. This is where release timing sucks. Jonathan
continued work on JVector after the initial version was merged into
Cassandra 5 trunk before the code freeze. JVector 2 followed, and then
JVector 3, which is in the JVector repo now with the PQ you mentioned.
This will be merged into trunk as we get closer to release; either
way, it will be in trunk before we release 5.1.

A couple of options. DataStax maintains a public fork of Cassandra,
the forward version we develop for use in Astra. This is where our
upstream contributions come from. You can find it here:
https://github.com/datastax/cassandra. You can build it locally.
However, I think you need JDK 21 to benefit most from the latest vector
code. If you want to see whether it works for your use case without
compiling, you can try it for free on Astra; if it works there, it
will work with this repo.
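
For a first check on either one, a cut-down version of your schema
should be enough. This is only a minimal sketch under assumptions: the
table name doc.ann_trial is hypothetical, the keyspace doc is assumed
to exist already, the vectors are 3-dimensional to keep the literals
short (yours are 768), and I am guessing that `type` is now a regular
SAI-indexed column, since I don't know how you changed the primary key.

```cql
-- Hypothetical trial table: `type` is a regular column here (assumption),
-- not part of the partition key as in the original definition.
CREATE TABLE IF NOT EXISTS doc.ann_trial (
    uuid text,
    fieldname text,
    offset int,
    type text,
    textdata text,
    embeddings vector<float, 3>,   -- 3 dimensions only to keep the example short
    PRIMARY KEY ((uuid), fieldname, offset)
);

-- Both the vector column and the filtered column need an SAI index.
CREATE CUSTOM INDEX IF NOT EXISTS ann_trial_embeddings_idx
    ON doc.ann_trial (embeddings) USING 'StorageAttachedIndex';

CREATE CUSTOM INDEX IF NOT EXISTS ann_trial_type_idx
    ON doc.ann_trial (type) USING 'StorageAttachedIndex';

-- One sample row so the query has something to return.
INSERT INTO doc.ann_trial (uuid, fieldname, offset, type, textdata, embeddings)
VALUES ('u1', 'body', 0, 'type1', 'sample text', [0.1, 0.2, 0.3]);

-- Hybrid query: filter on the indexed column, order by vector similarity.
SELECT textdata FROM doc.ann_trial
WHERE type = 'type1'
ORDER BY embeddings ANN OF [0.1, 0.2, 0.3]
LIMIT 10;
```

If that shape behaves, loading a larger sample of your real 768-dimension
data is the meaningful test, since your timeouts only showed up past
100k vectors.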

Hope this helps

Patrick

On Thu, Nov 7, 2024 at 2:20 PM Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
>
> Found my issue, it was with the primary key being a combination of uuid
> and type.  With that fixed, I now have a table with 1.5 million vectors
> (768 dimensions) on a 16 node cluster. While I can now execute a CQL
> query that includes fields and the order by ANN, it runs too slow.  No
> query completes in the standard cqlsh; they all return OperationTimedOut.
>
> It did work on 100k vectors, but with larger data it times out. We've
> evaluated Weaviate, but have since switched to Qdrant as our current
> vector database.  It's not clear to me how to set things like product
> quantization with Cassandra.  I see that JVector supports it, but there
> doesn't appear to be any documentation on how to make use of it in
> Cassandra.
>
> Options?  Ideas?
> Thank you!
>
> -Joe
>
> On 11/6/2024 10:16 AM, Joe Obernberger wrote:
> > Hi All - I have a table in Cassandra 5.0.2 that has several columns and a
> > vector column.
> >
> > I'm trying to do a hybrid query that includes a column and the
> > ordering using ANN.  Such as:
> >
> > select textdata from doc.google_gtr_t5_large where type='type1' order
> > by embeddings ANN of [-0.005542, 0.000996, 0.039524, -0.004628,
> > -0.017905, -0.002265, -0.119871...] limit 10;
> >
> > This results in:
> > InvalidRequest: Error from server: code=2200 [Invalid query]
> > message="ANN ordering by vector requires all restricted column(s) to
> > be indexed"
> >
> > The embeddings column has an SAI index on it, and the type column does
> > as well.
> > Queries such as:
> >
> > select textdata from doc.google_gtr_t5_large where type='type1' and
> > source ='somesource';
> > work fine.
> >
> > How can I create a table where I can combine a vector search with a
> > column or columns such as a text or timestamp column?
> >
> > Full table definition:
> >
> > CREATE TABLE doc.google_gtr_t5_large (
> >     uuid text,
> >     type text,
> >     fieldname text,
> >     offset int,
> >     textdata text,
> >     creationdate timestamp,
> >     embeddings vector<float, 768>,
> >     metadata boolean,
> >     source text,
> >     sourceurl text,
> >     PRIMARY KEY ((uuid, type), fieldname, offset, textdata)
> > ) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, textdata ASC)
> >     AND additional_write_policy = '99p'
> >     AND allow_auto_snapshot = true
> >     AND bloom_filter_fp_chance = 0.01
> >     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
> >     AND cdc = false
> >     AND comment = ''
> >     AND compaction = {'class':
> > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> > 'max_threshold': '32', 'min_threshold': '4'}
> >     AND compression = {'chunk_length_in_kb': '16', 'class':
> > 'org.apache.cassandra.io.compress.LZ4Compressor'}
> >     AND memtable = 'default'
> >     AND crc_check_chance = 1.0
> >     AND default_time_to_live = 0
> >     AND extensions = {}
> >     AND gc_grace_seconds = 864000
> >     AND incremental_backups = true
> >     AND max_index_interval = 2048
> >     AND memtable_flush_period_in_ms = 0
> >     AND min_index_interval = 128
> >     AND read_repair = 'BLOCKING'
> >     AND speculative_retry = '99p';
> >
> > CREATE CUSTOM INDEX ann_index ON doc.google_gtr_t5_large (embeddings)
> > USING 'sai';
> >
> > CREATE CUSTOM INDEX creationidx ON doc.google_gtr_t5_large
> > (creationdate) USING 'sai';
> >
> > CREATE CUSTOM INDEX sourceidx ON doc.google_gtr_t5_large (source)
> > USING 'sai';
> >
> > CREATE CUSTOM INDEX typeidx ON doc.google_gtr_t5_large (type) USING
> > 'sai';
> >
> > Thank you!
> >
> > -Joe
> >
> >
>
