[ https://issues.apache.org/jira/browse/SOLR-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pierre Salagnac updated SOLR-17487:
-----------------------------------
    Attachment: image.png

> Can't POST a dense vector that contains two or more occurrences of the same float value
> ---------------------------------------------------------------------------------------
>
>                 Key: SOLR-17487
>                 URL: https://issues.apache.org/jira/browse/SOLR-17487
>             Project: Solr
>          Issue Type: Bug
>          Components: UpdateRequestProcessors
>    Affects Versions: 9.7, 9.6.1
>            Reporter: Guillaume Jactat
>            Priority: Major
>         Attachments: image-2024-10-10-18-05-01-195.png,
> image-2024-10-10-18-07-14-904.png, image-2024-10-10-18-07-19-370.png,
> image-2024-10-10-23-27-26-566.png, image.png, vector-384.json,
> vector-384.xml, vector-768.json
>
>
> *EDIT 10/10/2024*:
> After a detailed analysis of the problematic vectors, I found that the “missing” dimensions were actually dimensions holding a value that already appears elsewhere in the same vector.
> In concrete terms, values that are present several times in the posted vectors are deduplicated by Solr.
> You can see for yourself that the vectors supplied as attachments have the common characteristic of containing {*}two or more occurrences of the very same float value{*}. The embedding model I use (all-minilm:33m) seems to generate many such cases.
> It seems that {*}Solr only takes into account the first occurrence of these values{*}. As a result, the length of the final vector is no longer correct.
> The following screenshot shows exactly what happens with a smaller vector field type of size 5: the vector [1, 5, 3, 4, 5] becomes [1, 5, 3, 4].
> !image-2024-10-10-23-27-26-566.png!
>
> ---------------------------------------------
> Hello,
>
> I'm using Solr 9.7 as a vector database. I've come across something I can't explain: I POST my documents as JSON and I've got a vector field of dimension {*}768{*}.
>
> The JSON document I POST has a vector field, which is an array of length 768. Each value is a float.
>
> Solr complains that my array is only *767* long...
> I've compared the JSON I POST with the array parsed by Solr and written to the logs, and indeed, one of the 768 values has simply disappeared in the process.
>
> The problem can easily be reproduced. All you have to do is:
> * In your "schema.xml", declare the following dense vector field type:
> {code:xml}
> <fieldType name="knn_vector_768" class="solr.DenseVectorField" vectorDimension="768" similarityFunction="cosine"/>{code}
> * In your "schema.xml", declare the following dense vector dynamic field:
> {code:xml}
> <dynamicField name="*_vector_768" type="knn_vector_768" indexed="true" stored="true"/>{code}
> * Use the Solr Admin UI to post the *attached document* to your Solr core.
> * You should get the following error: "{*}incorrect vector dimension. The vector value has size 767 while it is expected a vector with size 768{*}"
> * Furthermore, while the POSTed vector has 768 values, the vector written to the logs has only 767: one value is missing. You can easily spot the missing value with a simple diff.
>
> Maybe someone will find the reason why this specific vector leads to this issue. Of course, I have plenty of other documents that get indexed without any issue.
> In case it helps, the value that disappears from the 768 vector is "0.0335415453". It's the 384th dimension (starting from 1).
> !image-2024-10-10-18-07-19-370.png!
> Thanks for reading
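
The symptom described in the report can be mimicked outside Solr with a short Python sketch. It is an illustration only: the helper names `keep_first_occurrences` and `repeated_values` are hypothetical, and the code simply assumes, as the EDIT in the report suggests, that only the first occurrence of each value survives. It does not reproduce Solr's actual code path.

```python
from collections import Counter


def keep_first_occurrences(vector):
    """Mimic the reported symptom: drop repeated values, keeping the first one."""
    seen = set()
    kept = []
    for value in vector:
        if value not in seen:
            seen.add(value)
            kept.append(value)
    return kept


def repeated_values(vector):
    """Return each value that appears more than once, with its count."""
    return {value: count for value, count in Counter(vector).items() if count > 1}


if __name__ == "__main__":
    posted = [1.0, 5.0, 3.0, 4.0, 5.0]  # the size-5 example from the report
    print(keep_first_occurrences(posted))  # [1.0, 5.0, 3.0, 4.0]
    print(repeated_values(posted))         # {5.0: 2}
```

Running `repeated_values` on a vector before posting it is one way to check whether a given document is likely to hit the problem: according to the report, vectors without any repeated value are indexed normally.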