[ https://issues.apache.org/jira/browse/SOLR-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Guillaume Jactat updated SOLR-17487:
------------------------------------
    Description:
*EDIT 10/10/2024*:
After a detailed analysis of the problematic vectors, I found that the “missing” dimensions all hold a value that already occurs earlier in the same vector.
In concrete terms, Solr deduplicates values that occur several times in a posted vector.
You can see for yourself that the vectors supplied as attachments all contain {*}two or more occurrences of the very same float value{*}. The embedding model I use (all-minilm:33m) seems to generate many such cases.
It seems that {*}Solr only takes into account the first occurrence of these values{*}. As a result, the final vector is shorter than the one that was posted.
The following screenshot shows exactly what happens with a smaller vector field type of size 5: the vector [1, 5, 3, 4, 5] becomes [1, 5, 3, 4].
!image-2024-10-10-23-27-26-566.png!

---------------------------------------------
Hello,

I'm using Solr 9.7 as a vector database. I've come across something I can't explain: I POST my documents as JSON and I've got a vector field of dimension {*}768{*}.

The JSON document I POST has a vector field, which is an array of length 768. Each value is a float.

Solr complains that my array is only *767* long.
I've compared the JSON I POST with the array parsed by Solr and written to the logs, and indeed, one of the 768 values has simply disappeared in the process.

The problem can easily be reproduced. All you have to do is:
 * In your "schema.xml", declare the following dense vector field type:
{code:xml}
<fieldType name="knn_vector_768" class="solr.DenseVectorField" vectorDimension="768" similarityFunction="cosine"/>{code}

 * In your "schema.xml", declare the following dense vector dynamic field:
{code:xml}
<dynamicField name="*_vector_768" type="knn_vector_768" indexed="true" stored="true"/>{code}

 * Use the Solr Admin UI to post the *attached document* to your Solr core.
 * You should get the following error: {*}"incorrect vector dimension. The vector value has size 767 while it is expected a vector with size 768"{*}
 * Furthermore, while the POSTed vector has 768 values, the vector written to the logs has only 767: one value is missing. You can easily spot the missing value with a simple diff.

Maybe someone will find the reason why this specific vector leads to this issue. Of course, I have plenty of other documents that get indexed without any issue.
In case it helps, the value that disappears from the 768-dimension vector is "0.0335415453". It's the 384th dimension (starting from 1).
!image-2024-10-10-18-07-19-370.png!
Thanks for reading

  was:
*EDIT 10/10/2024*:
After a detailed analysis of the problematic vectors, I found that the “missing” dimensions all hold a value that already occurs earlier in the same vector.
In concrete terms, Solr deduplicates values that occur several times in a posted vector.
You can see for yourself that the vectors supplied as attachments all contain {*}two or more occurrences of the very same float value{*}. The embedding model I use (all-minilm:33m) seems to generate many such cases.
It seems that {*}Solr only takes into account the first occurrence of these values{*}. As a result, the final vector is shorter than the one that was posted.

---------------------------------------------
Hello,

I'm using Solr 9.7 as a vector database. I've come across something I can't explain: I POST my documents as JSON and I've got a vector field of dimension {*}768{*}.

The JSON document I POST has a vector field, which is an array of length 768. Each value is a float.

Solr complains that my array is only *767* long.
I've compared the JSON I POST with the array parsed by Solr and written to the logs, and indeed, one of the 768 values has simply disappeared in the process.

The problem can easily be reproduced.
All you have to do is:
 * In your "schema.xml", declare the following dense vector field type:
{code:xml}
<fieldType name="knn_vector_768" class="solr.DenseVectorField" vectorDimension="768" similarityFunction="cosine"/>{code}

 * In your "schema.xml", declare the following dense vector dynamic field:
{code:xml}
<dynamicField name="*_vector_768" type="knn_vector_768" indexed="true" stored="true"/>{code}

 * Use the Solr Admin UI to post the *attached document* to your Solr core.
 * You should get the following error: {*}"incorrect vector dimension. The vector value has size 767 while it is expected a vector with size 768"{*}
 * Furthermore, while the POSTed vector has 768 values, the vector written to the logs has only 767: one value is missing. You can easily spot the missing value with a simple diff.

Maybe someone will find the reason why this specific vector leads to this issue. Of course, I have plenty of other documents that get indexed without any issue.
In case it helps, the value that disappears from the 768-dimension vector is "0.0335415453". It's the 384th dimension (starting from 1).
!image-2024-10-10-18-07-19-370.png!
Thanks for reading


> Can't POST a dense vector that contains two or more occurences of the same
> float value
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-17487
>                 URL: https://issues.apache.org/jira/browse/SOLR-17487
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level.
Issues are Public)
>          Components: UpdateRequestProcessors
>    Affects Versions: 9.7, 9.6.1
>            Reporter: Guillaume Jactat
>            Priority: Major
>         Attachments: image-2024-10-10-18-05-01-195.png,
> image-2024-10-10-18-07-14-904.png, image-2024-10-10-18-07-19-370.png,
> image-2024-10-10-23-27-26-566.png, vector-384.json, vector-384.xml,
> vector-768.json
>
>
> *EDIT 10/10/2024*:
> After a detailed analysis of the problematic vectors, I found that the
> “missing” dimensions all hold a value that already occurs earlier in the
> same vector.
> In concrete terms, Solr deduplicates values that occur several times in a
> posted vector.
> You can see for yourself that the vectors supplied as attachments all
> contain {*}two or more occurrences of the very same float value{*}. The
> embedding model I use (all-minilm:33m) seems to generate many such cases.
> It seems that {*}Solr only takes into account the first occurrence of these
> values{*}. As a result, the final vector is shorter than the one that was
> posted.
> The following screenshot shows exactly what happens with a smaller vector
> field type of size 5: the vector [1, 5, 3, 4, 5] becomes [1, 5, 3, 4].
> !image-2024-10-10-23-27-26-566.png!
>
> ---------------------------------------------
> Hello,
>
> I'm using Solr 9.7 as a vector database. I've come across something I can't
> explain: I POST my documents as JSON and I've got a vector field of
> dimension {*}768{*}.
>
> The JSON document I POST has a vector field, which is an array of length 768.
> Each value is a float.
>
> Solr complains that my array is only *767* long.
> I've compared the JSON I POST with the array parsed by Solr and written to
> the logs, and indeed, one of the 768 values has simply disappeared in the
> process.
>
> The problem can easily be reproduced.
> All you have to do is:
>  * In your "schema.xml", declare the following dense vector field type:
> {code:xml}
> <fieldType name="knn_vector_768" class="solr.DenseVectorField"
> vectorDimension="768" similarityFunction="cosine"/>{code}
>  * In your "schema.xml", declare the following dense vector dynamic field:
> {code:xml}
> <dynamicField name="*_vector_768" type="knn_vector_768" indexed="true"
> stored="true"/>{code}
>  * Use the Solr Admin UI to post the *attached document* to your Solr core.
>  * You should get the following error: {*}"incorrect vector dimension. The
> vector value has size 767 while it is expected a vector with size 768"{*}
>
>  * Furthermore, while the POSTed vector has 768 values, the vector written
> to the logs has only 767: one value is missing. You can easily spot the
> missing value with a simple diff.
> Maybe someone will find the reason why this specific vector leads to this
> issue. Of course, I have plenty of other documents that get indexed without
> any issue.
> In case it helps, the value that disappears from the 768-dimension vector is
> "0.0335415453". It's the 384th dimension (starting from 1).
> !image-2024-10-10-18-07-19-370.png!
> Thanks for reading



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
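
For readers who want to check their own vectors against the behaviour described in the EDIT note above, here is a minimal Python sketch. It is not Solr's code and makes no claim about where in Solr the values are dropped; it only mimics the observed symptom (keep the first occurrence of each value, drop later repeats). The helper names `dedupe_keep_first` and `repeated_values` are hypothetical and exist only for this illustration.

```python
from collections import Counter


def dedupe_keep_first(values):
    """Mimic the observed symptom: keep the first occurrence of each value.

    Assumption: this only reproduces the effect reported in the issue,
    it is not taken from Solr's implementation.
    """
    seen = set()
    kept = []
    for value in values:
        if value not in seen:
            seen.add(value)
            kept.append(value)
    return kept


def repeated_values(values):
    """Return {value: count} for every value that occurs more than once."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


# The size-5 example from the report.
vector = [1.0, 5.0, 3.0, 4.0, 5.0]
collapsed = dedupe_keep_first(vector)

print(collapsed)                    # [1.0, 5.0, 3.0, 4.0]
print(len(vector), len(collapsed))  # 5 4
print(repeated_values(vector))      # {5.0: 2}
```

Running `repeated_values` on the array from a document before posting it should show whether that document contains repeated floats, which is the common characteristic of the attached vectors that fail.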