Emeric Bernet-Rollande created SOLR-17843:
---------------------------------------------
Summary: TextToVectorUpdateProcessor does not work with atomic
update
Key: SOLR-17843
URL: https://issues.apache.org/jira/browse/SOLR-17843
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: UpdateRequestProcessors, vector-search
Affects Versions: 9.9
Environment: I'm working on an *Ubuntu 22* VM, running a local
Datafari 6.3-DEV server with {*}Solr 9.9{*}.
Reporter: Emeric Bernet-Rollande
Attachments: solrconfig.xml
Hi,
I'm working on *Solr 9.9* and using the *TextToVectorUpdateProcessor* to enrich
documents with semantic vectors. However, I am facing an issue when I try to
use this processor with {_}*atomic updates*{_}.
h2. Full context: Indexing / embeddings workflow
Solr is installed as a component of a search engine, Datafari. In this
scenario, Datafari crawls documents from a source (file share, web...) and
indexes them in a *FileShare* collection.
This FileShare collection has an Update Processor that splits each document
into smaller subdocuments (chunks) and sends them to the *VectorMain*
collection.
Now, I need to vectorise the content of the chunks, using the
[TextToVectorUpdateProcessor|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html].
I used to call this processor in the main processor chain, so all incoming
chunks were embedded. Most of the time, it worked well, but this solution has
two major issues:
* It significantly increases the indexing time
* When an embedding fails for any reason (timeout, network error, LLM
exception...), the associated chunk {*}is not indexed{*}.
That is why I decided to decouple indexing from embedding using [Atomic
Update|https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html].
Here is the new workflow:
* Chunks are indexed in the VectorMain collection without embeddings. The text
content is stored in the "{_}*embedded_content*{_}" field.
{code:java}
<field name="embedded_content" type="text_general" indexed="true" stored="true"
multiValued="false"/> {code}
* Then, we manually run the {*}Atomic Updates Job{*}, which retrieves all the
documents from VectorMain and sends an update request for each of them to the
"{_}*/update/embed*{_}" handler. Here is what the requests look like:
{code:java}
[
....
{
"id": "file://///localhost/dataset/my_document.txt_4",
"embedded_content": { "set": "Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Aenean aliquet quam sed convallis malesuada." }
},
...
]{code}
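As an illustration of the request above, here is a minimal Java sketch that builds the atomic update body and the POST request without sending it. The class and method names ({{AtomicEmbedRequest}}, {{atomicSetBody}}, {{buildRequest}}) are hypothetical and are not part of Datafari or Solr; the hand-written JSON escaping only stands in for a real JSON library.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class AtomicEmbedRequest {

    // Minimal JSON string escaping, for illustration only.
    // A real job would rely on a JSON library instead.
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Body of one atomic update: a "set" operation on embedded_content.
    static String atomicSetBody(String id, String text) {
        return "[{\"id\":" + jsonString(id)
                + ",\"embedded_content\":{\"set\":" + jsonString(text) + "}}]";
    }

    // Builds (but does not send) the POST request to the /update/embed handler.
    static HttpRequest buildRequest(String solrBaseUrl, String collection, String body) {
        return HttpRequest.newBuilder(
                        URI.create(solrBaseUrl + "/" + collection + "/update/embed"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        String body = atomicSetBody(
                "file://///localhost/dataset/my_document.txt_4", "Lorem ipsum dolor sit amet");
        HttpRequest request = buildRequest("http://localhost:8983/solr", "VectorMain", body);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body);
    }
}
```

The request is only constructed here; sending it would be a call to {{java.net.http.HttpClient#send}} against a running Solr instance.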
And here is the handler & processor chain:
{code:java}
<!-- Request handler -->
<requestHandler class="solr.UpdateRequestHandler" name="/update/embed">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.language">ignored_</str>
<str name="fmap.source">ignored_</str>
<str name="fmap.version">ignored_</str>
<str name="fmap._version_">ignored_</str>
<str name="uprefix">ignored_</str>
<str name="update.chain">datafari-embed</str>
</lst>
</requestHandler> {code}
{code:java}
<!-- Processor chain -->
<updateRequestProcessorChain name="datafari-embed">
<processor
class="solr.llm.textvectorisation.update.processor.TextToVectorUpdateProcessorFactory">
<str name="inputField">embedded_content</str>
<str name="outputField">${texttovector.outputfield:vector_1536}</str>
<str name="model">${texttovector.model:default_model}</str>
</processor>
<processor
class="com.francelabs.datafari.updateprocessor.VectorTaggerUpdateProcessorFactory">
<str name="enabled">true</str>
<str name="outputField">${texttovector.outputfield:vector_1536}</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>{code}
* The *TextToVectorUpdateProcessor* takes the value of
"{*}embedded_content{*}", sends it to the external embeddings model (here, I'm
using our homemade [Datafari AI
Agent|https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/3522854915/AI+Agent+-+API+documentation]).
* The external embeddings service vectorises the content of the chunks and
returns the vector.
* If the vectorisation is successful, the (homemade)
VectorTaggerUpdateProcessor adds the name of the output vector field in the
multivalued "{*}has_vector{*}" String field.
h2. The problem
At first glance, the workflow described above seems to work just fine. However, I
noticed a significant issue: *the content received by the embeddings service is
different from the expected one.*
See the {*}actual AI Agent logs{*}:
{code:java}
2025-08-07 14:50:44,951 - aiagent - INFO - Request received - POST /embeddings
: 60
2025-08-07 14:50:44,951 - aiagent - DEBUG - Input query 60: {set=Lorem ipsum
dolor sit amet, consectetur adipiscing elit. Aenean aliquet ... {code}
Here are the {*}expected logs{*}:
{code:java}
2025-08-07 14:50:44,951 - aiagent - INFO - Request received - POST /embeddings
: 60
2025-08-07 14:50:44,951 - aiagent - DEBUG - Input query 60: Lorem ipsum
dolor sit amet, consectetur adipiscing elit. Aenean aliquet ... {code}
(x) It appears that the *TextToVectorUpdateProcessor* uses the "raw value" of
the embedded content ("{color:#0747a6}_{set=Lorem ipsum dolor...}_{color}")
instead of the actual value ("{color:#0747a6}_Lorem ipsum dolor_{color}").
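The logged {{{set=...}}} text is exactly what a Java {{Map}} prints when it is converted to a string. The sketch below reproduces that symptom in isolation. It assumes the atomic update modifier reaches the processor as a {{java.util.Map}} and that the value is stringified as-is; this is a guess based on the log output, not a reading of the actual processor code. A plain map stands in for the Solr input document, and {{naiveInputText}} is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RawAtomicValueDemo {

    // Mimics a processor that stringifies whatever object it finds in the input field.
    static String naiveInputText(Map<String, Object> doc, String inputField) {
        return String.valueOf(doc.get(inputField));
    }

    public static void main(String[] args) {
        // Stand-in for the parsed atomic update: the field value is a modifier map.
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "file://///localhost/dataset/my_document.txt_4");
        doc.put("embedded_content", Map.of("set", "Lorem ipsum dolor"));

        // Prints: {set=Lorem ipsum dolor}
        System.out.println(naiveInputText(doc, "embedded_content"));
    }
}
```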
h2. What does the documentation say?
According to the [TextToVectorUpdateProcessor
documentation|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html#enriching-documents-with-vectors-at-indexing-time],
it is possible to use atomic update for embeddings.
I tried to follow the instructions:
* Using the existing /update/embed handler
* Creating the "vectorised" field
{code:java}
<field name="vectorised" type="boolean" uninvertible="false" docValues="true"
indexed="true" stored="false"/> {code}
* Sending an atomic update on an existing (not embedded) document:
{code:java}
curl -X POST "http://localhost:8983/solr/VectorMain/update/embed?commit=true" \
-H "Content-Type: application/json" \
-d '[
{
"id": "file://///localhost/mini/loremipsum.txt_0",
"vectorised":{"set":true}
}
]' {code}
According to the
[documentation|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html#enriching-documents-with-vectors-at-indexing-time],
the update processor {*}should retrieve the value from the document's
_embedded_content_{*}:
{quote}What will happen is that internally Solr fetches the stored content of
the docs to update, all the existing fields are retrieved and a re-indexing
happens, targeting the 'vectorisation' chain that will add the vector and set
the boolean 'vectorised' field to 'true'.
{quote}
However, here, *(!) it does not (!).*
The Solr response is OK. The request is logged:
{code:java}
INFO 2025-08-07T16:13:05Z
(searcherExecutor-102-thread-5-processing-VectorMain_shard1_replica_n1
127.0.0.1-90 core_node2 127.0.0.1:8983_solr VectorMain shard1) -
Solr|Solr|org.apache.solr.core.SolrCore|[VectorMain shard1 core_node2
VectorMain_shard1_replica_n1] o.a.s.c.SolrCore Registered new searcher autowarm
time: 0 ms
INFO 2025-08-07T16:13:05Z (qtp1739267143-220-127.0.0.1-90) -
Solr|Solr|org.apache.solr.update.processor.LogUpdateProcessorFactory|[VectorMain
shard1 core_node2 VectorMain_shard1_replica_n1]
o.a.s.u.p.LogUpdateProcessorFactory webapp=/solr path=/update/embed
params={commit=true}{add=[file://///localhost/mini/loremipsum.txt_0
(1839813818592526336)], commit=} 0 71 {code}
However, if I don't provide "embedded_content" in the request, the Update
Processor ignores the document and doesn't call the external service.
h2. Suggestions
I tried many things to fix these two issues. Maybe I'm missing an important
point, but if I'm not, here are my suggestions.
* Handle "atomically updated" fields as inputField in the
*TextToVectorUpdateProcessor.*
* Improve the processor to reload missing inputField from stored fields if not
provided.
* Alternatively, clarify the documentation to indicate that partial updates must
still include inputField.
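To make the first two suggestions concrete, here is a rough Java sketch of the resolution logic I have in mind. It is not a patch against Solr: {{InputTextResolver}}, {{resolve}} and the {{storedLookup}} function are hypothetical names, a plain map stands in for the incoming document, and only the "set" modifier is handled.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

public class InputTextResolver {

    // Resolves the text to embed for one incoming (possibly partial) document.
    // storedLookup is a hypothetical hook returning the stored inputField value for an id.
    static Optional<String> resolve(Map<String, Object> requestDoc,
                                    String inputField,
                                    Function<String, Optional<String>> storedLookup) {
        Object raw = requestDoc.get(inputField);

        if (raw instanceof Map) {
            // Atomic update modifier: unwrap {"set": value}.
            // Other modifiers (add, remove, ...) are not handled in this sketch.
            Object value = ((Map<?, ?>) raw).get("set");
            return value == null ? Optional.empty() : Optional.of(value.toString());
        }
        if (raw != null) {
            // Regular (full) update: the value is already the text.
            return Optional.of(raw.toString());
        }

        // inputField absent from the request: fall back to the stored value, if any.
        Object id = requestDoc.get("id");
        return id == null ? Optional.empty() : storedLookup.apply(id.toString());
    }

    public static void main(String[] args) {
        Function<String, Optional<String>> stored =
                id -> Optional.of("stored text of " + id);

        Map<String, Object> atomic =
                Map.of("id", "doc_1", "embedded_content", Map.of("set", "Lorem ipsum"));
        Map<String, Object> flagOnly =
                Map.of("id", "doc_1", "vectorised", Map.of("set", true));

        System.out.println(resolve(atomic, "embedded_content", stored).orElse("<none>"));
        System.out.println(resolve(flagOnly, "embedded_content", stored).orElse("<none>"));
    }
}
```

With this logic, the first request would embed the unwrapped text, and the second (which only sets "vectorised") would embed the stored content instead of being skipped.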
If you have any questions or remarks, feel free to ask. I'm also open to any
ideas or advice. Thanks for reading!