Emeric Bernet-Rollande created SOLR-17843:
---------------------------------------------

             Summary: TextToVectorUpdateProcessor does not work with atomic 
update
                 Key: SOLR-17843
                 URL: https://issues.apache.org/jira/browse/SOLR-17843
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: UpdateRequestProcessors, vector-search
    Affects Versions: 9.9
         Environment: I'm working on an *Ubuntu 22* VM, running a local 
Datafari 6.3-DEV server with {*}Solr 9.9{*}.

 
            Reporter: Emeric Bernet-Rollande
         Attachments: solrconfig.xml

Hi,

I'm working on *Solr 9.9* and using the *TextToVectorUpdateProcessor* to enrich 
documents with semantic vectors. However, I am facing an issue when I try to 
use this processor with {_}*atomic update*{_}.

 
h2. Full context: Indexing / embeddings workflow

Solr is installed as a component of a search engine, Datafari. In this 
scenario, Datafari crawls documents from a source (file share, web...) and 
indexes them in a *FileShare* collection.

This FileShare collection has an Update Processor that chunks all documents 
into smaller subdocuments (chunks) and sends them to the *VectorMain* 
collection.

Now, I need to vectorise the content of the chunks, using the 
[TextToVectorUpdateProcessor|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html].

I used to call this processor in the main processor chain, so all incoming 
chunks were embedded. Most of the time, it worked well, but this solution has 
two major issues:
 * It significantly increases the indexing time
 * When an embedding fails for any reason (timeout, network error, LLM 
exception...), the associated chunk {*}is not indexed{*}.

That is why I decided to dissociate the indexing from embeddings using [Atomic 
Update|https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html].

Here is the new workflow:
 * Chunks are indexed in the VectorMain collection without embeddings. The text 
content is stored in the "{_}*embedded_content*{_}" field.

{code:java}
<field name="embedded_content" type="text_general" indexed="true" stored="true" 
multiValued="false"/> {code}
 * Then, we manually run the {*}Atomic Updates Job{*}, which retrieves all the 
documents from VectorMain and sends an update request for each of them to the 
"{_}*/update/embed*{_}" handler. Here is what the requests look like: 

{code:java}
[
    ....
    {
        "id": "file://///localhost/dataset/my_document.txt_4",
        "embedded_content": { "set": "Lorem ipsum dolor sit amet, consectetur 
adipiscing elit. Aenean aliquet quam sed convallis malesuada." }
    },
    ...
]{code}
 And here is the handler & processor chain:
{code:java}
<!-- Request handler -->
<requestHandler class="solr.UpdateRequestHandler" name="/update/embed">
    <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="fmap.language">ignored_</str>
        <str name="fmap.source">ignored_</str>
        <str name="fmap.version">ignored_</str>
        <str name="fmap._version_">ignored_</str>
        <str name="uprefix">ignored_</str>
        <str name="update.chain">datafari-embed</str>
    </lst>
</requestHandler> {code}
{code:java}
<!-- Processor chain -->
<updateRequestProcessorChain name="datafari-embed">
    <processor 
class="solr.llm.textvectorisation.update.processor.TextToVectorUpdateProcessorFactory">
        <str name="inputField">embedded_content</str>
        <str name="outputField">${texttovector.outputfield:vector_1536}</str>
        <str name="model">${texttovector.model:default_model}</str>
    </processor>
    <processor 
class="com.francelabs.datafari.updateprocessor.VectorTaggerUpdateProcessorFactory">
        <str name="enabled">true</str>
        <str name="outputField">${texttovector.outputfield:vector_1536}</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>{code}
 * The *TextToVectorUpdateProcessor* takes the value of 
"{*}embedded_content{*}", sends it to the external embeddings model (here, I'm 
using our homemade [Datafari AI 
Agent|https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/3522854915/AI+Agent+-+API+documentation]).
 * The external embeddings service vectorizes the content of the chunks and 
returns the vector.
 * If the vectorisation is successful, the (homemade) 
VectorTaggerUpdateProcessor adds the name of the output vector field in the 
multivalued "{*}has_vector{*}" String field.
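
The atomic update job described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual Datafari job: the names `build_atomic_updates` and `post_updates` and the shape of the `chunks` input are assumptions. Only the collection name, the "/update/embed" handler path and the "embedded_content" field come from this report.

```python
import json
import urllib.request

# URL of the /update/embed handler on the VectorMain collection (local setup).
SOLR_UPDATE_URL = "http://localhost:8983/solr/VectorMain/update/embed?commit=true"


def build_atomic_updates(chunks):
    """Build one atomic "set" update per chunk.

    `chunks` is assumed to be an iterable of dicts with "id" and
    "embedded_content" keys, as retrieved from VectorMain.
    """
    return [
        {"id": c["id"], "embedded_content": {"set": c["embedded_content"]}}
        for c in chunks
    ]


def post_updates(updates, url=SOLR_UPDATE_URL):
    """POST the updates as JSON to the handler (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(updates).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Build the payload for a single chunk; nothing is sent to Solr here.
example = build_atomic_updates([
    {"id": "file://///localhost/dataset/my_document.txt_4",
     "embedded_content": "Lorem ipsum dolor sit amet"},
])
print(json.dumps(example, indent=2))
```

Calling `post_updates(example)` would send the payload to a running Solr instance. Building and sending are kept separate so the payload can be inspected first.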

h2. The problem

At first look, the workflow described above seems to work just fine. However, I 
noticed a significant issue: *the content received by the embeddings service is 
different from the expected one.*

See the {*}actual AI Agent logs{*}:
{code:java}
2025-08-07 14:50:44,951 - aiagent - INFO - Request received - POST /embeddings 
: 60
2025-08-07 14:50:44,951 - aiagent - DEBUG - Input query 60: {set=Lorem ipsum 
dolor sit amet, consectetur adipiscing elit. Aenean aliquet ... {code}
Here are the {*}expected logs{*}:
{code:java}
2025-08-07 14:50:44,951 - aiagent - INFO - Request received - POST /embeddings 
: 60
2025-08-07 14:50:44,951 - aiagent - DEBUG - Input query 60: Lorem ipsum 
dolor sit amet, consectetur adipiscing elit. Aenean aliquet ... {code}
(x) It appears that the *TextToVectorUpdateProcessor* uses the "raw value" of 
the embedded content ("{color:#0747a6}_{set=Lorem ipsum dolor...}_{color}") 
instead of the actual value ("{color:#0747a6}_Lorem ipsum dolor_{color}")
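
The logged value has the shape a Java `Map` takes when converted to a string (`{key=value}`). This suggests the processor stringifies the whole atomic update operation instead of the text inside it. The Python sketch below only mimics that effect for illustration. `java_style_map_string` is a hypothetical helper, not Solr code, and the assumption that the field value reaches the processor as a map is inferred from the log, not confirmed.

```python
def java_style_map_string(mapping):
    """Mimic how Java's Map.toString() renders a map: {key=value, ...}."""
    return "{" + ", ".join(f"{k}={v}" for k, v in mapping.items()) + "}"


# Field value as sent in the atomic update request.
field_value = {"set": "Lorem ipsum dolor sit amet"}

# What the embeddings service appears to receive: the whole operation as text.
received = java_style_map_string(field_value)

# What it was expected to receive: only the text carried by the "set" operation.
expected = field_value["set"]

print(received)
print(expected)
```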

 
h2. What does the documentation say?

According to the [TextToVectorUpdateProcessor 
documentation|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html#enriching-documents-with-vectors-at-indexing-time],
 it is possible to use atomic update for embeddings.

I tried to follow the instructions:
 * Using the existing /update/embed handler
 * Creating the "vectorised" field

{code:java}
<field name="vectorised" type="boolean" uninvertible="false" docValues="true" 
indexed="true" stored="false"/> {code}

 * Sending an atomic update on an existing (not embedded) document:

{code:java}
curl -X POST "http://localhost:8983/solr/VectorMain/update/embed?commit=true" \
  -H "Content-Type: application/json" \
  -d '[
    {
        "id": "file://///localhost/mini/loremipsum.txt_0",
        "vectorised":{"set":true}
    }
  ]' {code}

 

 

According to the 
[documentation|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html#enriching-documents-with-vectors-at-indexing-time],
 the update processor {*}should retrieve the value from the document's 
_embedded_content_{*}:
{quote}What will happen is that internally Solr fetches the stored content of 
the docs to update, all the existing fields are retrieved and a re-indexing 
happens, targeting the 'vectorisation' chain that will add the vector and set 
the boolean 'vectorised' field to 'true'.
{quote}
However, here, *(!) it does not (!).* 

The Solr response is OK. The request is logged:
{code:java}
INFO 2025-08-07T16:13:05Z (searcherExecutor-102-thread-5-processing-VectorMain_shard1_replica_n1 127.0.0.1-90 core_node2 127.0.0.1:8983_solr VectorMain shard1) - Solr|Solr|org.apache.solr.core.SolrCore|[VectorMain shard1 core_node2 VectorMain_shard1_replica_n1] o.a.s.c.SolrCore Registered new searcher autowarm time: 0 ms
INFO 2025-08-07T16:13:05Z (qtp1739267143-220-127.0.0.1-90) - Solr|Solr|org.apache.solr.update.processor.LogUpdateProcessorFactory|[VectorMain shard1 core_node2 VectorMain_shard1_replica_n1] o.a.s.u.p.LogUpdateProcessorFactory webapp=/solr path=/update/embed params={commit=true}{add=[file://///localhost/mini/loremipsum.txt_0 (1839813818592526336)], commit=} 0 71 {code}
However, if I don't provide the "embedded_content" in the request, the Update 
Processor ignores it and doesn't call the external service.

 

 
h2. Suggestions

I tried many things to fix these two issues. Maybe I'm missing an important 
point, but if I'm not, here are my suggestions.
 * Handle "atomically updated" fields as inputField in the 
*TextToVectorUpdateProcessor*.
 * Improve the processor to reload missing inputField from stored fields if not 
provided.
 * Alternatively, clarify documentation to indicate that partial updates must 
still include inputField
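
The first two suggestions could behave along these lines. This is a hedged Python sketch of the intended logic only, not a patch for the Java processor. `resolve_input_text` and its arguments are hypothetical names, and treating only the "set" operation as a source of text is an assumption.

```python
def resolve_input_text(update_doc, stored_doc, input_field):
    """Pick the text to embed for one document.

    update_doc: fields sent in the update request; values may be atomic
                operations such as {"set": "..."}.
    stored_doc: stored fields of the existing document, or None if unknown.
    """
    if input_field in update_doc:
        value = update_doc[input_field]
        if isinstance(value, dict):
            # Atomic update: use the value carried by "set". Other operations
            # (add, remove, ...) are not handled in this sketch.
            value = value.get("set")
        return None if value is None else str(value)
    # Field not part of the request: fall back to the stored value, if any.
    if stored_doc is not None and stored_doc.get(input_field) is not None:
        return str(stored_doc[input_field])
    return None


# Case 1: the input field is atomically updated in the request.
from_request = resolve_input_text(
    {"id": "doc_0", "embedded_content": {"set": "Lorem ipsum"}},
    None,
    "embedded_content",
)

# Case 2: the request only sets another field; the stored content is used.
from_stored = resolve_input_text(
    {"id": "doc_0", "vectorised": {"set": True}},
    {"id": "doc_0", "embedded_content": "Stored lorem ipsum"},
    "embedded_content",
)
print(from_request)
print(from_stored)
```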

 

If you have any questions or remarks, feel free to ask. I'm also open to any 
ideas or advice. Thanks for reading!


