[jira] [Commented] (SOLR-17843) TextToVectorUpdateProcessor does not work with partial update

ilariapet (Jira) Thu, 06 Nov 2025 01:46:04 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-17843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18035780#comment-18035780
 ]


ilariapet commented on SOLR-17843:
----------------------------------

Is anyone already working on this? Otherwise, I will do it.

> TextToVectorUpdateProcessor does not work with partial update
> -------------------------------------------------------------
>
>                 Key: SOLR-17843
>                 URL: https://issues.apache.org/jira/browse/SOLR-17843
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: UpdateRequestProcessors, vector-search
>    Affects Versions: 9.9
>         Environment: I'm working on an *Ubuntu 22* VM, running a local 
> Datafari 6.3-DEV server with {*}Solr 9.9{*}.
>  
>            Reporter: Emeric Bernet-Rollande
>            Priority: Minor
>         Attachments: solrconfig.xml
>
>
> Hi,
> I'm working on *Solr 9.9* and using the *TextToVectorUpdateProcessor* to 
> enrich documents with semantic vectors. However, I am facing an issue when I 
> try to use this processor with {_}*atomic updates*{_}.
>  
> h2. Full context: Indexing / embeddings workflow
> Solr is installed as a component of a search engine, Datafari. In this 
> scenario, Datafari crawls documents from a source (file share, web...) and 
> indexes them in a *FileShare* collection.
> This FileShare collection has an Update Processor that chunks all documents 
> into smaller subdocuments (chunks) and sends them to the *VectorMain* 
> collection.
> Now, I need to vectorise the content of the chunks, using the 
> [TextToVectorUpdateProcessor|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html].
> I used to call this processor in the main processor chain, so all incoming 
> chunks were embedded. Most of the time, it worked well, but this solution has 
> two major issues:
>  * It significantly increases the indexing time
>  * When an embedding fails for any reason (timeout, network error, LLM 
> exception...), the associated chunk {*}was not indexed{*}.
> That is why I decided to dissociate the indexing from embeddings using 
> [Atomic 
> Update|https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html].
> Here is the new workflow:
>  * Chunks are indexed in the VectorMain collection without embeddings. The 
> text content is stored in the "{_}*embedded_content*{_}" field.
> {code:java}
> <field name="embedded_content" type="text_general" indexed="true" 
> stored="true" multiValued="false"/> {code}
>  * Then, we manually run the {*}Atomic Updates Job{*}, which retrieves all 
> the documents from VectorMain and sends an update request for each of them 
> using the "{_}*/update/embed*{_}" handler. Here is what the requests look 
> like: 
> {code:java}
> [
>     ....
>     {
>         "id": "file://///localhost/dataset/my_document.txt_4",
>         "embedded_content": { "set": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean aliquet quam sed convallis malesuada." }
>     },
>     ...
> ]{code}
>  And here is the handler & processor chain:
> {code:java}
> <!-- Request handler -->
> <requestHandler class="solr.UpdateRequestHandler" name="/update/embed">
>     <lst name="defaults">
>         <str name="lowernames">true</str>
>         <str name="fmap.language">ignored_</str>
>         <str name="fmap.source">ignored_</str>
>         <str name="fmap.version">ignored_</str>
>         <str name="fmap._version_">ignored_</str>
>         <str name="uprefix">ignored_</str>
>         <str name="update.chain">datafari-embed</str>
>     </lst>
> </requestHandler> {code}
> {code:java}
> <!-- Processor chain -->
> <updateRequestProcessorChain name="datafari-embed">
>     <processor 
> class="solr.llm.textvectorisation.update.processor.TextToVectorUpdateProcessorFactory">
>         <str name="inputField">embedded_content</str>
>         <str name="outputField">${texttovector.outputfield:vector_1536}</str>
>         <str name="model">${texttovector.model:default_model}</str>
>     </processor>
>     <processor 
> class="com.francelabs.datafari.updateprocessor.VectorTaggerUpdateProcessorFactory">
>         <str name="enabled">true</str>
>         <str name="outputField">${texttovector.outputfield:vector_1536}</str>
>     </processor>
>     <processor class="solr.LogUpdateProcessorFactory"/>
>     <processor class="solr.DistributedUpdateProcessorFactory"/>
>     <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>{code}
>  * The *TextToVectorUpdateProcessor* takes the value of 
> "{*}embedded_content{*}", sends it to the external embeddings model (here, 
> I'm using our homemade [Datafari AI 
> Agent|https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/3522854915/AI+Agent+-+API+documentation]).
>  * The external embeddings service vectorizes the content of the chunk and 
> returns the vector.
>  * If the vectorisation is successful, the (homemade) 
> VectorTaggerUpdateProcessor adds the name of the output vector field in the 
> multivalued "{*}has_vector{*}" String field.
> h2. The problem
> At first glance, the workflow described above seems to work just fine. 
> However, I noticed a significant issue: *the content received by the 
> embeddings service is different from the expected one.*
> See the {*}actual AI Agent logs{*}:
> {code:java}
> 2025-08-07 14:50:44,951 - aiagent - INFO - Request received - POST /embeddings : 60
> 2025-08-07 14:50:44,951 - aiagent - DEBUG - Input query 60: {set=Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean aliquet ...
> {code}
> Here are the {*}expected logs{*}:
> {code:java}
> 2025-08-07 14:50:44,951 - aiagent - INFO - Request received - POST /embeddings : 60
> 2025-08-07 14:50:44,951 - aiagent - DEBUG - Input query 60: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean aliquet ...
> {code}
> (x) It appears that the *TextToVectorUpdateProcessor* uses the "raw value" of 
> the embedded content ("{color:#0747a6}_{set=Lorem ipsum dolor...}_{color}") 
> instead of the actual value ("{color:#0747a6}_Lorem ipsum dolor_{color}")
>  
> h2. What does the doc say?
> According to the [TextToVectorUpdateProcessor 
> documentation|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html#enriching-documents-with-vectors-at-indexing-time],
>  it is possible to use atomic updates for embeddings.
> I tried to follow the instructions:
>  * Using the existing /update/embed handler
>  * Creating the "vectorised" field
> {code:java}
> <field name="vectorised" type="boolean" uninvertible="false" docValues="true" 
> indexed="true" stored="false"/> {code}
>  * Sending an atomic update on an existing (not embedded) document:
> {code:java}
> curl -X POST "http://localhost:8983/solr/VectorMain/update/embed?commit=true" \
>   -H "Content-Type: application/json" \
>   -d '[
>     {
>       "id": "file://///localhost/mini/loremipsum.txt_0",
>       "vectorised":{"set":true}
>     }
>   ]' {code}
>  
>  
> According to the 
> [documentation|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html#enriching-documents-with-vectors-at-indexing-time],
>  the update processor {*}should retrieve the value from the document's 
> _embedded_content_{*}:
> {quote}What will happen is that internally Solr fetches the stored content of 
> the docs to update, all the existing fields are retrieved and a re-indexing 
> happens, targeting the 'vectorisation' chain that will add the vector and set 
> the boolean 'vectorised' field to 'true'.
> {quote}
> However, here, *(!) it does not (!).* 
> The Solr response is OK. The request is logged:
> {code:java}
>  INFO 2025-08-07T16:13:05Z (searcherExecutor-102-thread-5-processing-VectorMain_shard1_replica_n1 127.0.0.1-90 core_node2 127.0.0.1:8983_solr VectorMain shard1) - Solr|Solr|org.apache.solr.core.SolrCore|[VectorMain shard1 core_node2 VectorMain_shard1_replica_n1] o.a.s.c.SolrCore Registered new searcher autowarm time: 0 ms
>  INFO 2025-08-07T16:13:05Z (qtp1739267143-220-127.0.0.1-90) - Solr|Solr|org.apache.solr.update.processor.LogUpdateProcessorFactory|[VectorMain shard1 core_node2 VectorMain_shard1_replica_n1] o.a.s.u.p.LogUpdateProcessorFactory webapp=/solr path=/update/embed params={commit=true}{add=[file://///localhost/mini/loremipsum.txt_0 (1839813818592526336)], commit=} 0 71
> {code}
> However, if I don't provide "embedded_content" in the request, the Update 
> Processor ignores the request and doesn't call the external service.
>  
>  
> h2. Suggestions
> I tried many things to fix these two issues. Maybe I'm missing an important 
> point, but if I'm not, here are my suggestions.
>  * Handle "atomically updated" fields as inputField in the 
> *TextToVectorUpdateProcessor*.
>  * Improve the processor to reload a missing inputField from stored fields 
> if it is not provided.
>  * Alternatively, clarify the documentation to indicate that partial updates 
> must still include inputField.
>  
> If you have any questions or remarks, feel free to ask. Also, I'm open to any 
> ideas or advice. Thanks for reading!

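A note on the "{set=Lorem ipsum ...}" input reported in the quoted issue: that shape matches what Java prints when a Map holding an atomic-update operation is converted to a string. The sketch below only illustrates that assumption. It has not been checked against the actual TextToVectorUpdateProcessor source, and AtomicValueDemo and unwrapSet are hypothetical names, not Solr APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AtomicValueDemo {

    // Hypothetical helper (not part of Solr): if the field value is an
    // atomic-update map such as {"set": "text"}, return the wrapped value;
    // otherwise return the value unchanged.
    public static Object unwrapSet(Object fieldValue) {
        if (fieldValue instanceof Map) {
            Map<?, ?> ops = (Map<?, ?>) fieldValue;
            if (ops.containsKey("set")) {
                return ops.get("set");
            }
        }
        return fieldValue;
    }

    public static void main(String[] args) {
        // Models the value of "embedded_content" in the atomic update request.
        Map<String, Object> atomic = new LinkedHashMap<>();
        atomic.put("set", "Lorem ipsum dolor sit amet");

        // Stringifying the raw value reproduces the shape seen in the logs.
        System.out.println(String.valueOf(atomic)); // {set=Lorem ipsum dolor sit amet}

        // Unwrapping first yields the text the embeddings service expects.
        System.out.println(unwrapSet(atomic)); // Lorem ipsum dolor sit amet
    }
}
```

If the processor does read the raw field value this way, unwrapping the "set" operation before calling the model would correspond to the first suggestion in the issue.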

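The second suggestion in the quoted issue (reload a missing inputField from stored fields) can be sketched with plain maps. This is a simplified model under stated assumptions, not Solr code: "incoming" stands for the fields of the update request, "stored" for the stored fields of the existing document, and InputFieldFallbackDemo and resolveText are hypothetical names. Only the "set" operation is handled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class InputFieldFallbackDemo {

    // Hypothetical: decide which text should be sent to the embeddings model.
    // Prefers the value in the request (unwrapping {"set": ...} if present),
    // then falls back to the stored value of the existing document.
    public static Optional<String> resolveText(Map<String, Object> incoming,
                                               Map<String, Object> stored,
                                               String inputField) {
        Object value = incoming.get(inputField);
        if (value instanceof Map) {
            value = ((Map<?, ?>) value).get("set");
        }
        if (value == null) {
            value = stored.get(inputField);
        }
        return Optional.ofNullable(value).map(Object::toString);
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new HashMap<>();
        stored.put("embedded_content", "Lorem ipsum dolor sit amet");

        // Models the request from the issue: only "vectorised" is set.
        Map<String, Object> setTrue = new HashMap<>();
        setTrue.put("set", true);
        Map<String, Object> incoming = new HashMap<>();
        incoming.put("id", "loremipsum.txt_0");
        incoming.put("vectorised", setTrue);

        // The input field is absent from the request, so the stored text is used.
        System.out.println(
            resolveText(incoming, stored, "embedded_content").orElse("<nothing to embed>"));
    }
}
```

An empty Optional models the case where neither the request nor the stored document has the input field, in which case skipping the embedding call is a reasonable default.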

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

