[
https://issues.apache.org/jira/browse/SOLR-17843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18035780#comment-18035780
]
ilariapet commented on SOLR-17843:
----------------------------------
Is anyone already working on this? Otherwise, I will do it.
> TextToVectorUpdateProcessor does not work with partial update
> -------------------------------------------------------------
>
> Key: SOLR-17843
> URL: https://issues.apache.org/jira/browse/SOLR-17843
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: UpdateRequestProcessors, vector-search
> Affects Versions: 9.9
> Environment: I'm working on an *Ubuntu 22* VM, running a local
> Datafari 6.3-DEV server with {*}Solr 9.9{*}.
>
> Reporter: Emeric Bernet-Rollande
> Priority: Minor
> Attachments: solrconfig.xml
>
>
> Hi,
> I'm working on *Solr 9.9* and using the *TextToVectorUpdateProcessor* to
> enrich documents with semantic vectors. However, I am facing an issue when I
> try to use this processor with {_}*atomic updates*{_}.
>
> h2. Full context: Indexing / embeddings workflow
> Solr is installed as a component of a search engine, Datafari. In this
> scenario, Datafari crawls documents from a source (file share, web...) and
> indexes them in a *FileShare* collection.
> This FileShare collection has an Update Processor that splits each document
> into smaller subdocuments (chunks) and sends them to the *VectorMain*
> collection.
> Now, I need to vectorise the content of the chunks, using the
> [TextToVectorUpdateProcessor|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html].
> I used to call this processor in the main processor chain, so all incoming
> chunks were embedded. Most of the time, it worked well, but this solution has
> two major issues:
> * It significantly increases the indexing time
> * When an embedding fails for any reason (timeout, network error, LLM
> exception...), the associated chunk {*}is not indexed{*}.
> That is why I decided to decouple indexing from embedding using
> [Atomic
> Update|https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html].
> Here is the new workflow:
> * Chunks are indexed in the VectorMain collection without embeddings. The
> text content is stored in the "{_}*embedded_content*{_}" field.
> {code:java}
> <field name="embedded_content" type="text_general" indexed="true"
> stored="true" multiValued="false"/> {code}
> * Then, we manually run the {*}Atomic Updates Job{*}, which retrieves all
> the documents from VectorMain and sends an update request for each of them
> to the "{_}*/update/embed*{_}" handler. Here is what the requests look
> like:
> {code:java}
> [
> ....
> {
> "id": "file://///localhost/dataset/my_document.txt_4",
> "embedded_content": { "set": "Lorem ipsum dolor sit amet, consectetur
> adipiscing elit. Aenean aliquet quam sed convallis malesuada." }
> },
> ...
> ]{code}
> And here is the handler & processor chain:
> {code:java}
> <!-- Request handler -->
> <requestHandler class="solr.UpdateRequestHandler" name="/update/embed">
> <lst name="defaults">
> <str name="lowernames">true</str>
> <str name="fmap.language">ignored_</str>
> <str name="fmap.source">ignored_</str>
> <str name="fmap.version">ignored_</str>
> <str name="fmap._version_">ignored_</str>
> <str name="uprefix">ignored_</str>
> <str name="update.chain">datafari-embed</str>
> </lst>
> </requestHandler> {code}
> {code:java}
> <!-- Processor chain -->
> <updateRequestProcessorChain name="datafari-embed">
> <processor
> class="solr.llm.textvectorisation.update.processor.TextToVectorUpdateProcessorFactory">
> <str name="inputField">embedded_content</str>
> <str name="outputField">${texttovector.outputfield:vector_1536}</str>
> <str name="model">${texttovector.model:default_model}</str>
> </processor>
> <processor
> class="com.francelabs.datafari.updateprocessor.VectorTaggerUpdateProcessorFactory">
> <str name="enabled">true</str>
> <str name="outputField">${texttovector.outputfield:vector_1536}</str>
> </processor>
> <processor class="solr.LogUpdateProcessorFactory"/>
> <processor class="solr.DistributedUpdateProcessorFactory"/>
> <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>{code}
> * The *TextToVectorUpdateProcessor* takes the value of
> "{*}embedded_content{*}", sends it to the external embeddings model (here,
> I'm using our homemade [Datafari AI
> Agent|https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/3522854915/AI+Agent+-+API+documentation]).
> * The external embeddings service vectorises the content of the chunk and
> returns the vector.
> * If the vectorisation is successful, the (homemade)
> VectorTaggerUpdateProcessor adds the name of the output vector field in the
> multivalued "{*}has_vector{*}" String field.
>
> h2. The problem
> At first glance, the workflow described above seems to work just fine. However,
> I noticed a significant issue: *the content received by the embeddings
> service is different from the expected one.*
> See the {*}actual AI Agent logs{*}:
> {code:java}
> 2025-08-07 14:50:44,951 - aiagent - INFO - Request received - POST
> /embeddings : 60
> 2025-08-07 14:50:44,951 - aiagent - DEBUG - Input query 60: {set=Lorem ipsum
> dolor sit amet, consectetur adipiscing elit. Aenean aliquet ...
> {code}
> Here are the {*}expected logs{*}:
> {code:java}
> 2025-08-07 14:50:44,951 - aiagent - INFO - Request received - POST
> /embeddings : 60
> 2025-08-07 14:50:44,951 - aiagent - DEBUG - Input query 60: Lorem ipsum
> dolor sit amet, consectetur adipiscing elit. Aenean aliquet ...
> {code}
> (x) It appears that the *TextToVectorUpdateProcessor* uses the "raw value" of
> the embedded content ("{color:#0747a6}_{set=Lorem ipsum dolor...}_{color}")
> instead of the actual value ("{color:#0747a6}_Lorem ipsum dolor_{color}")
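
A minimal Java sketch of the symptom described above. It assumes that the atomic operation `{"set": ...}` reaches the processor as a `java.util.Map` field value and that the value is turned into a string before being sent to the model. Both points are assumptions consistent with the `{set=...}` text in the logs, not something confirmed from the Solr source here. `AtomicValueDemo`, `naiveText` and `unwrapSet` are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AtomicValueDemo {

    /** Text a client would send if it only stringified the raw field value. */
    public static String naiveText(Object fieldValue) {
        return String.valueOf(fieldValue);
    }

    /**
     * Hypothetical helper: if the value is a single-entry {"set": ...} map,
     * return the payload; otherwise return the value unchanged.
     */
    public static Object unwrapSet(Object fieldValue) {
        if (fieldValue instanceof Map) {
            Map<?, ?> op = (Map<?, ?>) fieldValue;
            if (op.size() == 1 && op.containsKey("set")) {
                return op.get("set");
            }
        }
        return fieldValue;
    }

    public static void main(String[] args) {
        // Stand-in for the parsed JSON value of "embedded_content".
        Map<String, Object> atomicOp = new LinkedHashMap<>();
        atomicOp.put("set", "Lorem ipsum dolor");

        System.out.println(naiveText(atomicOp));            // {set=Lorem ipsum dolor}
        System.out.println(naiveText(unwrapSet(atomicOp))); // Lorem ipsum dolor
    }
}
```

The first line printed matches the shape of the text received by the embeddings service. The second matches the expected input.
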
>
> h2. What does the documentation say?
> According to the [TextToVectorUpdateProcessor
> documentation|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html#enriching-documents-with-vectors-at-indexing-time],
> it is possible to use atomic update for embeddings.
> I tried to follow the instructions:
> * Using the existing /update/embed handler
> * Creating the "vectorised" field
> {code:java}
> <field name="vectorised" type="boolean" uninvertible="false" docValues="true"
> indexed="true" stored="false"/> {code}
> * Sending an atomic update on an existing (not embedded) document:
> {code:java}
> curl -X POST "http://localhost:8983/solr/VectorMain/update/embed?commit=true" \
> -H "Content-Type: application/json" \
> -d '[
> {
> "id": "file://///localhost/mini/loremipsum.txt_0",
> "vectorised":{"set":true}
> }
> ]' {code}
>
>
> According to the
> [documentation|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html#enriching-documents-with-vectors-at-indexing-time],
> the update processor {*}should retrieve the value from the document's
> _embedded_content_{*}:
> {quote}What will happen is that internally Solr fetches the stored content of
> the docs to update, all the existing fields are retrieved and a re-indexing
> happens, targeting the 'vectorisation' chain that will add the vector and set
> the boolean 'vectorised' field to 'true'.
> {quote}
> However, here, *(!) it does not (!).*
> The Solr response is OK. The request is logged:
> {code:java}
> INFO 2025-08-07T16:13:05Z
> (searcherExecutor-102-thread-5-processing-VectorMain_shard1_replica_n1
> 127.0.0.1-90 core_node2 127.0.0.1:8983_solr VectorMain shard1) -
> Solr|Solr|org.apache.solr.core.SolrCore|[VectorMain shard1 core_node2
> VectorMain_shard1_replica_n1] o.a.s.c.SolrCore Registered new searcher
> autowarm time: 0 ms
> INFO 2025-08-07T16:13:05Z
> (qtp1739267143-220-127.0.0.1-90) -
> Solr|Solr|org.apache.solr.update.processor.LogUpdateProcessorFactory|[VectorMain
> shard1 core_node2 VectorMain_shard1_replica_n1]
> o.a.s.u.p.LogUpdateProcessorFactory webapp=/solr path=/update/embed
> params={commit=true}{add=[file://///localhost/mini/loremipsum.txt_0
> (1839813818592526336)], commit=} 0 71 {code}
> However, when the request does not include "embedded_content", the Update
> Processor ignores the request and does not call the external service.
>
>
> h2. Suggestions
> I tried many things to fix these two issues. Maybe I'm missing an important
> point, but if I'm not, here are my suggestions.
> * Handle "atomically updated" fields as inputField in the
> *TextToVectorUpdateProcessor*.
> * Improve the processor to reload a missing inputField from stored fields
> if it is not provided in the request.
> * Alternatively, clarify the documentation to indicate that partial updates
> must still include inputField.
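
The first two suggestions could be combined roughly as in the sketch below. It is only an illustration under stated assumptions: plain maps stand in for the incoming partial update and for the stored document, and `InputFieldResolver` and `resolve` are hypothetical names, not Solr APIs. How a real processor would fetch the stored document inside Solr is deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class InputFieldResolver {

    /**
     * Hypothetical resolution order for the text to embed:
     * 1. a plain value sent in the request,
     * 2. the payload of a {"set": ...} atomic operation,
     * 3. the value already stored for the document.
     * Other atomic operations (add, remove, ...) are not handled in this sketch.
     */
    public static Optional<String> resolve(Map<String, Object> incomingDoc,
                                           Map<String, Object> storedDoc,
                                           String inputField) {
        Object value = incomingDoc.get(inputField);
        if (value instanceof Map) {
            // Atomic operation: keep only the payload of "set", if any.
            value = ((Map<?, ?>) value).get("set");
        }
        if (value == null && storedDoc != null) {
            // Field absent from the partial update: fall back to the stored value.
            value = storedDoc.get(inputField);
        }
        return Optional.ofNullable(value).map(Object::toString);
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new HashMap<>();
        stored.put("id", "doc_0");
        stored.put("embedded_content", "stored text");

        Map<String, Object> setOp = new HashMap<>();
        setOp.put("set", "new text");
        Map<String, Object> withSet = new HashMap<>();
        withSet.put("id", "doc_0");
        withSet.put("embedded_content", setOp);

        Map<String, Object> flagOnly = new HashMap<>();
        flagOnly.put("id", "doc_0");

        System.out.println(resolve(withSet, stored, "embedded_content"));  // Optional[new text]
        System.out.println(resolve(flagOnly, stored, "embedded_content")); // Optional[stored text]
        System.out.println(resolve(flagOnly, null, "embedded_content"));   // Optional.empty
    }
}
```

An empty result would correspond to the current behaviour, where the external service is not called at all.
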
>
> If you have any questions or remarks, feel free to ask. Also, I'm open to any
> ideas or advice. Thanks for reading!