Hello everyone,

 

I hope I’m posting this to the right place. I’ve been working extensively
with the TextToVector features from the Solr-LLM module. I use the Update
Processor to embed chunks from crawled documents into a search index.

For performance reasons, I had to rework my process to use atomic updates
instead of embedding all documents at indexing time.

 

Here is the processor chain I use (in solrconfig.xml): 

 

    <updateRequestProcessorChain name="datafari-embed">
        <processor class="solr.llm.texttovector.update.processor.TextToVectorUpdateProcessorFactory">
            <str name="inputField">embedded_content</str>
            <str name="outputField">${texttovector.outputfield:vector_1536}</str>
            <str name="model">${texttovector.model:default_model}</str>
        </processor>
        <processor class="com.francelabs.datafari.updateprocessor.TextToVectorUpdateProcessorFactory">
            <str name="enabled">true</str>
            <str name="outputField">${texttovector.outputfield:vector_1536}</str>
        </processor>
        <processor class="solr.LogUpdateProcessorFactory"/>
        <processor class="solr.DistributedUpdateProcessorFactory"/>
        <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>
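
For completeness, the /update/embed endpoint mentioned below is bound to this chain with a standard handler declaration along these lines (simplified here, the actual definition may differ slightly):

    <requestHandler name="/update/embed" class="solr.UpdateRequestHandler">
        <lst name="defaults">
            <str name="update.chain">datafari-embed</str>
        </lst>
    </requestHandler>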

 

Here is the workflow:

* A background job crawls the collection and sends an atomic update request for each document.
* These requests target the /update/embed endpoint, which uses the processor chain above.
* Each request is processed, embedded_content is embedded, and the resulting vector is stored in the outputField (a knn dense vector field).

 

Here is an example of an atomic update request, similar to those generated by
the job:

 

curl -X POST "http://localhost:8983/solr/VectorMain/update/embed" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "id": "file://///localhost/mini/helloworld.txt_0",
      "embedded_content": { "set": "Hello world" }
    }
  ]'

 

I use the langchain4j OpenAI integration to call my own LLM API to compute
the embeddings. However, the embedding model receives "{set=Hello world}"
instead of just "Hello world", which breaks the semantic vector generation.
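
For context, the embedding model is declared in the text-to-vector model store roughly like this (the values below are placeholders for my own API, not the real ones):

{
  "class": "dev.langchain4j.model.openai.OpenAiEmbeddingModel",
  "name": "default_model",
  "params": {
    "baseUrl": "http://my-llm-api:8080/v1",
    "apiKey": "placeholder",
    "modelName": "my-embedding-model"
  }
}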

 

For now, I am using Solr 9.8. I saw that the Solr 9.9 documentation mentions
partial updates for vector embeddings
(https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html).
Has this issue been fixed in 9.9? Is there a recommended workaround or
patch to ensure that only the string value is passed to the embedding model,
and not the atomic update syntax itself?
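
To make the question concrete, the kind of unwrapping I have in mind as a workaround would be an extra processor, placed before the TextToVector one, that replaces the atomic-update map with its plain string value. This is only a rough sketch (untested, class name invented on my side, and I am aware it might interfere with the atomic update handling further down the chain):

import java.io.IOException;
import java.util.Map;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Hypothetical pre-processor: unwraps {set=...} on the input field so the
// TextToVector processor only sees the raw text, not the atomic-update map.
public class UnwrapAtomicSetProcessor extends UpdateRequestProcessor {

    private final String inputField;

    public UnwrapAtomicSetProcessor(String inputField, UpdateRequestProcessor next) {
        super(next);
        this.inputField = inputField;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object value = doc.getFieldValue(inputField);
        // Atomic updates arrive as a Map, e.g. {set=Hello world}
        if (value instanceof Map) {
            Object set = ((Map<?, ?>) value).get("set");
            if (set != null) {
                doc.setField(inputField, set.toString());
            }
        }
        super.processAdd(cmd);
    }
}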

 

Thank you!

 

Kind regards,

 

Emeric Bernet-Rollande 
France Labs – Your knowledge, now

Datafari Enterprise Search - Find us at the Big Data & IA trade show
<https://www.bigdataparis.com/> on October 1 and 2 in Paris, stand C31

 
