Hello everyone, I am trying to clean source fields from HTML markup before indexing, using an Update Request Processor.
But no variation I try seems to work, and HTML markup is still being indexed. Would anyone have an idea about it? Thanks in advance!

*indexing command*

curl -X POST -H "Content-Type: application/csv" --data-binary @myfile.csv "http://localhost:8983/solr/mycore/update?commit=true"

*managed-schema.xml*

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer name="standard"/>
    <filter words="stopwords.txt" ignoreCase="true" name="stop"/>
    <filter name="lowercase"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer name="standard"/>
    <filter words="stopwords.txt" ignoreCase="true" name="stop"/>
    <filter name="synonymGraph" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter name="lowercase"/>
  </analyzer>
</fieldType>
<field name="body" type="text_pt" indexed="true" stored="true"/>
<copyField source="body" dest="catchall"/>

*solrconfig.xml*

<updateRequestProcessorChain>
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="typeClass">solr.TextField</str>
  </processor>
</updateRequestProcessorChain>

References

https://solr.apache.org/guide/solr/9_4/configuration-guide/update-request-processors.html
https://solr.apache.org/docs/9_4_1/core/org/apache/solr/update/processor/HTMLStripFieldUpdateProcessorFactory.html
https://solr.apache.org/docs/9_4_1/core/org/apache/solr/update/processor/FieldMutatingUpdateProcessorFactory.html
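
For comparison, the reference guide on update request processors describes chains that are named, selected explicitly, and terminated by a processor that actually performs the update. A minimal sketch of that shape follows. The chain name `html-strip` is a hypothetical placeholder, and marking the chain as the default is an assumption about the intended setup; this is not a confirmed fix for the configuration above.

```xml
<!-- Sketch only: the chain name "html-strip" is hypothetical, not taken from the original config. -->
<updateRequestProcessorChain name="html-strip" default="true">
  <!-- Strip HTML markup from the values of every field whose type class is solr.TextField. -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="typeClass">solr.TextField</str>
  </processor>
  <!-- Log each update command that passes through the chain. -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- RunUpdateProcessorFactory executes the update against the index, so it comes last. -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

If the chain is not marked as the default, it can instead be selected per request by adding `update.chain=html-strip` to the parameters of the update URL.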