Hello everyone,

I am trying to strip HTML markup from source fields before indexing, using
an Update Request Processor.

But none of the variations I have tried works, and the HTML markup is still
being indexed.

Does anyone have an idea what I might be missing?

Thanks in advance!

*indexing command*
curl -X POST -H "Content-Type: application/csv" --data-binary @myfile.csv "http://localhost:8983/solr/mycore/update?commit=true"

*managed-schema.xml*
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer name="standard"/>
    <filter words="stopwords.txt" ignoreCase="true" name="stop"/>
    <filter name="lowercase"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer name="standard"/>
    <filter words="stopwords.txt" ignoreCase="true" name="stop"/>
    <filter name="synonymGraph" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter name="lowercase"/>
  </analyzer>
</fieldType>
<field name="body" type="text_pt" indexed="true" stored="true"/>
<copyField source="body" dest="catchall"/>

*solrconfig.xml*
<updateRequestProcessorChain>
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="typeClass">solr.TextField</str>
  </processor>
</updateRequestProcessorChain>
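
For comparison, here is a minimal sketch of how a chain of this kind is often declared. It is not a confirmed fix. It assumes the chain needs either a name (the name "html-strip" below is hypothetical) or default="true" to be picked up for update requests, and that a custom chain should end with solr.RunUpdateProcessorFactory so that documents are actually written to the index. The typeClass selector is kept from the configuration above.

```xml
<!-- Sketch only: "html-strip" is a hypothetical chain name, not taken from my setup. -->
<updateRequestProcessorChain name="html-strip" default="true">
  <!-- Strip HTML markup from the values of fields whose fieldType class is solr.TextField -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="typeClass">solr.TextField</str>
  </processor>
  <!-- Optional: log each update command that passes through the chain -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- Assumed to be required last: hands the processed document to the index -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

With default="true" the chain would apply to every update request. Without it, the chain could presumably be selected per request by adding update.chain=html-strip to the update URL used in the indexing command.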

References
https://solr.apache.org/guide/solr/9_4/configuration-guide/update-request-processors.html
https://solr.apache.org/docs/9_4_1/core/org/apache/solr/update/processor/HTMLStripFieldUpdateProcessorFactory.html
https://solr.apache.org/docs/9_4_1/core/org/apache/solr/update/processor/FieldMutatingUpdateProcessorFactory.html
