I think you can use the HTMLStripWhitespaceTokenizerFactory. Look here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e

I hope this helps.

On 27/08/07, Michael Kimsal <[EMAIL PROTECTED]> wrote:
>
> Hello
>
> I'm trying to index individual lines of an HTML file, and I'm hitting this
> error:
>
> TEXT must be immediately followed by END_TAG and not START_TAG
>
> I've got something that looks like
>
> <add>
> <doc>
> <field name="id">4</field>
> <field name="line"><a href="foobar"><b><i>linktext</i></b></a></field>
> </doc>
> </add>
>
> Actually, that sample code above, as its own data file POSTed to SOLR,
> throws
>
> parser must be on START_TAG or TEXT to read text (position: START_TAG seen
> ...<field name="line"><a href="foobar">... @4:37
>
> as an error.
>
> Any clues as to how I can do this? I'd like to keep the original copy of
> each line intact in the index.
>
> Thanks!
>
> --
> Michael Kimsal
> http://webdevradio.com
>
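
A note on the parser error in the quoted message: it most likely comes from the raw HTML tags sitting unescaped inside the <field> element, so the XML parser sees <a> as a nested element instead of text. Escaping the markup (or wrapping it in a CDATA section) before POSTing should let the update through, and the stored value would still come back as the original line. Below is a minimal sketch in Python, assuming the documents are built as strings on the client side; build_add_doc is a hypothetical helper name, not part of Solr.

```python
from xml.sax.saxutils import escape


def build_add_doc(doc_id, line):
    """Build a Solr <add> message with the field text XML-escaped.

    build_add_doc is a hypothetical helper for illustration only.
    escape() replaces &, < and > with entities, so the HTML in `line`
    is sent as plain text rather than as nested XML elements.
    """
    return (
        "<add><doc>"
        f'<field name="id">{escape(str(doc_id))}</field>'
        f'<field name="line">{escape(line)}</field>'
        "</doc></add>"
    )


# The sample line from the quoted message.
payload = build_add_doc(4, '<a href="foobar"><b><i>linktext</i></b></a>')
print(payload)
```

The tokenizer suggested above would then only affect how the line is analyzed for searching; the stored copy of the field should keep the HTML as it was submitted.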
