Re: Tokenizing XML

2010-10-15 Thread Erick Erickson
Well, it's hard to say what "correctly" would be. Remove all XML? Preserve attributes? Preserve tags? Put the attributes and values into fields in the document? My point is that there's no obviously "correct" parsing. But if you just want to strip out all the <> markup, it seems like PatternTokenizer
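
To make the "just strip out all the <>" option concrete, here is a rough sketch of the idea using a plain regular expression. It is written in Python purely for illustration and is not Lucene's PatternTokenizer or any Lucene API. The function name strip_and_tokenize and the sample markup are made up for this example (the XML from the original question did not survive in the archive), and the pattern assumes that everything between '<' and '>' is a tag.

```python
import re

# Assumption: a "tag" is anything from '<' up to the next '>'.
# Comments, CDATA sections, and '>' inside attribute values are not handled.
TAG = re.compile(r"<[^>]*>")


def strip_and_tokenize(xml_text):
    """Drop angle-bracket markup, then split what is left on whitespace."""
    # Replace each tag with a space so words on either side of a tag
    # do not get glued together into one token.
    text = TAG.sub(" ", xml_text)
    return text.split()


if __name__ == "__main__":
    # Hypothetical sample input, not the XML from the original post.
    sample = "<doc>this is <b>example</b>text.</doc>"
    print(strip_and_tokenize(sample))
```

Replacing tags with a space rather than deleting them is a deliberate choice here: it treats a tag boundary as a token boundary, so "example" and "text." come out as separate tokens. Whether that is the "correct" behaviour depends on the questions raised above.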

Tokenizing XML

2010-10-15 Thread Christoph Hermann
Hi, is there a Tokenizer in Lucene that tokenizes XML correctly? I.e., one that, given the following XML: this is exampletext. produces tokens like these (or similar): | this | is | | example | | text. | Or would I need to write such a Tokenizer myself? Regards, Christoph Hermann -- Christoph Hermann Inst