On Tue, Oct 27, 2009 at 21:21, Jake Mannix <jake.man...@gmail.com> wrote: > On Tue, Oct 27, 2009 at 6:12 PM, Erick Erickson > <erickerick...@gmail.com>wrote: > >> Could you go into your use case a bit more? Because I'm confused. >> Why don't you want your text tokenized? You say you want to search it, >> which means you have to analyze it. > > > I think Will is suggesting that he doesn't want to have to analyze it > *again* - > if he really has different fields for every tag type, it would get > prohibitively > expensive in terms of Indexing CPU usage to retokenize over and over > again. > > Is that what your concern is, Will? More or less. Different types of tags need different tokenization: just as an example, I want to parse an img tag which contains a src attribute as a URL, and tokenize the URL as such (i.e., even if there are spaces they're treated as a unit), but the contents of a paragraph must be tokenized as English text.
So I think the solution (because there's only one Analyzer per IndexWriter, and thus per document) is to do all the field-type-specific stuff outside of Lucene, and then use a very generic Analyzer, like the "\0"-splitter mentioned above. On Tue, Oct 27, 2009 at 21:12, Erick Erickson <erickerick...@gmail.com> wrote: > If you need different analyzers for each field, see PerFieldAnalyzerWrapper. That's very close to what I need, but I don't think it lines up quite right. When I find some tokens inside an h1 tag (assume for simplicity that I only need to consider the innermost tag around a particular element) they won't be in the category for things-inside-h2-tags. So I think trying to find all the things that are in h1 tags in one pass through the DOM tree, then things in h2 tags in another, and so forth, will be slower than traversing the tree once and filing everything in its place myself, then feeding each list into Lucene as a field. So, in other words, I think using an individual Analyzer for each type of tag will be inefficient, so I'll run one big Analyzer, then put its results into Lucene. Will --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org