On Tue, Oct 27, 2009 at 19:17, Erick Erickson <erickerick...@gmail.com> wrote:
> Unless I don't understand at all what you're going for, wouldn't
> it work to just put the HTML through some kind of parser (strict or
> loose depending on how well-formed your HTML is), then just
> extract the text from your document and push them into your
> Lucene document?  Various parsers make this more or less
> simple...

That's more or less what I was suggesting. The problem, as I see it, is that Lucene wants to do its own tokenizing step. I declared my IndexWriter like this:

    writer = new IndexWriter(IndexDirectory, new MySpecialAnalyzer(), true,
                             MaxFieldLength.UNLIMITED);

and the code in the MySpecialAnalyzer class is indeed called later on.
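To make the parse-then-extract step Erick describes concrete, here's a minimal plain-Java sketch. The regex stands in for a real HTML parser and only copes with simple, well-formed markup; the class and method names are just illustrative, not anything from Lucene:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class H1Extractor {
    // Collect the whitespace-separated tokens inside every <h1>...</h1> pair.
    // A regex stands in for a real HTML parser here.
    static List<String> insideH1(String html) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = Pattern.compile("<h1>(.*?)</h1>", Pattern.DOTALL).matcher(html);
        while (m.find()) {
            for (String t : m.group(1).trim().split("\\s+")) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // All text with the tags stripped out, whitespace-tokenized.
    static List<String> allText(String html) {
        String stripped = html.replaceAll("<[^>]+>", " ");
        List<String> tokens = new ArrayList<String>();
        for (String t : stripped.trim().split("\\s+")) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String doc = "<h1>Section title</h1>\nBody content";
        System.out.println(insideH1(doc)); // [Section, title]
        System.out.println(allText(doc));  // [Section, title, Body, content]
    }
}
```

The question that remains is how to hand those pre-computed token lists to Lucene without it re-tokenizing them.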
So, I think this approach:

> domObj = parse(htmldocument);
> Document lucDoc = new Document();
> lucDoc.add("insideh1", domObj.getText(<dom path to H1>)); (etc)

won't work, because when I put that text in, it'll be analyzed again. Perhaps I'll write a ZeroSplittingAnalyzer or something, do all the work before I give anything to Lucene, then '\0'-join my tokens and feed them to that simple analyzer. So something like this:

    Document doc = new Document();
    doc.add(new Field("h1", "hello\0world",
                      Field.Store.NO, Field.Index.ANALYZED));
    doc.add(new Field("alltext", "hello\0world\0goodnight\0moon",
                      Field.Store.NO, Field.Index.ANALYZED));

I think that makes sense. Comments?

Will

> HTH
> Erick
>
> On Tue, Oct 27, 2009 at 6:50 PM, Will Murnane <will.murn...@gmail.com> wrote:
>
>> Hello list,
>>   I have some semi-structured text that has some markup elements, and
>> I want to put those elements into a separate field so I can search by
>> them.  For example (using HTML syntax):
>> ---- 8< ---- document
>> <h1>Section title</h1>
>> Body content
>> ---- >8 ----
>> I can find that the things inside <h1>s are "Section" and "title", and
>> "Body" and "content" are outside.  I want to create two fields for
>> this document:
>> insideh1 -> "Section", "title"
>> alltext -> "Section", "title", "Body", "content"
>>
>> What's the best way to approach this?  My initial thought is to make
>> some kind of MultiAnalyzer that consumes the text and produces several
>> token streams, which are added to the document one at a time.  Is that
>> a reasonable strategy?
>>
>> Thanks!
>> Will

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
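P.S. A quick sanity check of the '\0'-join scheme above, in plain Java without any Lucene dependencies. The class name is made up; the split method is what a ZeroSplittingAnalyzer would effectively do:

```java
public class ZeroJoinDemo {
    // Join pre-tokenized terms with NUL so a trivial analyzer can
    // split them back apart without re-tokenizing.
    static String join(String... tokens) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append('\0');
            }
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    // What the ZeroSplittingAnalyzer would do: split on NUL only,
    // leaving each pre-computed token untouched.
    static String[] split(String joined) {
        return joined.split("\0", -1);
    }

    public static void main(String[] args) {
        String all = join("hello", "world", "goodnight", "moon");
        // Round trip: the original tokens come back unchanged.
        System.out.println(split(all).length); // 4
        System.out.println(split(all)[3]);     // moon
    }
}
```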