Re: newbie lucene indexing/search question

Erick Erickson Fri, 29 Dec 2006 09:17:38 -0800

It all depends upon how you index it <G>.... There are at least three
approaches.


1> each paragraph is a distinct Lucene document. You'd also index some data
with each paragraph that allows you to reconstruct what book it came from.
What relevance means on a per-book basis is a question you need to give some
thought to, but searching on the text field will search all your paragraphs.
This assumes that each lucene document indexes the paragraph in a field
labeled "text".

2> You index all the paragraphs of a single book together in one lucene
document. This really has two variants:
2a> each paragraph gets its own field in the document, i.e. fields like
paragraph1, paragraph2. I wouldn't do this since the queries you have to
construct are ugly.
2b> you index all the paragraphs into a single field per book, call it
"text". Now, searching against the text field will search all the paragraphs
you put in there. BTW, the following are equivalent (and note it's pseudo
code)

doc = new doc();
doc.add("text", "some text");
doc.add("text", "more stuff");
writer.add(doc);

and

doc = new doc();
doc.add("text", "some text more stuff");
writer.add(doc);

You can play some games with the positions of the first word of each paragraph
if you need to know which paragraph the data was in (in the first form above
only). Search the mail archive for getPositionIncrementGap, a method on
Analyzer. The idea is that the position of the last token in, say,
paragraph 1 is 129. You can cause the first word of the next paragraph to have
a position of 1,000, say. This has some interesting ramifications about whether
you want, say, phrase searches to span paragraphs or not, but that's another
discussion....
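
To make the position idea concrete, here's a rough sketch in plain Java. It
is NOT Lucene code: the class PositionGapSketch and its method matchesPhrase
are made up for illustration, and the tokenizing is just lowercasing and
splitting on whitespace. It assumes each token advances the position by one
and that the gap is added only between values of the same field, which is
my understanding of what the position increment gap does.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of token positions when several values go into one field.
// Hypothetical names throughout; this does not use or mimic the Lucene API.
class PositionGapSketch {
    final List<String> terms = new ArrayList<>();
    final List<Integer> positions = new ArrayList<>();

    // gap: extra positions inserted between consecutive values (assumed behavior).
    PositionGapSketch(int gap, String... values) {
        int position = 0;
        for (String value : values) {
            // The gap applies only between values, never before the first token.
            if (!terms.isEmpty()) {
                position += gap;
            }
            for (String term : value.trim().toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                terms.add(term);
                positions.add(position++);
            }
        }
    }

    // A two-word phrase matches only if the words sit at adjacent positions.
    boolean matchesPhrase(String first, String second) {
        for (int i = 0; i + 1 < terms.size(); i++) {
            boolean sameWords = terms.get(i).equals(first) && terms.get(i + 1).equals(second);
            boolean adjacent = positions.get(i + 1) - positions.get(i) == 1;
            if (sameWords && adjacent) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String p1 = "This is the first paragraph for learn";
        String p2 = "guitar and other musical instruments.";
        // No gap: the two paragraphs behave like one run of text.
        System.out.println(new PositionGapSketch(0, p1, p2).matchesPhrase("learn", "guitar"));    // true
        // Large gap: the phrase can no longer span the paragraph boundary.
        System.out.println(new PositionGapSketch(1000, p1, p2).matchesPhrase("learn", "guitar")); // false
    }
}
```

With a gap of 0 "learn" lands at position 6 and "guitar" at 7, so the phrase
matches across the two paragraphs. With a gap of 1,000 "guitar" lands at
1007 and the phrase no longer matches.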

And be aware that by default Lucene only indexes the first 10,000 tokens in
a field. You can set this as high as you need to (see
IndexWriter.setMaxFieldLength) but you have to do it intentionally...

Best
Erick


On 12/28/06, moraleslos <[EMAIL PROTECTED]> wrote:



I currently have a book containing content that is stored in the database by
paragraph. For example, a book contains content with 5 paragraphs. Therefore
each paragraph is stored as a distinct record in a database. In the object
domain, I have a Book object which holds a java.util.List of Paragraph
objects. In the relational world, this would be a One-to-Many for
book-paragraph.

Now, if I search for specific words against the Book's contents, will it
retrieve all of the paragraphs, combine them and then do the search, or will
it only search on a paragraph? For example, a "Guitar" book contains two
paragraphs like this:

paragraph 1: This is the first paragraph for learn
paragraph 2: guitar and other musical instruments.

Therefore there will be a record in the Book table linked with two records
in the Paragraph table. Now say I index the book and paragraph fields as is
and then have a lucene query that looks like this:
[book:Guitar paragraph:"learn guitar"]. Will this query return a hit?

Thanks in advance!

-los
--
View this message in context:
http://www.nabble.com/newbie-lucene-indexing-search-question-tf2892417.html#a8080965
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

