Hello Erick,

I have been trying on Google Books some scenarios and apparently found a Google bug ...
It looks like they use number 2 approach, as this query illustrates it.

http://books.google.com/books?vid=ISBN1564968316&id=14Xx2T8tmMYC&pg=PA8&lpg=PA8&dq=%2B%22the+site+is+unburdened%22&sig=QRJSkKNLm0JlbkcWe2m1-y8YYz0

The phrase returns 2 hits, but if you look at the documents, only in the first one the phrase is visible.

Anyway, it makes possible finding something like:

http://books.google.com/books?q=%22sense+of+dissatisfaction+with+existing+elements%22&btnG=Search+Books
The returned page is the first one on which the phrase spans (but no more highlighting).

It seems we are really close to a good solution, now looking for a way to implementing it in terms of index structure.

Thanks again,
Mile Rosu


Erick Erickson wrote:
I can think of several approaches, but the experts will no doubt show me up
<G>..

1> index the entire book as a single document. Also, index the beginning and
ending offset of each page in separate "documents". Assuming you can find
the offset in the big doc of each matching phrase, you can also find out
what pages each match starts on and ends on, and if they are different you'd
know to display two pages. Not sure what this does to relevancy.......

2> Index, say, the 10 words on the previous page and 10 words on the next
page with the current page. You'd have to make sure your match wasn't
entirely within the 10 words you prepended or appended to the "match" page
(again by match position) when you returned data.

3> Have a series of "joiner" "documents". One for the 9 words of page n, and
9 words of  page n + 1 (along with the page number). Another set for 8
before and 8 after. etc. down to 1. If your phrase was 10 words, you'd
search your normal pages, and the 9 word "joiner" pages. Any match in the
joiners would be a page spanner. Again, what does that do to relevancy?


Note that there is no requirement that every document have the same fields, so your searches can be disjoint. Also, I'm assuming that you can reasonably decide that, say, 10 word phrases are the max you'll respect, which may not
be true.

I have no idea whether these are reasonable approaches given your problem
domain....

Best
Erick



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to