Re: A question regarding the setSlop method of class PhraseQuery (Lucene version 3.0.1)

2010-06-28 Thread a peng
Hi Erick, Thanks for your reply; now I see why I cannot get the search result. But can you guide me on how I can use Lucene to implement the following search feature? Basically, we can call this feature "fuzzy phrase search", which means the search phrase may contain more words or less word

Lucene In Action free chapter on CLucene

2010-06-28 Thread Itamar Syn-Hershko
Hi, Just to let everyone know Manning have released an extra chapter from the excellent LIA 2E book, discussing CLucene - the C++ port of Lucene. It is available for free at http://www.code972.com/blog/2010/06/lucene-in-action-free-chapter-coupon-code/. 35% discount for CLucene users is av

Re: header/footer identification and general scaping tools

2010-06-28 Thread Simon Willnauer
Boris, you might wanna look at http://code.google.com/p/boilerpipe/ simon On Mon, Jun 28, 2010 at 10:48 PM, Boris Aleksandrovsky wrote: > Thanks, Shashi, I am asking more about a general library which will remove > those HTML elements which are unwanted/useless for indexing. For instance, we > are

Re: header/footer identification and general scaping tools

2010-06-28 Thread Boris Aleksandrovsky
Thanks, Shashi. I am asking more about a general library that removes HTML elements which are unwanted or useless for indexing. For instance, we are using a general method to remove headers by comparing the structure of HTML on the top-level document from the site (e.g. www.nytimes.com) and t

Re: header/footer identification and general scaping tools

2010-06-28 Thread Shashi Kant
I have used TagSoup to parse the HTML and get the elements of interest. http://ccil.org/~cowan/XML/tagsoup/ On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky wrote: > I was wondering if any of you know of any open-source solutions for general > issues which arise in web crawling - how do yo

header/footer identification and general scaping tools

2010-06-28 Thread Boris Aleksandrovsky
I was wondering if any of you know of any open-source solutions for general issues which arise in web crawling: how do you remove headers, footers, and JavaScript, and generally clean up the HTML of a web page before indexing? We have a first-pass solution implemented using custom code, but this must be a pro

Re: A question regarding the setSlop method of class PhraseQuery (Lucene version 3.0.1)

2010-06-28 Thread Erick Erickson
No, I don't think so. The critical bit is that the indexed text does NOT contain the word "formal". So searching for any phrase that DOES contain "formal" should fail no matter what the slop. Phrase queries are something like "find all the words in this search string, ignoring some number of inte

Re: A question regarding the setSlop method of class PhraseQuery (Lucene version 3.0.1)

2010-06-28 Thread tarun sapra
Hey Erick Thanks mate! So I guess my explanation in the mail chain above was correct! On Mon, Jun 28, 2010 at 6:20 AM, Erick Erickson wrote: > I think you're misunderstanding the intent of PhraseQueries and slop. Slop > is the number of intervening tokens that may exist between the words > you'

Re: A question regarding the setSlop method of class PhraseQuery (Lucene version 3.0.1)

2010-06-28 Thread Erick Erickson
I think you're misunderstanding the intent of PhraseQueries and slop. Slop is the number of intervening tokens that may exist between the words you're looking for. However, all the words you're looking for MUST exist. So, <<< whenever the search phrase contains a word that doesn't exist in the docum
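The rule described in this message (every phrase word must be present, and slop only bounds how many other tokens may sit between them) can be illustrated with a small stand-alone sketch. This is not Lucene code and not how PhraseQuery is implemented: the class `SloppyPhraseSketch` and its methods are hypothetical names invented for illustration, and the model is deliberately simplified to in-order matching on a plain token array. Lucene's actual slop is an edit distance over term positions and also tolerates some reordering, which this sketch does not attempt to reproduce.

```java
// Hypothetical, simplified model of sloppy phrase matching (NOT Lucene's implementation).
// Assumption: terms must appear in the same order as in the phrase.
final class SloppyPhraseSketch {

    // Returns true if every phrase term occurs in doc, in order, with at most
    // `slop` other tokens in total between the first and last matched term.
    static boolean matches(String[] doc, String[] phrase, int slop) {
        if (phrase.length == 0) {
            return false;
        }
        for (int start = 0; start < doc.length; start++) {
            if (!doc[start].equals(phrase[0])) {
                continue;
            }
            // From this start, greedily take the earliest occurrence of each later term.
            int pos = start;
            boolean found = true;
            for (int i = 1; i < phrase.length && found; i++) {
                pos = indexOfFrom(doc, phrase[i], pos + 1);
                found = pos >= 0;
            }
            if (!found) {
                continue; // a required term is missing after this start; slop cannot help
            }
            // Tokens inside the matched span that are not phrase terms.
            int intervening = (pos - start) - (phrase.length - 1);
            if (intervening <= slop) {
                return true;
            }
        }
        return false;
    }

    // Index of the first occurrence of term at or after `from`, or -1 if absent.
    static int indexOfFrom(String[] doc, String term, int from) {
        for (int i = from; i < doc.length; i++) {
            if (doc[i].equals(term)) {
                return i;
            }
        }
        return -1;
    }
}
```

With the tokens "the quick brown fox jumps", the phrase "quick fox" fails at slop 0 and succeeds at slop 1, because one token ("brown") intervenes. The phrase "quick formal fox" fails at any slop, since "formal" is not among the tokens at all; that mirrors the empty result reported earlier in the thread.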

Re: A question regarding the setSlop method of class PhraseQuery (Lucene version 3.0.1)

2010-06-28 Thread a peng
Hi, My test result is that whenever the search phrase contains a word that doesn't exist in the document, the search result is empty no matter how big a slop factor I set. Is this a bug in Lucene, or does it work as designed? 2010/6/28 tarun sapra > Hi , > > I think I have been able to u