Hi Erick,
Thanks for your reply; now I understand why I cannot get the search
result. But can you guide me on how I can use Lucene to implement the following
search feature:
Basically we can call this feature "fuzzy phrase search", which means the
search phrase may contain more words or fewer words
Hi,
Just to let everyone know, Manning has released an extra chapter from
the excellent LIA 2E book, discussing CLucene - the C++ port of Lucene.
It is available for free at
http://www.code972.com/blog/2010/06/lucene-in-action-free-chapter-coupon-code/.
35% discount for CLucene users is av
Boris, you might wanna look at http://code.google.com/p/boilerpipe/
simon
On Mon, Jun 28, 2010 at 10:48 PM, Boris Aleksandrovsky
wrote:
> Thanks, Sashi, I am asking more about a general library which will remove
> those HTML elements which are unwanted/useless for indexing. For instance, we
> are
Thanks, Sashi, I am asking more about a general library which will remove
those HTML elements which are unwanted/useless for indexing. For instance, we
are using a general method to remove headers by comparing the structure of
HTML on the top-level document from the site (e.g. www.nytimes.com) and t
I have used TagSoup to parse the HTML and get the elements of interest.
http://ccil.org/~cowan/XML/tagsoup/
On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky
wrote:
> I was wondering if any of you know of any open-source solutions for general
> issues which arise in web crawling - how do yo
I was wondering if any of you know of any open-source solutions for general
issues which arise in web crawling - how do you remove
headers/footers/javascript and generally cleanup html of a web-page before
indexing? We have a first-pass solution implemented using custom code, but
this must be a pro
No, I don't think so. The critical bit is that the indexed text
does NOT contain the word "formal". So searching for
any phrase that DOES contain "formal" should fail no matter
what the slop.
Phrase queries are something like "find all the words in this
search string, ignoring some number of inte
Hey Erick
Thanks mate!
So I guess my explanation in the mail chain above was correct!
On Mon, Jun 28, 2010 at 6:20 AM, Erick Erickson wrote:
> I think you're misunderstanding the intent of PhraseQueries and slop. Slop
> is the number of intervening tokens that may exist between the words
> you'
I think you're misunderstanding the intent of PhraseQueries and slop. Slop
is the number of intervening tokens that may exist between the words
you're looking for. However, all the words you're looking for MUST exist.
So,
<<< whenever the search phrase contains a word that doesn't
exist in the docum
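
The rule described above (every phrase term must be present, and slop only relaxes where the terms sit) can be illustrated with a small stand-alone model. This is a simplified sketch, not Lucene's actual implementation: the class name `SloppyPhraseSketch`, the whitespace tokenization, and the sample text are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SloppyPhraseSketch {

    /**
     * Simplified model of a sloppy phrase match (hypothetical helper, not a Lucene API).
     * Returns false as soon as any phrase term is absent from the document;
     * otherwise checks whether some choice of occurrences stays within the slop.
     */
    public static boolean matches(String doc, String phrase, int slop) {
        List<String> tokens = Arrays.asList(doc.toLowerCase().split("\\s+"));
        String[] terms = phrase.toLowerCase().split("\\s+");
        List<List<Integer>> offsets = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            // Store each occurrence as (document position - position in phrase).
            // An exact phrase match gives the same offset for every term.
            List<Integer> adjusted = new ArrayList<>();
            for (int pos = 0; pos < tokens.size(); pos++) {
                if (tokens.get(pos).equals(terms[i])) {
                    adjusted.add(pos - i);
                }
            }
            if (adjusted.isEmpty()) {
                return false; // a missing term can never be rescued by slop
            }
            offsets.add(adjusted);
        }
        return search(offsets, 0, Integer.MAX_VALUE, Integer.MIN_VALUE, slop);
    }

    // Try one occurrence per term and keep the spread of offsets within slop.
    private static boolean search(List<List<Integer>> offsets, int term,
                                  int min, int max, int slop) {
        if (term == offsets.size()) {
            return true;
        }
        for (int off : offsets.get(term)) {
            int lo = Math.min(min, off);
            int hi = Math.max(max, off);
            if (hi - lo <= slop && search(offsets, term + 1, lo, hi, slop)) {
                return true;
            }
        }
        return false;
    }
}
```

With the made-up text "the quick brown fox jumps", the phrase "quick fox" fails at slop 0 and succeeds at slop 1, while "quick formal fox" fails at any slop because "formal" never occurs in the text.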
Hi,
My test result is that whenever the search phrase contains a word that doesn't
exist in the document, the search result is empty no matter how big a
slop factor I set. Is this a bug in Lucene, or is it working as designed?
2010/6/28 tarun sapra
> Hi ,
>
> I think I have been able to u