Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-14 Thread shrinath.m
Earl Hood wrote: > > Looks like Jericho does what you want already: > http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html > > --ewh > I went through their feature list and found that out :) http://jericho.htmlparser.net/docs/index.html Thanks Earl :) This i

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-14 Thread Earl Hood
On Mon, Mar 14, 2011 at 11:46 PM, shrinath.m wrote: > I used Jericho and found it extremely simple to start with ... > > Just wanted to clarify one thing though. > Is there some tool that does extract text from HTML without creating the DOM Looks like Jericho does what you want already: http://je

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-14 Thread shrinath.m
I started trying out all your suggestions one by one, thanks to all who helped. I used Jericho and found it extremely simple to start with ... Just wanted to clarify one thing though. Is there some tool that does extract text from HTML without creating the DOM ? -- Regards Shrinath.M -- View

Re: Analyzer enquiry

2011-03-14 Thread Vasiliki Gkouta
Thank you for your help! Best Regards, Vicky Quoting Erick Erickson : Nope, that should do it. Best Erick On Mon, Mar 14, 2011 at 9:35 AM, Vasiliki Gkouta wrote: Sorry for the confusion. I have two analyzers(of StandardAnalyzer) and use no stemmers. At the one analyzer I passed a german st

Re: lucene double metaphone ranking.

2011-03-14 Thread merlin.list
On Mar 14, 2011 5:10 PM, Paul Libbrecht wrote: Merlin, the kind of magic such as "prefer an exact match" still has to be programmed. Searching in a field with double-metaphone analyzer will only compare tokens by their double-metaphone-results. You probably want query expansion: text:picasso

Re: lucene double metaphone ranking.

2011-03-14 Thread Paul Libbrecht
Merlin, the kind of magic such as "prefer an exact match" still has to be programmed. Searching in a field with double-metaphone analyzer will only compare tokens by their double-metaphone-results. You probably want query expansion: text:picasso to be expanded to: text:picasso^3.0 text.stemm
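A minimal sketch of the query expansion Paul describes, using only string building. The second field name in the preview above is cut off, so the `text.phonetic` name used here is a hypothetical stand-in, and `QueryExpander` is an illustrative helper, not part of Lucene. The idea is that an exact-spelling match adds a boosted clause on top of the phonetic match, so "picasso" should outrank "paski".

```java
final class QueryExpander {

    /**
     * Builds a query string with a boosted clause on the exact field and an
     * unboosted clause on the phonetic field. Both field names are supplied by
     * the caller; nothing here is specific to a particular index layout.
     */
    static String expand(String term, String exactField, float exactBoost, String phoneticField) {
        return exactField + ":" + term + "^" + exactBoost
                + " " + phoneticField + ":" + term;
    }

    public static void main(String[] args) {
        // "text.phonetic" is an assumed field name, indexed with the double-metaphone analyzer.
        System.out.println(expand("picasso", "text", 3.0f, "text.phonetic"));
    }
}
```

The sketch does not escape query-syntax characters in the term; user input would need escaping before being handed to a query parser.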

lucene double metaphone ranking.

2011-03-14 Thread merlin.list
Hi guys, Here is my noob question: I'm trying to do fuzzy search on first name, last name. I'm using the double metaphone analyzer, and I encountered the following problem. For example, when I search for picasso, "paski" shows up with the same score as the spelling of "picasso". When I look at

Re: Issue with disk space on UNIX

2011-03-14 Thread Ian Lea
Further to what Erick says, recent versions of lucene hang on to unclosed readers longer than old versions used to. lsof -p pid can be useful here: run it and grep for deleted files still being held open. -- Ian. On Mon, Mar 14, 2011 at 6:13 PM, Erick Erickson wrote: > This sounds like you're

Re: Issue with disk space on UNIX

2011-03-14 Thread Erick Erickson
This sounds like you're not closing your index searchers and the file system is keeping them around. On the Unix box, does your index space reappear just by restarting the process? Not using reopen correctly is sometimes the culprit; you need something like this (taken from the javadocs). IndexRe

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-14 Thread Sirish Vadala
I had exactly the same requirement to parse and index offline HTML files. I had written my own HTML scanner using javax.swing.text.html.HTMLEditorKit.Parser. It sounds difficult, but it is pretty simple and straightforward to implement; a simple 40-line Java class did the job for me. shrinath.m wrote:
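Sirish's class is not shown in the thread, so the following is only a rough sketch of what such a scanner might look like, not his implementation. It uses the JDK's `ParserDelegator` (a concrete `HTMLEditorKit.Parser`) with a `ParserCallback` that collects text chunks as they are reported, so no DOM is built. The class name `HtmlTextExtractor` and the choice to join chunks with a single space are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class HtmlTextExtractor {

    /** Returns the text content of the given HTML, with whitespace runs collapsed. */
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The callback is invoked for each run of character data the parser finds.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' '); // separate chunks so adjacent elements do not fuse
                }
                text.append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // Third argument true: ignore any charset declared in the document.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<html><body><p>Hello <b>Lucene</b> world</p></body></html>"));
    }
}
```

The returned string could then be added to a document as an analyzed field. Because chunks are joined with a space, a word split by inline markup would be indexed as two tokens, which may or may not matter for a given corpus.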

Re: Analyzer enquiry

2011-03-14 Thread Erick Erickson
Nope, that should do it. Best Erick On Mon, Mar 14, 2011 at 9:35 AM, Vasiliki Gkouta wrote: > Sorry for the confusion. I have two analyzers(of StandardAnalyzer) and use > no stemmers. At the one analyzer I passed a german stop words set to the > constructor and at the other one I passed an engli

Issue with disk space on UNIX

2011-03-14 Thread Sirish Vadala
Hello All: Background: I have a text based search engine implemented in Java using Lucene 3.0. Indexing and re-indexing happens every night at 1 am as a scheduled process. The index size is around 1 gig and is recreated every night. Issues 1. Now I have a peculiar problem that happens only on my

no. of documents with hits vs. no. of hits

2011-03-14 Thread Michael Wiegand
Hi, Does Lucene always count the number of documents with hits matching a query or is it also possible to count the overall number of hits? There would be a difference between the two if within a document there is actually more than one hit. Thank you in advance! Best, Michael

Re: lucene3.0.3 | Special character indexing

2011-03-14 Thread Vinaya Kumar Thimmappa
Hello Ranjit, Can you use the latest Luke tool? It has an analyzer section which helps in deciding which analyzer to use based on the input. Hope this helps -vinaya On Monday 14 March 2011 07:18 PM, Ranjit Kumar wrote: Hi, I am creating index using Lucene 3.0.3 *StandardAnalyzer*. when sear

Re: lucene3.0.3 | Special character indexing

2011-03-14 Thread Ian Lea
Google finds http://www.gossamer-threads.com/lists/lucene/java-user/91750 which looks like a good starting point. -- Ian. P.S. Plain text emails are preferable. On Mon, Mar 14, 2011 at 1:48 PM, Ranjit Kumar wrote: > Hi, > > I am creating index using Lucene 3.0.3 *StandardAnalyzer*. > > when

lucene3.0.3 | Special character indexing

2011-03-14 Thread Ranjit Kumar
Hi, I am creating an index using the Lucene 3.0.3 StandardAnalyzer. When a search is made on the index using a query like C, C# or C++, it gives the same result for all three terms. As far as I know, while creating the index the analyzer ignores special characters and does not index them. I have tried KeywordAna
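To illustrate the effect Ranjit describes, here is a toy comparison in plain Java. It is not Lucene's StandardAnalyzer; `stripPunctuation` is a hypothetical stand-in for any tokenizer that discards non-alphanumeric characters, and `whitespaceOnly` stands in for one that splits on whitespace alone (similar in spirit to Lucene's WhitespaceAnalyzer).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TokenizerComparison {

    /** Lowercases and splits on anything that is not a letter or digit. */
    static List<String> stripPunctuation(String input) {
        List<String> tokens = new ArrayList<>();
        for (String t : input.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Lowercases and splits on whitespace only, keeping '#' and '+'. */
    static List<String> whitespaceOnly(String input) {
        List<String> tokens = new ArrayList<>();
        for (String t : input.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuation("C++ and C#")); // [c, and, c]
        System.out.println(whitespaceOnly("C++ and C#"));   // [c++, and, c#]
    }
}
```

With the first approach "C", "C#" and "C++" all reduce to the same token, which is why the three queries return identical results. Whatever tokenization is chosen has to be applied both at index time and at query time.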

Re: Indexing of multilingual labels

2011-03-14 Thread Paul Libbrecht
Stephane, I think that you have the freedom to put what you want in the stored value of a field. The simplest would even be to make it that the fields that you want to use for display are stored, preformatted, xml-ished, owl-ified, or json-ized, to be separate from the indexed fields (where yo

Re: Analyzer enquiry

2011-03-14 Thread Vasiliki Gkouta
Sorry for the confusion. I have two analyzers(of StandardAnalyzer) and use no stemmers. At the one analyzer I passed a german stop words set to the constructor and at the other one I passed an english stop words set. My question was if I have to call any other function of the german analyze

Re: Analyzer enquiry

2011-03-14 Thread Erick Erickson
I don't understand what you're saying here. If you put a stemmer in the constructor, you *are* using it. If you don't specify any stemmer at all, you still have to define different analyzers to use different stop word lists. Can you restate your question? Best Erick On Mon, Mar 14, 2011 at 8:21

Re: Analyzer enquiry

2011-03-14 Thread Vasiliki Gkouta
Thanks a lot for your help Erick! About the fields you mentioned: If I don't use stemmers, except for the constructor argument related to the stop words, is there anything else that I have to modify? Thanks, Vicky Quoting Erick Erickson : StandardAnalyzer works well for most European lang

Re: Indexing of multilingual labels

2011-03-14 Thread Vinaya Kumar Thimmappa
Hello Stephane, I think a better way is to have a resource file for each language and store a pointer in the index to get to the correct resource file (something like the I18N and L10N approach). Store the internationalised string in the index and all related localised strings in the resource file. Thi

Re: Boost a field in fuzzy query

2011-03-14 Thread Ian Lea
You could build the query up in your program, or that part of it anyway. BooleanQuery bq = new BooleanQuery(); FuzzyQuery fq = new FuzzyQuery(...); fq.setBoost(123f); bq.add(fq); ... This might be a bug in MultiFieldQueryParser - you could provide a test case or, better, a patch. See https://issu