Earl Hood wrote:
>
> Looks like Jericho does what you want already:
> http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html
>
> --ewh
>
I went through their feature list and found that out :)
http://jericho.htmlparser.net/docs/index.html
Thanks Earl :)
This i
On Mon, Mar 14, 2011 at 11:46 PM, shrinath.m wrote:
> I used Jericho and found it extremely simple to start with ...
>
> Just wanted to clarify one thing though.
> Is there some tool that does extract text from HTML without creating the DOM
Looks like Jericho does what you want already:
http://je
I started trying out all your suggestions one by one, thanks to all who
helped.
I used Jericho and found it extremely simple to start with ...
Just wanted to clarify one thing though.
Is there a tool that extracts text from HTML without creating the DOM?
--
Regards
Shrinath.M
Thank you for your help!
Best Regards,
Vicky
Quoting Erick Erickson :
Nope, that should do it.
Best
Erick
On Mon, Mar 14, 2011 at 9:35 AM, Vasiliki Gkouta wrote:
Sorry for the confusion. I have two analyzers(of StandardAnalyzer) and use
no stemmers. At the one analyzer I passed a german st
On Mar 14, 2011 5:10 PM, Paul Libbrecht wrote:
Merlin,
the kind of magic such as "prefer an exact match" still has to be programmed.
Searching in a field with double-metaphone analyzer will only compare tokens by
their double-metaphone-results.
You probably want query expansion:
text:picasso
Merlin,
the kind of magic such as "prefer an exact match" still has to be programmed.
Searching in a field with double-metaphone analyzer will only compare tokens by
their double-metaphone-results.
You probably want query expansion:
text:picasso
to be expanded to:
text:picasso^3.0 text.stemm
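The expansion described above can be sketched in plain Java as a string rewrite: the exact-match clause gets a boost and the same term is repeated against the looser fields. This is only an illustration. The class name, the method, and the field name text_phonetic are hypothetical and do not come from the thread, and the sketch does not escape query-syntax characters in the term.

```java
import java.util.Arrays;
import java.util.List;

public class QueryExpansion {

    // Builds "exactField:term^boost otherField:term ..." as a query string.
    // Field names are supplied by the caller; none are taken from the thread.
    public static String expand(String term, String exactField,
                                float exactBoost, List<String> looseFields) {
        StringBuilder q = new StringBuilder();
        // Boosted clause on the exactly-indexed field, so exact spellings rank first.
        q.append(exactField).append(':').append(term).append('^').append(exactBoost);
        // Unboosted clauses on the looser (e.g. phonetic) fields.
        for (String field : looseFields) {
            q.append(' ').append(field).append(':').append(term);
        }
        return q.toString();
    }

    public static void main(String[] args) {
        // "text_phonetic" is a hypothetical field name used for illustration.
        System.out.println(expand("picasso", "text", 3.0f,
                Arrays.asList("text_phonetic")));
    }
}
```

The resulting string would then be handed to whatever query parser the application already uses.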
Hi guys,
Here is my noob question:
I'm trying to do fuzzy search on first name and last name. I'm using the
double metaphone analyzer, and I encountered the following problem.
For example, when I search for picasso, "paski" shows up with the same
score as the spelling "picasso". When I look at
Further to what Erick says, recent versions of lucene hang on to
unclosed readers longer than old versions used to.
lsof -p pid can be useful here: run it and grep for deleted files
still being held open.
--
Ian.
On Mon, Mar 14, 2011 at 6:13 PM, Erick Erickson wrote:
> This sounds like you're
This sounds like you're not closing your index searchers and the file system
is keeping them around. On the Unix box, does your index space reappear
just by restarting the process?
Not using reopen correctly is sometimes the culprit; you need something like
this (taken from the javadocs).
IndexRe
I had exactly the same requirement to parse and index offline HTML files. I
wrote my own HTML scanner using
javax.swing.text.html.HTMLEditorKit.Parser. It sounds difficult, but it is
pretty simple and straightforward to implement; a simple 40-line Java class
did the job for me.
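For anyone wanting to try this route, here is a minimal sketch of a callback-based scanner built on the JDK's Swing HTML parser. It is not the poster's 40-line class. The class and method names are made up for illustration, and whitespace is simply collapsed to single spaces. Text is collected as the parser reports it, so the application never walks a DOM.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlTextScanner {

    // Returns the text content of the given HTML, with runs of whitespace collapsed.
    public static String extractText(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags.
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(data);
            }
        };
        try {
            // true = ignore any charset declared in the document; we already have chars.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractText(
                "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"));
    }
}
```

The returned string can then be added to an indexed field like any other text.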
shrinath.m wrote:
Nope, that should do it.
Best
Erick
On Mon, Mar 14, 2011 at 9:35 AM, Vasiliki Gkouta wrote:
> Sorry for the confusion. I have two analyzers(of StandardAnalyzer) and use
> no stemmers. At the one analyzer I passed a german stop words set to the
> constructor and at the other one I passed an engli
Hello All:
Background:
I have a text based search engine implemented in Java using Lucene 3.0.
Indexing and re-indexing happens every night at 1 am as a scheduled process.
The index size is around 1 gig and is recreated every night.
Issues
1. Now I have a peculiar problem that happens only on my
Hi,
Does Lucene always count the number of documents with hits matching a
query or is it also possible to count the overall number of hits?
There would be a difference between the two if within a document there
is actually more than one hit.
Thank you in advance!
Best,
Michael
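To make the distinction in this question concrete, here is a small plain-Java sketch that counts both numbers over in-memory strings. It does not use Lucene at all. The class and method names and the sample documents are invented for illustration, and tokenization is a naive split on non-word characters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class HitCounts {

    // Number of times the term occurs in one document (case-insensitive).
    public static int occurrences(String doc, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : doc.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    // Number of documents that contain the term at least once.
    public static int matchingDocs(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            if (occurrences(doc, term) > 0) {
                count++;
            }
        }
        return count;
    }

    // Total number of occurrences across all documents.
    public static int totalOccurrences(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            count += occurrences(doc, term);
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "lucene is fast, lucene is java",
                "solr uses lucene",
                "nothing here");
        System.out.println(matchingDocs(docs, "lucene"));     // documents with a hit
        System.out.println(totalOccurrences(docs, "lucene")); // hits overall
    }
}
```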
Hello Ranjit,
Can you use the latest Luke tool? It has an analyzer section that helps
in deciding which analyzer to use based on the input.
Hope this helps
-vinaya
On Monday 14 March 2011 07:18 PM, Ranjit Kumar wrote:
Hi,
I am creating index using Lucene 3.0.3 *StandardAnalyzer*.
when sear
Google finds http://www.gossamer-threads.com/lists/lucene/java-user/91750 which
looks like a good starting point.
--
Ian.
P.S. Plain text emails are preferable.
On Mon, Mar 14, 2011 at 1:48 PM, Ranjit Kumar wrote:
> Hi,
>
> I am creating index using Lucene 3.0.3 *StandardAnalyzer*.
>
> when
Hi,
I am creating an index using the Lucene 3.0.3 StandardAnalyzer.
When I search the index with a query like C, C# or C++, it gives the same
result for all three terms. As far as I know, the analyzer ignores special
characters while creating the index and does not index them. I have tried
KeywordAna
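The effect described here can be reproduced outside Lucene. The sketch below is not Lucene's tokenizer. It is a crude plain-Java stand-in (all names invented for illustration) showing that any analysis step which drops punctuation maps C, C# and C++ to the same term, while splitting on whitespace alone keeps them distinct.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SymbolTerms {

    // Stand-in for an analyzer that discards punctuation: keeps letters and digits only.
    public static String stripPunctuation(String term) {
        return term.replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase(Locale.ROOT);
    }

    // Stand-in for whitespace-only tokenization: symbols stay attached to the token.
    public static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        for (String term : whitespaceTokens("C C# C++")) {
            // All three tokens collapse to the same stripped form.
            System.out.println(term + " -> " + stripPunctuation(term));
        }
    }
}
```

If the symbols must survive, the same analysis has to be applied both at index time and at query time.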
Stephane,
I think that you have the freedom to put what you want in the stored value of a
field.
The simplest approach would be to store the fields that you want to use for
display preformatted (xml-ished, owl-ified, or json-ized), kept
separate from the indexed fields (where yo
Sorry for the confusion. I have two analyzers (both StandardAnalyzer) and
use no stemmers. To one analyzer I passed a German stop word set
in the constructor, and to the other one I passed an English stop word
set. My question was whether I have to call any other function of the
german analyze
I don't understand what you're saying here. If you put a stemmer in the
constructor, you *are* using it. If you don't specify any stemmer at all, you
still have to define different analyzers to use different stop word lists.
Can you restate your question?
Best
Erick
On Mon, Mar 14, 2011 at 8:21
Thanks a lot for your help Erick! About the fields you mentioned: If I
don't use stemmers, except for the constructor argument related to the
stop words, is there anything else that I have to modify?
Thanks,
Vicky
Quoting Erick Erickson :
StandardAnalyzer works well for most European lang
Hello Stephane,
I think a better way is to have a resource file for each language
and store a pointer in the index to get to the correct resource file
(something like the I18N and L10N approach). Store the internationalised
string in the index and all related localised strings in the resource file.
Thi
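You could build the query up in your program, or that part of it anyway.
BooleanQuery bq = new BooleanQuery();
FuzzyQuery fq = new FuzzyQuery(...);
fq.setBoost(123f);
// add() needs an Occur; SHOULD is assumed here, use MUST if the clause is required
bq.add(fq, BooleanClause.Occur.SHOULD);
...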
This might be a bug in MultiFieldQueryParser - you could provide a
test case or, better, a patch.
See https://issu