I think the best way to tokenize/stem is to use the analyzer directly. For
example:
TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
Token token = null;
while ((token = ts.next()) != null) {
    Term newTerm = new Term(field, token.termText());
    // ... use newTerm, e.g. to build a query per stemmed token ...
}
Hi,
I'm getting an ArrayIndexOutOfBoundsException when I try to create an
instance of IndexSearcher with an FSDirectory.
With
IndexSearcher searcher = new IndexSearcher(directory);
I get the following stack trace:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
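For reference, a minimal sketch of that setup (the index path here is just a placeholder; this assumes the FSDirectory.getDirectory API of that era):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

// open an existing index; create == false leaves it untouched
FSDirectory directory = FSDirectory.getDirectory("/path/to/index", false);
IndexSearcher searcher = new IndexSearcher(directory); // the exception occurs here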
: yes - i guess this is more or less what i mean. an example are the two
: documents:
:
: 1 - with the titles:
: "http"
: "hypertext transfer protocol"
:
: 2 - with the title:
: "http tunnel"
:
: when i use multi-valued fields and do a search on "http" the title
: score on the second document is hi
hello,
> i can think of two possibilities you might be referring to when you say
> "noise" ... one is that the lengthNorm for docs with many variant
> titles causes matches in those titles to not score as well as
> documents with only one title -- this can be dealt with by overriding
> the lengthNo
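A minimal sketch of that kind of override (the class name and the flat 1.0f norm are my choices, not anything from this thread; assumes the DefaultSimilarity API of the time):

import org.apache.lucene.search.DefaultSimilarity;

public class FlatTitleSimilarity extends DefaultSimilarity {
    // stop penalizing docs whose title field holds many variant titles
    public float lengthNorm(String fieldName, int numTokens) {
        return "title".equals(fieldName) ? 1.0f
                : super.lengthNorm(fieldName, numTokens);
    }
}

Note that lengthNorm is applied at index time and stored in the norms, so the custom Similarity has to be set on the IndexWriter (writer.setSimilarity(...)) before the documents are indexed.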
: - If I index all of the possible titles in a multivalued field this
: introduces some kind of noise and therefore also bad results. The
: reason is that Lucene concatenates all the values of multi-valued
: fields when searching them. While a single one of these fields may be a
: perfect match thi
Hello,
I would like to use Lucene to index a set of articles, where several
different titles may belong to a single article. Currently I use a
field for the article as well as a multi-valued field for the titles.
My problem is:
- If I index only one of the titles I won't get matches when someo
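For concreteness, a rough sketch of how such a document might be built (field names are mine; this uses the Lucene 1.9-era Field API):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
doc.add(new Field("article", articleText, Field.Store.YES, Field.Index.TOKENIZED));
// adding the same field name repeatedly makes "title" multi-valued
doc.add(new Field("title", "http", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("title", "hypertext transfer protocol", Field.Store.YES, Field.Index.TOKENIZED));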
: For example, a doc contains the word "test" 3 times and the word
: "example" once, and the query looks for both words; the score for the doc
: should be 4.
:
: But whatever I do, score is 1.
1) this is where Searcher.explain really comes in handy ... it will help
you see what is going on.
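A sketch of what that looks like (assumes the Hits-based search API of the time):

import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Hits;

Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    // prints the full breakdown of how this document's score was computed
    Explanation exp = searcher.explain(query, hits.id(i));
    System.out.println(exp.toString());
}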
We ran into a problem when implementing a similar infrastructure using NFS. We
were updating our indexes continuously throughout the day which caused disk
space problems when using NFS. I no longer recall the specific details, but in
our configuration, NFS did not appear to flush stale file hand
Hello:
We are developing a WebSphere application using Lucene. Can we use the
following architecture?
1. Store the index on an NFS file system which is mounted on all four UNIX
machines.
2. The WebSphere application just performs searches (read-only access to the
index on NFS).
3. One of the four machin
i have indexed files using IndexFiles,
how can i add the field to the document using this?
cheers,
trupti mulajkar
MSc Advanced Computer Science
Quoting karl wettin <[EMAIL PROTECTED]>:
>
> 2 May 2006 at 16:11, trupti mulajkar wrote:
> >
> > doc(i).get("contents");
> >
> > i only get NULL
Thanks for your quick reply.
I will go through it.
Regards,
Jelda
> -Original Message-
> From: mark harwood [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, May 02, 2006 5:03 PM
> To: java-user@lucene.apache.org
> Subject: RE: OutOfMemoryError while enumerating through
> reader.terms(fieldName)
>
"Category counts" should really be a FAQ entry.
There is no one right solution to prescribe because it
depends on the shape of your data.
For previous discussions/code samples see here:
http://www.mail-archive.com/java-user@lucene.apache.org/msg05123.html
and here for more space-efficient repre
I just got an idea for category counting instead of following this BitSet
approach.
I will maintain an array mapping docIds to category_ids as values,
i.e. documents[docId] = category_id.
For 1 million docs that takes around 8 MB (docid = 4 bytes,
category_id = 4 bytes per doc).
And then from user que
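A sketch of how that array might be used at query time (method and variable names are mine; assumes the old QueryFilter.bits() API returning a java.util.BitSet):

import java.util.BitSet;

// docToCategory[docId] = category_id, built once at startup;
// queryBits comes from new QueryFilter(userQuery).bits(reader)
static int[] countCategories(BitSet queryBits, int[] docToCategory, int numCategories) {
    int[] counts = new int[numCategories];
    // visit only the docs that matched the user query
    for (int docId = queryBits.nextSetBit(0); docId >= 0;
         docId = queryBits.nextSetBit(docId + 1)) {
        counts[docToCategory[docId]]++;
    }
    return counts;
}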
Thanks for the Field.setOmitNorms(true) tip!
Regarding the Similarity implementation I am trying to write, somehow it does
not work.
Here's what I understand:
The Scorer implementation uses the methods defined in Similarity to compute
the score (the formula expressed in
"http://lucene.apache.org/java/docs
I am trying to implement category counts, broadly similar to the CNET approach.
At initialization time, I create all these category BitSets and
then AND them with the user query (a BitSet obtained from a
QueryFilter wrapping the user query).
This way my application stays performant. Don't u
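A sketch of that intersection step (the category term is a made-up example; QueryFilter.bits() returning java.util.BitSet is the old API this thread is using):

import java.util.BitSet;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

// built once at initialization, one BitSet per category
BitSet categoryBits =
    new QueryFilter(new TermQuery(new Term("category", "laptops"))).bits(reader);

// per user query: intersect and count
BitSet queryBits = new QueryFilter(userQuery).bits(reader);
BitSet intersection = (BitSet) categoryBits.clone();
intersection.and(queryBits);
int categoryCount = intersection.cardinality();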
Lucene's fields are case sensitive and I think "contents" is written in
lower case by default.
Cheers,
Frank
-Original Message-
From: trupti mulajkar [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 02, 2006 4:11 PM
To: java-user@lucene.apache.org
Subject: Re: creating indexReader object
>>Any advice is really welcome.
Don't cache all that data.
You need a minimum of (numUniqueTerms*numDocs)/8 bytes
to hold that info.
Assuming 10,000 unique terms and 1 million docs you'd
need over 1 Gig of RAM (10,000 * 1,000,000 / 8 = 1.25 billion bytes).
I suppose the question is what are you trying to
achieve and why can't you use the exis
2 May 2006 at 16:11, trupti mulajkar wrote:
doc(i).get("Contents");
i only get NULL
any ideas ?
Did you index the field with a term vector when you added it to the
document?
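That is, something like this when the document is built (a sketch; the field name and stored/tokenized choices are assumptions):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// Field.TermVector.YES makes the term frequency vector retrievable later
doc.add(new Field("contents", text, Field.Store.YES,
        Field.Index.TOKENIZED, Field.TermVector.YES));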
Try using Luke to see what the document actually looks like in the index.
http://www.getopt.org/luke/
-Venu
-Original Message-
From: trupti mulajkar [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 02, 2006 7:41 PM
To: java-user@lucene.apache.org
Subject: Re: creating indexReader object
thanx hann
thanks hannes,
but i don't think i made my query clear enough.
i have created the IndexReader object just the way you mentioned it, but after
that, when i try to create the vectors like term frequency and document
frequency using
doc(i).get("Contents");
i only get NULL
any ideas ?
cheer
Hi,
IndexReader has some static methods, e.g.
IndexReader reader = IndexReader.open(new File("/index"));
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#open(java.lang.String)
Hannes
trupti mulajkar wrote:
i am trying to create an object of index reader class
i am trying to create an object of the IndexReader class that reads my index. i
need this to further generate the document and term frequency vectors.
however, when i try to print the contents of the documents (doc.get("contents"))
it shows null.
any suggestions?
if i can't read the contents then i c
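For the vector part, a sketch of the calls involved, assuming the field was indexed with term vectors (the path and field name are placeholders):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;

IndexReader reader = IndexReader.open("/index");
// per-document term frequency vector; null if no term vector was indexed
TermFreqVector tfv = reader.getTermFreqVector(i, "contents");
String[] terms = tfv.getTerms();
int[] freqs = tfv.getTermFrequencies();
// document frequency of a single term across the index
int df = reader.docFreq(new Term("contents", terms[0]));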
Hi,
I just debugged it closely. Sorry, I am getting the OutOfMemoryError not because
of reader.terms(),
but because of invoking the QueryFilter.bits() method for each unique term.
I will try to explain with pseudocode:
while (term != null) {
    if (term.field().equals(name)) {
        String termText = term.text();
        BitSet bits = new QueryFilter(new TermQuery(term)).bits(reader); // OOM happens here
    }
    term = termEnum.next() ? termEnum.term() : null;
}
I was quickly looking at its web page earlier today and it looks good so
far! Good news!
However, I have one question: does Kneobase contain any kind of web crawler
functionality (like Nutch), or do I have to feed it with all sources
*manually*? How much of the web-data gathering can be automated?
Hi,
I am getting an OutOfMemoryError while enumerating through the TermEnum after
invoking reader.terms(fieldName).
Just to provide more information, I have almost 1 unique terms in
field A. I can successfully enumerate around 5000 terms, but later I am
getting an OutOfMemoryError.
I set the jvm max
Hi list,
I'm glad to announce Colaborativa.net has released Kneobase, an open
source "enterprise search" product based on Lucene.
Kneobase can accept many data sources as searchable elements, and can
provide search results in multiple formats, including SOAP, which might
make it a