NPE in IndexReader

2007-08-21 Thread Eric Louvard

Hello, while calling IndexReader.deleteDocument(int) I am getting an NPE.

java.lang.NullPointerException
   at org.apache.lucene.index.IndexReader.acquireWriteLock(IndexReader.java:658)
   at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:686)


In the acquireWriteLock method there is a call to 
'segmentInfos.getVersion()', but segmentInfos seems to be 'null'.


I am working with the head revision from SVN.

Could someone suggest a workaround?

regards, Éric Louvard.







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



lucene suggest

2007-08-21 Thread Heba Farouk
Hello
I would like to implement a suggest feature (like Google Suggest) using 
Lucene. I actually tried using Lucene and it was successful, but I got stuck at 
one point, which is returning a list of results to the user that has no 
duplicates. My question is: is there any way I can remove duplicates from the 
hits returned by the search, or should I manage it manually?


thanks in advance


Yours 

Heba

   

RE: lucene suggest

2007-08-21 Thread Ard Schrijvers
Hello Heba,

You need some Lucene field that serves as an identifier for the documents you 
index. Then, when re-indexing some documents, you can first use the 
identifier to delete the old indexed documents. You have to take care of this 
yourself. 

Regards Ard





RE: lucene suggest

2007-08-21 Thread Chhabra, Kapil
Hi,
Yes, there are ways and workarounds to remove duplicates based on one
field. But you should not need this if you don't index duplicates in
the first place. Just put a call to delete the document from the index right
before you add it.

Best Regards,
Kapil Chhabra
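The delete-before-add pattern described above can be sketched in plain Java. This is only an illustration of the idea, not Lucene API: the `DocStore` class and a `Map` stand in for the index, where a real implementation would delete by a `Term` on the identifier field before adding the new `Document`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of "delete the old document before adding the new one", keyed by an
// identifier field. A LinkedHashMap stands in for the index here.
public class DocStore {
    private final Map<String, String> index = new LinkedHashMap<>();

    // Deleting any existing entry first guarantees the id stays unique.
    public void addDocument(String id, String content) {
        index.remove(id);       // delete the old indexed document, if any
        index.put(id, content); // add the fresh one
    }

    public int size() { return index.size(); }

    public String get(String id) { return index.get(id); }

    public static void main(String[] args) {
        DocStore store = new DocStore();
        store.addDocument("doc-1", "black car");
        store.addDocument("doc-1", "black car (updated)"); // re-index same id
        System.out.println(store.size()); // 1: no duplicate was indexed
    }
}
```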




RE: lucene suggest

2007-08-21 Thread Heba Farouk
The documents are not duplicated; I mean the hits (assume that 2 documents have 
the same subject but different authors, so if I'm searching on the subject, 
the returned hits will have duplicates).
I was asking if I can remove duplicates from the hits.

thanks in advance



Yours 

Heba

   

RE: lucene suggest

2007-08-21 Thread Ard Schrijvers
:S 

If you have two hits, you have it two times in your index, simple as that. So 
you have two Lucene Documents with the same subject but with different 
authors, and when you search on the subject, obviously you get 2 hits.

Check out Luke; it will help you understand your index. Furthermore, you 
probably want to index the document once, and just add the same Lucene field 
twice with a different author value.

Ard




Re: lucene suggest

2007-08-21 Thread Jens Grivolla

You may not want to work with documents at all (where you have the
duplicates), but rather with the terms in your index directly.  Take a
look at WildcardTermEnum etc.

Ciao,
Jens
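The term-based approach above can be sketched in plain Java. This is illustrative only: a string array stands in for the index's term dictionary (which a real implementation would walk with WildcardTermEnum or similar), and a LinkedHashSet drops duplicate terms while preserving order.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Sketch of suggest-by-terms: iterate the terms, keep those matching the
// typed prefix, and let a LinkedHashSet silently drop duplicates.
public class TermSuggester {
    public static List<String> suggest(String prefix, String[] terms) {
        LinkedHashSet<String> unique = new LinkedHashSet<>();
        for (String t : terms) {
            if (t.startsWith(prefix)) {
                unique.add(t); // duplicate terms are ignored by the set
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        String[] terms = {"black", "blackberry", "black", "blue", "blackbird"};
        System.out.println(suggest("black", terms));
        // [black, blackberry, blackbird] -- no duplicate "black"
    }
}
```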




Multiple Documents sharing a common boost

2007-08-21 Thread Raghu Ram
Is it possible to have multiple documents share a common boost?

An example scenario is as follows. The set of documents is clustered into
some set of clusters. Each cluster has a unique clusterId, so each document
has a clusterId field that associates it with its cluster. Each cluster has a
property called cluster score, and each document has to be boosted by its
cluster score. The number of clusters is very small in comparison to the
number of documents (around 100 clusters). The cluster score is updated on a
continual basis, so it can't be stored as the document boost for each
individual document, as we would end up updating every document's boost
daily, which seems infeasible. We are trying to find a more efficient
solution.

Thank you.


Re: Multiple Documents sharing a common boost

2007-08-21 Thread Erick Erickson
One solution is to keep meta-data in your index. Remember that
documents do not all have to have the same field. So you could
index a document with a single field
"metadatanotafieldinanyotherdoc" that contains, say, a list of
all of your clusters and their boosts. Read this document in at
startup time and cache it away in your server. Thereafter, you have
a set of boosts that can be applied at query time.

Of course this is useless if you want to boost at index time.
But I know of no way to change the boost of a document
without deleting and re-adding it with the new boost.

Best
Erick
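The cache-and-apply idea above can be sketched in plain Java. Everything here is illustrative, not Lucene API: the boosts would be loaded once from the metadata document at startup, then multiplied into raw scores at query time.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of query-time cluster boosting: cluster scores are read once
// (e.g. from a single metadata document), cached in a map, and applied
// to each hit's raw score.
public class ClusterBoosts {
    private final Map<Integer, Float> boosts = new HashMap<>();

    public void put(int clusterId, float boost) {
        boosts.put(clusterId, boost);
    }

    // Multiply the raw score by the cached boost of the document's cluster;
    // unknown clusters get a neutral boost of 1.0.
    public float boostedScore(float rawScore, int clusterId) {
        return rawScore * boosts.getOrDefault(clusterId, 1.0f);
    }

    public static void main(String[] args) {
        ClusterBoosts cb = new ClusterBoosts();
        cb.put(0, 2.0f);
        System.out.println(cb.boostedScore(0.5f, 0));  // 1.0 (boosted)
        System.out.println(cb.boostedScore(0.5f, 42)); // 0.5 (neutral)
    }
}
```

Because the cache lives in the server, updating the daily cluster scores only means rewriting the one metadata document and reloading the map, not re-indexing every document.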



Re: Multiple Documents sharing a common boost

2007-08-21 Thread Raghu Ram
Do you mean that we generate a compound query by ANDing the original
query with a query like

( (cluster_id=0)^boost_cluster0 OR (cluster_id=1)^boost_cluster1 ... )

But is this not inefficient, considering that the number of clusters is in
the hundreds?
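The compound query described here can be sketched as plain string building (the syntax follows the standard Lucene query-parser style; the field name `cluster_id` is taken from the thread, and in real code you would build a BooleanQuery programmatically rather than a string):

```java
// Sketch of the compound query: one boost clause per cluster, ORed
// together and ANDed with the original query.
public class ClusterQueryBuilder {
    public static String build(String original, float[] boosts) {
        StringBuilder sb = new StringBuilder("(")
                .append(original).append(") AND (");
        for (int id = 0; id < boosts.length; id++) {
            if (id > 0) sb.append(" OR ");
            // boosts[id] is the current score of cluster `id`
            sb.append("cluster_id:").append(id).append('^').append(boosts[id]);
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(build("subject:car", new float[]{1.5f, 0.7f}));
        // (subject:car) AND (cluster_id:0^1.5 OR cluster_id:1^0.7)
    }
}
```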







Re: Multiple Documents sharing a common boost

2007-08-21 Thread Erick Erickson
Ahhh, I was assuming you didn't need to look at all clusters.
Oops.

That said, the question is really whether this is "good enough"
compared to re-indexing, and only some tests will determine that.
I was surprised at how quickly a *large* number of ORs was
processed by Lucene.

You could also think about implementing a HitCollector that
boosted the raw score of each document based upon the
cluster ID, but be careful not to read the full document in
the HitCollector (you shouldn't have to though, either make
a map early or get creative with filters).

You might find useful information looking through the mail
archive for "faceting", as this seems like a similar
topic.

But I wouldn't go anywhere with anything custom until and
unless I'd satisfied myself that the simple approach of
letting Lucene handle a large set of OR clauses wasn't
performant. Several very bright people put significant
effort into performance; I'd see if they've already done
the hard part.

Erick
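The collector idea above can be sketched in plain Java. All names here are illustrative: a real implementation would extend Lucene's HitCollector, and the docId-to-clusterId array would be built once up front (e.g. from a cached field) so the collector never reads a full document per hit.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a per-hit callback that rescales the raw score using a
// precomputed docId -> clusterId array and a clusterId -> boost array.
public class BoostingCollector {
    private final int[] docToCluster;   // built once before searching
    private final float[] clusterBoost; // updated daily, tiny (one per cluster)
    final List<float[]> collected = new ArrayList<>(); // {docId, boostedScore}

    public BoostingCollector(int[] docToCluster, float[] clusterBoost) {
        this.docToCluster = docToCluster;
        this.clusterBoost = clusterBoost;
    }

    // Mirrors the shape of HitCollector.collect(int doc, float score):
    // called once per matching document with its raw score.
    public void collect(int doc, float score) {
        collected.add(new float[]{doc, score * clusterBoost[docToCluster[doc]]});
    }

    public static void main(String[] args) {
        BoostingCollector c =
            new BoostingCollector(new int[]{0, 1}, new float[]{2.0f, 0.5f});
        c.collect(0, 1.0f); // doc 0 is in cluster 0 -> boosted to 2.0
        c.collect(1, 1.0f); // doc 1 is in cluster 1 -> reduced to 0.5
        System.out.println(c.collected.get(0)[1]); // 2.0
    }
}
```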



benefit of combining fields into one vs. booleanQuery

2007-08-21 Thread Antoine Baudoux

Hi everyone,

My question: I have media items with a "title" field, a "caption" field and a 
"keywords" field.

I want to be able to search in those 3 fields at the same time. For example, 
if I search "black car" the boolean query looks like this combination of 
term queries:

(title=black OR keywords=black OR caption=black) AND (title=car OR 
keywords=car OR caption=car)

So if "black" is in the caption and "car" is in the title, I must find the 
media item.

I'm afraid that those boolean queries will be slow when there are a lot of 
terms in the query.

I can, at index time, add a "fulltext" field to each media item that contains 
title, caption and keywords concatenated.

The query then becomes (fulltext=black AND fulltext=car), which is much 
simpler.

But I must still be able to search only in title, only in caption, or only in 
keywords, so I must keep the other fields, doubling the indexed terms.

Has someone done a similar thing? Is it worth it, or will the first boolean 
query remain fast enough?


Thx,

Antoine




--
Antoine Baudoux
Development Manager
[EMAIL PROTECTED]
Tél.: +32 2 333 58 44
GSM: +32 499 534 538
Fax.: +32 2 648 16 53




Re: benefit of combining fields into one vs. booleanQuery

2007-08-21 Thread Erick Erickson
A three field boolean query isn't very complex, so I don't
think that's a problem. Although it does depend a bit upon
how many terms you allow.

But I'd try the simplest thing first, which would be to put
all the terms in a fulltext field as well as in the individual fields.

Then get some performance measurements and some
idea of what the total size of the index will be and make
some decisions at that point.

It's actually remarkably easy to switch from one
of those solutions to the other if performance isn't what
you need. In general, I've had better luck not worrying
about space and going for simple code, *then* changing
things around if there's a problem.

Best
Erick
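The two query shapes being compared can be sketched as plain string building (query-parser style syntax; the field names come from the thread, and `fulltext` is the proposed catch-all field):

```java
import java.util.StringJoiner;

// Sketch of the two query shapes: expand() builds the per-field
// cross-product boolean query; simple() builds the equivalent query
// against a single concatenated "fulltext" field.
public class QueryShapes {
    public static String expand(String[] terms, String[] fields) {
        StringJoiner and = new StringJoiner(" AND ");
        for (String t : terms) {
            // One OR-group per term, spanning every searchable field.
            StringJoiner or = new StringJoiner(" OR ", "(", ")");
            for (String f : fields) or.add(f + ":" + t);
            and.add(or.toString());
        }
        return and.toString();
    }

    public static String simple(String[] terms) {
        StringJoiner and = new StringJoiner(" AND ");
        for (String t : terms) and.add("fulltext:" + t);
        return and.toString();
    }

    public static void main(String[] args) {
        String[] fields = {"title", "keywords", "caption"};
        System.out.println(expand(new String[]{"black", "car"}, fields));
        System.out.println(simple(new String[]{"black", "car"}));
    }
}
```

The cross-product query grows as terms × fields clauses, while the fulltext query stays at one clause per term; that growth, weighed against the index-size cost of the extra field, is exactly the trade-off to measure.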



Re: NPE in IndexReader

2007-08-21 Thread Michael Busch

Hi Eric,

could you please provide more information about how exactly you create
the IndexReader? A unit test that hits this exception would be even better!

Thanks,
- Michael




Re: NPE in IndexReader

2007-08-21 Thread Chris Lu
Hi, Eric, I think I have the same problem.

I found out that in the latest MultiReader.java, the segmentInfos field is set to null:

  public MultiReader(IndexReader[] subReaders) throws IOException {
    super(subReaders.length == 0 ? null : subReaders[0].directory(),
          null, false, subReaders);
  }

However, segmentInfos is used in several places, causing NPEs.
For example, in IndexReader.acquireWriteLock():

  if (SegmentInfos.readCurrentVersion(directory) > segmentInfos.getVersion()) {

So I think MultiReader.java needs some adjustments.


-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes



Re: NPE in IndexReader

2007-08-21 Thread Chris Hostetter

: I found out in latest MultiReader.java, the "SegmentInfos" is set to null.

: However, segmentInfos are used in several places, causing NPEs.
: For example, in IndexReader.acquireWriteLock(),

MultiReader was refactored into two classes: MultiReader, which is now only
constructed from other readers, and MultiSegmentReader, which is what
IndexReader.open returns when a directory contains multiple segments ...
segmentInfos shouldn't be needed in the first case -- and doesn't make
much sense at all.



-Hoss




Re: NPE in IndexReader

2007-08-21 Thread Chris Lu
Right now I am very confused.

I agree segmentInfos is not needed in this case. But it's used in
acquireWriteLock(). What should we do?




Re: NPE in IndexReader

2007-08-21 Thread Chris Hostetter

: I agree segmentInfos is not needed in this case. But it's used in
: acquireWriteLock(). What should we do?

This is one of the reasons why I was suggesting in a java-dev thread that
*all* of the references to SegmentInfos be refactored out of IndexReader
and into the subclasses -- any attempt to access the SegmentInfos in a
MultiReader constructed from an IndexReader[] doesn't make sense -- and it
never has.  In past releases, any operation in a MultiReader constructed
from subReaders that attempted to use SegmentInfos (like acquireWriteLock)
would never work properly (at best it would lock the first subReader).

...hmm, looking at this more and getting more confused...

I was going to say that this was all a red herring and couldn't cause the
NPE, since the only time acquireWriteLock is (or ever has been) called is
when the IndexReader knows it owns the directory, but a quick skim of
MultiReader on the trunk to try to find where it tells the super class it
doesn't own the directory made me realize it's not there ... MultiReader
extends MultiSegmentReader and passes info up the chain of super
constructors about not closing the Directory on close, but there is no info
passed up about not owning the directory.

Skimming 2.2, I don't see how this ever worked "correctly" ... it wouldn't
NPE, but it looks like any attempt at doing anything on a MultiReader
would always acquire the write lock on the *first* sub reader, then
delegate to the correct subreader.

It appears that this is yet another flaw in the old impl exposed via the
refactoring done so far (and further motivation for continued refactoring
until IndexReader is a glorified abstract class with no meat in it).

(Can someone sanity-check my assessment ... I'm a little dizzy from
flipping back and forth between various versions of IndexReader,
MultiReader and MultiSegmentReader.)




-Hoss
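The refactoring being argued for can be sketched in plain Java. This is purely illustrative, not the actual Lucene implementation: segment-specific state stays in the subclass that has it, and a composite reader implements acquireWriteLock() by delegating to its subreaders instead of touching segment state it does not have (delegating to all subreaders is one possible design choice, shown here for concreteness).

```java
// Sketch: keep the abstract base free of segment state and let each
// subclass implement locking in terms of the state it actually owns.
abstract class Reader {
    abstract void acquireWriteLock();
}

class SegmentBackedReader extends Reader {
    // Only this subclass knows about directory/segment state.
    boolean locked = false;

    void acquireWriteLock() { locked = true; }
}

class CompositeReader extends Reader {
    private final Reader[] subReaders;

    CompositeReader(Reader[] subReaders) { this.subReaders = subReaders; }

    // Delegate to every subreader rather than dereferencing segment
    // state that a composite reader simply does not have (the NPE).
    void acquireWriteLock() {
        for (Reader r : subReaders) r.acquireWriteLock();
    }
}

public class ReaderSketch {
    public static void main(String[] args) {
        SegmentBackedReader a = new SegmentBackedReader();
        SegmentBackedReader b = new SegmentBackedReader();
        new CompositeReader(new Reader[]{a, b}).acquireWriteLock();
        System.out.println(a.locked && b.locked); // true: both delegated
    }
}
```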

